Wikipedia:Bots/Requests for approval/ScannerBot: Difference between revisions
0xDeadbeef (talk | contribs) →Discussion: Reply |
Line 69: | Line 69: | ||
::::::But dots still need escaping?  <span style="font-variant:small-caps; whitespace:nowrap;">[[User:Headbomb|Headbomb]] {[[User talk:Headbomb|t]] · [[Special:Contributions/Headbomb|c]] · [[WP:PHYS|p]] · [[WP:WBOOKS|b]]}</span> 01:56, 17 May 2022 (UTC) |
:::::::Yes because {{code|.}} and {{code|\.}} have different meanings in regex. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 02:30, 17 May 2022 (UTC) |
::::::::I know. Just surprised one needs escaping and the other doesn't. Not important, if the code works, it works.  <span style="font-variant:small-caps; whitespace:nowrap;">[[User:Headbomb|Headbomb]] {[[User talk:Headbomb|t]] · [[Special:Contributions/Headbomb|c]] · [[WP:PHYS|p]] · [[WP:WBOOKS|b]]}</span> 10:24, 17 May 2022 (UTC) |
: You'll want to detect primary URLs, or skip archive URLs, changing those will break them. Archive URLs can be 20+ types, it's probably easiest to detect if the twitter URL starts with "/" (example in [[Brandon Clarke]]). -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 16:15, 14 May 2022 (UTC) |
::Yeah, I should probably match {{code|[^/]}} or <code><nowiki>[\s=>]</nowiki></code> for it to be primary. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 02:07, 15 May 2022 (UTC) |
Revision as of 10:24, 17 May 2022
New to bots on Wikipedia? Read these primers!
- Approval process – How this discussion works
- Overview/Policy – What bots are/What they can (or can't) do
- Dictionary – Explains bot-related jargon
Operator: 0xDeadbeef (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 01:48, Thursday, May 5, 2022 (UTC)
Function overview: Removes tracker tags in Twitter links.
Automatic, Supervised, or Manual: Automatic
Programming language(s): Python
Source code available: gist
Links to relevant discussions (where appropriate):
Edit period(s): One time run
Estimated number of pages affected: Probably 10000+
Namespace(s): Mainspace
Exclusion compliant (Yes/No): Yes
Function details: Finds twitter.com URLs and removes the query parameters named s or t.
Discussion
Comments before change
Comment: if a bot account is needed, I will probably use ScannerBot. 0xDEADBEEF (T C) 01:51, 5 May 2022 (UTC)
- Note: The functionality and the scope of the bot was made more specific. See page history for more details. 0xDeadbeef (T C) 06:28, 14 May 2022 (UTC)
- Regex? Primefac (talk) 15:13, 14 May 2022 (UTC)
  - @Primefac: You can look at the gist I linked. https://twitter\.com/\w+/status/\d+\?[^\s}<|]+ is used to match the URL, and then urllib is used to parse, and then remove the parameters. 0xDeadbeef (T C) 15:19, 14 May 2022 (UTC)
    - You'll likely want https:\/\/twitter\.com\/\w+\/status\/\d+\?[^\s}<|]+ for regex, to escape the / characters. (Same for below). Headbomb {t · c · p · b} 01:13, 17 May 2022 (UTC)
      - I embedded the regex as a Python raw string which does not need to escape forward slashes. 0xDeadbeef (T C) 01:17, 17 May 2022 (UTC)
        - But dots still need escaping? Headbomb {t · c · p · b} 01:56, 17 May 2022 (UTC)
          - Yes because . and \. have different meanings in regex. 0xDeadbeef (T C) 02:30, 17 May 2022 (UTC)
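The match-then-parse approach described in this thread can be sketched as follows. This is a minimal illustration, not the code from the linked gist: the pattern and the s/t parameter names come from the discussion, while the function names strip_tracking and clean_wikitext are hypothetical. Because the pattern is a Python raw string, the forward slashes are left unescaped, while the dot in twitter\.com stays escaped so it matches only a literal dot.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Pattern quoted in the discussion, written as a raw string.
TWEET_URL = re.compile(r"https://twitter\.com/\w+/status/\d+\?[^\s}<|]+")

# Parameter names taken from the function details (assumed to be exhaustive).
TRACKING_PARAMS = {"s", "t"}


def strip_tracking(url: str) -> str:
    """Return the URL with the s and t query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    # urlunsplit omits the "?" entirely when no parameters remain.
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean_wikitext(text: str) -> str:
    """Rewrite every matching tweet URL found in a block of wikitext."""
    return TWEET_URL.sub(lambda m: strip_tracking(m.group(0)), text)
```

One caveat of this sketch: urlencode re-encodes the parameters it keeps, so a URL with unusual escaping in its remaining parameters may come back in a slightly different but equivalent form.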
- You'll want to detect primary URLs, or skip archive URLs, changing those will break them. Archive URLs can be 20+ types, it's probably easiest to detect if the twitter URL starts with "/" (example in Brandon Clarke). -- GreenC 16:15, 14 May 2022 (UTC)
  - Yeah, I should probably match [^/] or [\s=>] for it to be primary. 0xDeadbeef (T C) 02:07, 15 May 2022 (UTC)
    - Great, thanks. Also WebCite like https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com .. a couple of others use ?url= vs. "/" as the break point. -- GreenC 03:12, 15 May 2022 (UTC)
      - @GreenC: Hmm, then it would be hard to distinguish a template parameter from a URL parameter in a URL... {{Foo|1=https://twitter.com}} vs. https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com 0xDeadbeef (T C) 04:03, 15 May 2022 (UTC)
        - Right, I can't say what the regex would be. One method is to match every string "/https?://twitter" and convert it to "__hidestring__" (same with "?url=") - and when done, convert those hidden strings back before saving the article. The "__hidestring__" might be "__hidestring-fs-http__" or "__hidestring-fs-https__" so you know how to revert back. Or really best, save the literal string in a table and use the table identifier as the hidden string so it can be restored. That way it can match on "/https?://(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\\-]*[a-zA-Z0-9])[.])*twitter" which will capture all hostname(s) such as "/http://beta.twitter" -- GreenC 17:33, 15 May 2022 (UTC)
          - Okay, I used a negative lookbehind and you can look at the tests here: https://regexr.com/6lmgl 0xDeadbeef (T C) 23:18, 15 May 2022 (UTC)
          - (?<!\?url=|/|cache:)https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+ 0xDeadbeef (T C) 04:25, 16 May 2022 (UTC)
            - Nice. There is also, very rarely, protocol-relative (WP:PRURL), e.g. {{cite web |url=//twitter.com}}. They are so uncommon and can be tricky that it would probably be OK to skip or log them if they don't fit the regex. -- GreenC 05:21, 16 May 2022 (UTC)
              - A quick search seems to show that it is fine. I've fixed all three that appeared from that search. 0xDeadbeef (T C) 06:52, 16 May 2022 (UTC)
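The negative-lookbehind idea from this thread can be tried out in Python with the sketch below. It is an illustration under stated assumptions, not the bot's actual code: the sample strings and the is_primary helper are hypothetical. Python's built-in re module only accepts fixed-width lookbehinds, so the alternation (?<!\?url=|/|cache:) quoted above is split here into three consecutive lookbehinds, which reject the same prefixes.

```python
import re

# Same URL pattern as in the discussion; the single variable-width
# lookbehind is written as three fixed-width ones for Python's re module.
PRIMARY_TWEET_URL = re.compile(
    r"(?<!\?url=)(?<!/)(?<!cache:)"
    r"https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+"
)


def is_primary(text: str) -> bool:
    """True if the text contains a tweet URL that is not embedded in another URL."""
    return PRIMARY_TWEET_URL.search(text) is not None


# Hypothetical sample inputs: two primary URLs and three embedded ones.
samples = [
    "{{cite web |url=https://twitter.com/u/status/1?s=20}}",
    "{{Foo|1=https://twitter.com/u/status/1?s=20}}",
    "https://web.archive.org/web/2022/https://twitter.com/u/status/1?s=20",
    "https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com/u/status/1?s=20",
    "cache:https://twitter.com/u/status/1?s=20",
]

for sample in samples:
    print(is_primary(sample), sample)
```

A template parameter such as 1= or url= still matches, because only the exact prefixes ?url=, / and cache: are excluded. Protocol-relative links (//twitter.com) are not matched at all, since the pattern requires the https: scheme.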