Wikipedia:Bots/Requests for approval/ScannerBot: Difference between revisions

Note: This bot appears to have edited since this BRFA was filed. Bots may not edit outside their own or their operator's userspace unless approved or approved for trial. AnomieBOT ⚡ 10:53, 5 May 2022 (UTC) - AnomieBOT (talk • contribs) has made few or no other edits outside this topic.
Note: This bot has edited its own BRFA page. Bot policy states that the bot account is only for edits on approved tasks or trials approved by BAG; the operator must log into their normal account to make any non-bot edits. AnomieBOT⚡ 11:40, 5 May 2022 (UTC)
- 0xDEADBEEF (T C) 11:43, 5 May 2022 (UTC)
I'm not entirely sure how much I want to be commenting with my BAG hat on, but based on previous tasks that were approved I am not convinced that as a bot task this is fully formed yet. Based on the supposed list of URLs where this tracking is located, the scanner isn't working right either, because there are a few false positives that I know exist out there that are not on the list. If 0xDeadbeef wants to use JWB on their main account they are welcome to and do not require BAG approval. On that note, though, I have moved this BRFA to the bot's page to make it officially a BRFA. Primefac (talk) 14:41, 7 May 2022 (UTC)
And, on a minor note, this has prompted me to run Task 17 again... Primefac (talk) 14:49, 7 May 2022 (UTC)

I didn't have a method for determining whether they are actually parameters of a URL. I tested with a Python script that just matched on keywords within the source. I didn't know that there were previous tasks. I will take a look at those and perhaps amend the regex to match more parameters. 0xDEADBEEF (T C) 02:30, 8 May 2022 (UTC)
\??(?:&?(?:fbclid|yclid|tracking_referrer|referrer(?:_access_token)?|gs_l|dclid|_ga|_gl|fb_(?:source|ref)|ref_)=[^&\s\]\|]*?)+(?=<|}|]|\s|\|)|(?<=\?)(?:&?(?:fbclid|yclid|tracking_referrer|referrer(?:_access_token)?|gs_l|dclid|_ga|_gl|fb_(?:source|ref)|ref_)=[^&\s\]\|]*)+&|(?<=&)(?:&?(?:fbclid|yclid|tracking_referrer|referrer(?:_access_token)?|gs_l|dclid|_ga|_gl|fb_(?:source|ref)|ref_)=[^&\s\]\|]*)+& 0xDEADBEEF (T C) 02:40, 8 May 2022 (UTC)

"Based on the supposed list of URLs where this tracking is located, the scanner isn't working right either": For the record, I didn't know that CirrusSearch allowed regex searching, so I used pywikibot. Now I will probably use insource:/.../ to generate the list of articles to fix with JWB. 0xDEADBEEF (T C) 04:06, 8 May 2022 (UTC)

Note: The functionality and the scope of the bot were made more specific. See page history for more details. 0xDeadbeef (T C) 06:28, 14 May 2022 (UTC)

Regex? Primefac (talk) 15:13, 14 May 2022 (UTC)

@Primefac: You can look at the gist I linked. https://twitter\.com/\w+/status/\d+\?[^\s}<|]+ is used to match the URL; urllib is then used to parse it and remove the parameters. 0xDeadbeef (T C) 15:19, 14 May 2022 (UTC)
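
For readers without access to the linked gist, the match-then-parse approach described above might look roughly like the following. This is only a sketch, not the bot's actual code: the helper names `strip_query` and `clean_wikitext` are hypothetical, and it assumes that every query parameter on a matched tweet URL can be dropped.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Pattern quoted in the comment above, embedded as a Python raw string.
TWEET_URL = re.compile(r"https://twitter\.com/\w+/status/\d+\?[^\s}<|]+")


def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment (assumed removable)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def clean_wikitext(text: str) -> str:
    """Rewrite every matched tweet URL in the wikitext to its parameter-free form."""
    return TWEET_URL.sub(lambda m: strip_query(m.group(0)), text)


sample = "{{cite web |url=https://twitter.com/example/status/12345?s=20&t=abc |title=Tweet}}"
print(clean_wikitext(sample))
```

Parsing with `urllib.parse` rather than editing the query with a second regex keeps the scheme, host, and path untouched.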

You'll likely want https:\/\/twitter\.com\/\w+\/status\/\d+\?[^\s}<|]+ for regex, to escape the / characters. (Same for below). Headbomb {t · c · p · b} 01:13, 17 May 2022 (UTC)

I embedded the regex as a Python raw string, which does not need to escape forward slashes. 0xDeadbeef (T C) 01:17, 17 May 2022 (UTC)

But dots still need escaping? Headbomb {t · c · p · b} 01:56, 17 May 2022 (UTC)

Yes, because . and \. have different meanings in regex. 0xDeadbeef (T C) 02:30, 17 May 2022 (UTC)

I know. Just surprised one needs escaping and the other doesn't. Not important, if the code works, it works. Headbomb {t · c · p · b} 10:24, 17 May 2022 (UTC)
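
The difference discussed in this exchange can be checked directly in Python. The snippet below is illustrative only; the sample URL and the lookalike host are made up. In Python's `re`, `/` is not a special character, so escaping it is harmless but unnecessary, whereas an unescaped `.` matches any character.

```python
import re

url = "https://twitter.com/example/status/12345"
lookalike = "https://twitterXcom/example/status/12345"  # made-up host for illustration

escaped_slashes = re.compile(r"https:\/\/twitter\.com\/\w+\/status\/\d+")
plain_slashes = re.compile(r"https://twitter\.com/\w+/status/\d+")
unescaped_dot = re.compile(r"https://twitter.com/\w+/status/\d+")

# Escaped and unescaped slashes behave identically.
print(bool(escaped_slashes.fullmatch(url)), bool(plain_slashes.fullmatch(url)))

# The unescaped dot also accepts the lookalike host; the escaped dot does not.
print(bool(unescaped_dot.fullmatch(lookalike)), bool(plain_slashes.fullmatch(lookalike)))
```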

You'll want to detect primary URLs, or skip archive URLs; changing those will break them. Archive URLs can be 20+ types, so it's probably easiest to detect whether the twitter URL is preceded by "/" (example in Brandon Clarke). -- GreenC 16:15, 14 May 2022 (UTC)

Yeah, I should probably match [^/] or [\s=>] for it to be primary. 0xDeadbeef (T C) 02:07, 15 May 2022 (UTC)

Great, thanks. Also WebCite, like https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com; a couple of others use ?url= instead of "/" as the break point. -- GreenC 03:12, 15 May 2022 (UTC)

@GreenC: Hmm, then it would be hard to distinguish a template parameter from a URL parameter inside a URL:

{{Foo|1=https://twitter.com}}

https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com 0xDeadbeef (T C) 04:03, 15 May 2022 (UTC)

Right, I can't say what the regex would be. One method is to match every string "/https?://twitter" and convert it to "__hidestring__" (same with "?url="), then convert the hidden strings back before saving the article. The "__hidestring__" might be "__hidestring-fs-http__" or "__hidestring-fs-https__" so you know how to revert it. Or, really best, save the literal string in a table and use the table identifier as the hidden string so it can be restored. That way it can match on "/https?://(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\\-]*[a-zA-Z0-9])[.])*twitter", which will capture all hostnames such as "/http://beta.twitter". -- GreenC 17:33, 15 May 2022 (UTC)
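
The hide-and-restore idea with a lookup table could be sketched as follows. This is a hedged illustration, not code from the discussion: the names `EMBEDDED`, `TRACKED`, `mask`, `unmask`, and `clean` are hypothetical, the hostname pattern is simplified from the one quoted above, and the cleaning step assumes the whole query string is dropped.

```python
import re

# Assumption: an embedded copy of a tweet URL follows "/" or "?url=" in an archive URL.
EMBEDDED = re.compile(r"(?:/|\?url=)https?://(?:[\w-]+\.)*twitter\.com[^\s}<|\]]*")
PLACEHOLDER = re.compile(r"__hidestring-(\d+)__")
# Assumption: the cleaning step drops everything after the status ID.
TRACKED = re.compile(r"(https://twitter\.com/\w+/status/\d+)\?[^\s}<|]+")


def mask(text):
    """Replace each embedded URL with a placeholder; return the text and the table."""
    table = []

    def hide(match):
        table.append(match.group(0))
        return f"__hidestring-{len(table) - 1}__"

    return EMBEDDED.sub(hide, text), table


def unmask(text, table):
    """Restore the literal strings saved in the table."""
    return PLACEHOLDER.sub(lambda m: table[int(m.group(1))], text)


def clean(text):
    masked, table = mask(text)
    masked = TRACKED.sub(r"\1", masked)  # only primary URLs are still visible here
    return unmask(masked, table)


archived = "https://web.archive.org/web/2020/https://twitter.com/a/status/1?s=20"
primary = "https://twitter.com/a/status/1?s=20"
print(clean(f"{archived} {primary}"))
```

Indexing placeholders by table position means any literal string can be restored exactly, whichever scheme or hostname it used.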

Okay, I used a negative lookbehind; the tests are at https://regexr.com/6lmgl 0xDeadbeef (T C) 23:18, 15 May 2022 (UTC)

(?<!\?url=|/|cache:)https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+ 0xDeadbeef (T C) 04:25, 16 May 2022 (UTC)
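
One caveat if this pattern is used from Python: the standard `re` module requires look-behind patterns to have a fixed width, so a single look-behind whose alternatives differ in length (as here) may be rejected at compile time. A possible workaround, sketched below with made-up sample strings, is to stack one negative look-behind per prefix, which should exclude the same cases.

```python
import re

# One look-behind per excluded prefix; each has a fixed width on its own.
PRIMARY_TWEET = re.compile(
    r"(?<!\?url=)(?<!/)(?<!cache:)"
    r"https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+"
)

samples = [
    "{{cite web |url=https://twitter.com/a/status/1?s=20}}",          # primary URL
    "https://web.archive.org/web/2020/https://twitter.com/a/status/1?s=20",
    "https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com/a/status/1?s=20",
    "cache:https://twitter.com/a/status/1?s=20",
]

for text in samples:
    found = PRIMARY_TWEET.search(text)
    print(found.group(0) if found else None)
```

Only the first sample, the primary URL inside the citation template, should produce a match.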

Nice. There is also, very rarely, a protocol-relative URL (WP:PRURL), e.g. {{cite web |url=//twitter.com}}. They are so uncommon, and can be tricky, that it would probably be OK to skip or log them if they don't fit the regex. -- GreenC 05:21, 16 May 2022 (UTC)

A quick search seems to show that it is fine. I've fixed all three that appeared in that search. 0xDeadbeef (T C) 06:52, 16 May 2022 (UTC)
