Wikipedia:Bots/Requests for approval/Uncle G's major work 'bot
There's a possibility that Uncle G's major work 'bot (talk · contribs) might wake up to do some major work. In this case it is the mass blanking of roughly ten thousand articles, to help address Wikipedia:Contributor copyright investigations/Darius Dhlomo.
Things are currently still at the discussion stage. Here's the background reading:
- Main discussion page
- Wikipedia:Administrators' noticeboard/Incidents/CCI
- List of articles to be touched
- As listed in earlier revisions of Wikipedia:Contributor copyright investigations/Darius Dhlomo and Wikipedia:Contributor copyright investigations/Darius Dhlomo 2, the list as supplied by VernoWhitney (talk · contribs). The current list is the articles touched by Darius Dhlomo. The original list was the articles created by xem.
- Notice that the 'bot will blank each article with
- Wikipedia:Contributor copyright investigations/Darius Dhlomo/Notice
- Full task explanation, linked to by the 'bot's edit summaries
- Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation (piped to something like "What this 'bot is doing.")
- Further information for editors, linked to from the blanking notice
- Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help
Points discussed and to be discussed:
- I'm in favour of the template being outside of Template: namespace and in the project namespace, so that Wikipedia mirrors don't mirror the notice. But there are arguments to the contrary. Please discuss at the main discussion page.
- If this goes ahead, I'm going to be using the same rate limits and whatnot that I used when moving VFD to AFD.
- The 'bot doesn't have the flag. One could argue that ten thousand or so article blankings will light up a lot of watchlists. But drawing people's attention to a copyright problem with their watched articles is partly the desired outcome.
As I said, things are at the discussion stage. But with this sort of major work I want many people to be forewarned about this. There are currently big unsubtle notices on the Village Pump, the Content Noticeboard, the Administrators' Noticeboard, the 'Bot owners' Noticeboard, and the Centralized Discussion template. Feel free to notify anyone else that you think this misses.
I've tested the 'bot on Ted Morgan (boxer) (which was an uncontroversial article to test on, since it is a definite copyright violation of this biography). You can see the edit here. That's what's going to happen, and that's what it's going to look like. I might tweak the edit summary text a bit.
The updated 'bot, with a revised edit summary, and with some additional capabilities, has made two more test edits. Uncle G (talk) 14:35, 11 September 2010 (UTC)
Operator: Uncle G (talk) 16:11, 8 September 2010 (UTC)
Discussion
If you have an opinion on the task, or a better way to do it, please contribute to Wikipedia:Administrators' noticeboard/Incidents/CCI. That's where everyone else is having the discussion. They won't be paying much attention here. ☺ Uncle G (talk) 16:11, 8 September 2010 (UTC)
I have a few unrelated technical questions:
- Is this going to be done with code successfully used in the past, or is this new code? If the latter, I'd want to throw a few test pages at it just to double check that things won't blow up.
- What exactly are the proposed rate limits? If your bot can handle maxlag, I could certainly support a WP:IAR of the limits in WP:BOTPOL in favor of "as fast as maxlag allows" for this particular task. If you would want to do that, it should of course be discussed at Wikipedia:Administrators' noticeboard/Incidents/CCI.
- Note that even if the bot has the bot flag, it is now possible for the bot to not flag its individual edits as bot. Even when not applied to edits, the bot flag still gives some advantages to the bot account that may be useful. Can your bot do this? If you want to test it, you can ask for the flag at WP:BN and do some edits in a userspace sandbox.
Anomie⚔ 17:22, 8 September 2010 (UTC)
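The maxlag and bot-flag points above can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the bot's actual code: it uses the MediaWiki action API parameter names (`maxlag`, `bot`) and error code (`maxlag`), with tokens and networking elided.

```python
def build_edit_params(title, text, summary, as_bot=True, maxlag=5):
    """Build api.php edit parameters.

    Setting maxlag asks the server to refuse the request when replication
    lag exceeds the threshold; omitting bot=1 leaves the edit unflagged
    even when the account itself holds the bot flag.
    """
    params = {
        "action": "edit",
        "title": title,
        "text": text,
        "summary": summary,
        "maxlag": str(maxlag),
        "format": "json",
    }
    if as_bot:
        params["bot"] = "1"
    return params

def should_retry(api_response):
    """A maxlag refusal comes back as error code 'maxlag'; back off and retry."""
    return api_response.get("error", {}).get("code") == "maxlag"
```

The point of the `as_bot` switch is that flagging is now per-edit: a flagged account can still make watchlist-visible edits by leaving `bot=1` out.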
Heh! This is code so old that it predates both api.php and maxlag. (Successful past use includes raking various sandboxes, archiving my talk page, and of course moving that VFD mountain.) Right at the moment I'm working on checking it through and updating it to use api.php where appropriate (and where necessary — index.php functionality has changed since I last used some of the programs.). If you think that I wasn't going to throw a few test pages at it before throwing ten thousand pages at it live … ☺
Since the 'bot tools predate it, there's no maxlag. (There's no automated retry logic at all. If an edit fails, it fails.) My very simplistic approach to rate limiting was a hardwired delay of a fixed number of seconds between each operation. If you go back to 2005 in the contributions history, you'll see the delay in operation fairly clearly.
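The "hardwired delay, no retry" approach described above amounts to something like the following sketch (the function and parameter names are illustrative; the actual tooling is not public):

```python
import time

def run_throttled(tasks, edit, delay_seconds=30):
    """Run edits with a fixed hardwired delay between operations.

    There is deliberately no retry logic: if an edit fails, it fails,
    and the task is simply recorded so a human can follow up.
    """
    failures = []
    for i, task in enumerate(tasks):
        if i:  # no delay before the very first operation
            time.sleep(delay_seconds)
        try:
            edit(task)
        except Exception:
            failures.append(task)
    return failures
```

The fixed delay shows up plainly in the contributions history as evenly spaced timestamps, which is exactly what the 2005 VFD-to-AFD run looked like.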
As for the flag: That's a discussion for other people, really. It doesn't affect the operation of any of the things that this 'bot will be doing. There are no queries involved, for instance. (I didn't even have a query-making tool until just recently, when I wrote one to perform an external link query for the GeoCities cleanup discussion.) Uncle G (talk) 22:31, 8 September 2010 (UTC)
- Let's try some more interesting tests. Please feed it the list at [1]. The idea, of course, is to see whether the bot will blindly follow or blank redirects or recreate deleted pages should some of the later entries in the real list have been messed with by uninvolved humans by the time the bot gets to them. Anomie⚔ 12:18, 9 September 2010 (UTC)
- Redirects aren't really going to be an issue in the first place. The article list to be used contains only real articles with content. I'll try the second test, though. The 'bot's been told to pass the "nocreate" parameter, so I'm interested in seeing whether that actually works. Uncle G (talk) 12:32, 9 September 2010 (UTC)
- There's interesting! It doesn't. I wonder why. It's doing what's in the doco. I'm going to have a play. Uncle G (talk) 12:44, 9 September 2010 (UTC)
- Found the problem. The doco is wrong. The 'bot is now successfully unable to create User:Anomie/Sandbox10. Uncle G (talk) 12:57, 9 September 2010 (UTC)
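For readers unfamiliar with the parameter being tested above: a sketch of what a blanking edit with `nocreate` looks like, assuming the standard action API names (`nocreate`, error code `missingtitle`); the helper names are hypothetical.

```python
def build_blanking_edit(title, notice_text, summary):
    """Edit parameters using nocreate: if the page has been deleted by
    the time the bot reaches it, the API refuses to recreate it rather
    than silently restoring a deleted article."""
    return {
        "action": "edit",
        "title": title,
        "text": notice_text,
        "summary": summary,
        "nocreate": "1",
        "format": "json",
    }

def page_was_deleted(api_response):
    """With nocreate set, editing a missing page fails with 'missingtitle'."""
    return api_response.get("error", {}).get("code") == "missingtitle"
```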
- Redirects could be an issue if someone comes along and redirects one of the copyvio articles after the list is finalized and handed to the bot. Anomie⚔ 14:48, 9 September 2010 (UTC)
- Not really. The 'bot won't follow the redirect, and it's the edit history in the original place that would have the copyright violation, if any, in it. Furthermore, if an article is redirected it might have been merged. Investigation of whether that has happened is definitely warranted in such a situation. Triggering humans to perform such investigation is after all the purpose here.
The only other situation would be a page move. This is highly unlikely, given both the topics of the articles concerned and the recent creation of the list. As I recall, there's an edit filter that catches redirects turning into articles. (Checks. It's edit filter #342.) So any tagging of renamed articles, in the highly unlikely event of that happening, should be easily spottable by just looking at which of the 'bot's edits triggered that edit filter. Uncle G (talk) 15:12, 9 September 2010 (UTC)
- Due to the huge volume of articles that this will leave to be rescued, effective sorting will be vital, both to avoid overwhelming the rescuers and to give them some sort of priority list to work from. To help with this, can you please NOT delete the categories? They are not copyvios, and there is not (AFAIK) any debate over whether the articles are correctly categorised, so leaving the categories shouldn't be a problem, unless the bot would find it hard to do. Interwiki links are another possible thing to leave behind, although they can of course still be retrieved from the history, and having live interwikis isn't useful from a sorting point of view. The-Pope (talk) 03:25, 11 September 2010 (UTC)
- Seconded, I was about to suggest the same thing, after someone else brought it up on the CCI page. Interwiki should also be preserved. Added: interwiki really should be preserved, since if it's not, various interwiki bots will try to restore it and may also clobber interwiki info on other wikis as the enwiki info is removed. 75.62.3.153 (talk) 04:32, 11 September 2010 (UTC)
- See the main task discussion page, where this was originally discussed at greater length. Uncle G (talk) 12:53, 11 September 2010 (UTC)
- The proposed bot is making such a substantive change to Wikipedia that it would be nice if the bot's name indicated its function. Any reason a new account can't be created? One of the stated goals of all this is to draw editors' attention to the need for review, and "uncle g's major work 'bot" sounds tongue-in-cheek and easy to ignore (as I tend to ignore "cute botname x" quite regularly). A more formal name would make these rather drastic changes appear more "official" (consensus and all that). I would not ignore "2010 Mass Reversion of Copyright Infringements", but then Uncle G doesn't get the limelight (I'm not being facetious). Riggr Mortis (talk) 05:12, 11 September 2010 (UTC)
- "Uncle G's Mass Reversion of Copyright Infringements" is ok with me (shrug). The edit summary should also explain. 67.119.12.106 (talk) 06:30, 11 September 2010 (UTC)
- The edit summary does explain. See the example edit. As you can see, I've historically put names for the tasks (VFDTOAFD, IFDTOFFD, and so forth) in the 'bot's edits, and for this task it's no different. There's discussion on the main discussion page for tweaking the summary a bit more. But it does already contain a clear naming of the task, the edit, and a hyperlink to a detailed explanation. I had that right from the start. ☺ Uncle G (talk) 12:53, 11 September 2010 (UTC)
- I name my 'bots after the non-'bot account to make them unambiguously clear that they are my accounts. That is something that I'm very averse to changing. I'm strongly in favour of 'bot owners being clearly identifiable, having myself been in the position of having to work out who owns what in years gone by, and this is the way that I make it wholly clear who owns my 'bots. I refer everyone to Wikipedia:Bot policy on this, too. I am not going to remove this clear link between my 'bots and me over concerns about seeking "limelight" (not that I think you meant it that way). The policy requirements, which are in line with my own views on the matter of clearly identifiable ownership and responsibility, outweigh that.
And if anyone reading what Riggr Mortis wrote thinks that I want the attention and hassle from outraged editors who, despite the notices everywhere on noticeboards, WikiProjects, and watchlists, still find out about this after it begins, which I know is going to come from 10,000 article blankings … In my ideal world none of this would have happened and I'd be peacefully rescuing Nephew and niece (AfD discussion). (Actually, in a truly ideal world, there'd be no need for rescue.) Make no mistake: this is not the sort of attention or limelight that anyone wants to seek.
Plus, of course, I don't want to imply any pretensions to "official" status, whatever people might think that to be. This is me, running my 'bot. I'm a 'bot owner, volunteering to do a necessary task with the tools at my disposal, via an ordinary editing account. Yes, we're talking about copyright cleanup and this is a CCI case. But this isn't a Developer running a task or an Office Action. Uncle G (talk) 12:53, 11 September 2010 (UTC)
- You're entitled to the recognition for this work if you want it.
I see you've got the categories taken care of--nice! Could you do an article with interwikis? I remember Rocío Ríos had a couple.
A couple of minor category-related issues: 1) I see that some categories in the original articles were transcluded through templates, so the trick of reading the categories from the API and re-injecting them to the wikitext loses a little information. I guess that's ok under the circumstances, but should be mentioned in the instructions. 2) The API has a limit of 50 categories or interwikis per query IIRC, so in the unlikely event that one of these articles is in more than 50 categories or languages, it will take multiple queries to get them all. If that's too much implementation hassle for such an improbable occurrence, maybe you could just log a message (either in the article itself, or on the client side) identifying anyplace that happened, then handle it afterwards (or stop the bot if it happens more than a few times). That should be tested with a concocted sandbox article before the full run. 67.119.12.106 (talk) 00:04, 12 September 2010 (UTC)
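The continuation issue in point 2 can be handled with a client-side pager. A sketch under assumptions: it uses the 2010-era `query-continue`/`clcontinue` continuation shape of the action API, and takes an injected `api_get` callable instead of doing any networking itself.

```python
def fetch_all_categories(api_get, title, page_size=50):
    """Follow 'clcontinue' continuation tokens until the category list
    is exhausted, so an article in more than `page_size` categories is
    still handled completely.

    api_get(params) -> parsed JSON response (injected, no networking here).
    """
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": str(page_size),
        "format": "json",
    }
    collected = []
    while True:
        resp = api_get(dict(params))
        for page in resp.get("query", {}).get("pages", {}).values():
            collected.extend(c["title"] for c in page.get("categories", []))
        cont = resp.get("query-continue", {}).get("categories", {}).get("clcontinue")
        if cont is None:
            return collected
        params["clcontinue"] = cont
```

If looping is judged too much implementation hassle for so rare a case, the fallback suggested above — log the affected title and handle it by hand — only needs the `query-continue` check, not the loop.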