Wikipedia:Bots/Requests for approval/Uncle G's major work 'bot

Uncle G's major work 'bot

There's a possibility that Uncle G's major work 'bot (talk · contribs) might wake up to do some major work. In this case it is the mass blanking of roughly ten thousand articles, to help address Wikipedia:Contributor copyright investigations/Darius Dhlomo.

Things are currently still at the discussion stage. Here's the background reading:

Main discussion page: Wikipedia:Administrators' noticeboard/Incidents/CCI
List of articles to be touched: As listed in earlier revisions of Wikipedia:Contributor copyright investigations/Darius Dhlomo and Wikipedia:Contributor copyright investigations/Darius Dhlomo 2, the list as supplied by VernoWhitney (talk · contribs). The current list is the articles touched by Darius Dhlomo. The original list was the articles created by xem.
Notice that the 'bot will blank each article with: Wikipedia:Contributor copyright investigations/Darius Dhlomo/Notice
Full task explanation, linked to by the 'bot's edit summaries: Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation (piped to something like "What this 'bot is doing.")
Further information for editors, linked to from the blanking notice: Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help

Points discussed and to be discussed:

I'm in favour of the template being outside of Template: namespace and in the project namespace, so that Wikipedia mirrors don't mirror the notice. But there are arguments to the contrary. Please discuss at the main discussion page.
If this goes ahead, I'm going to be using the same rate limits and whatnot that I used when moving VFD to AFD.
The 'bot doesn't have the flag. One could argue that ten thousand or so article blankings will light up a lot of watchlists. But drawing people's attention to a copyright problem with their watched articles is partly the desired outcome.

As I said, things are at the discussion stage. But with this sort of major work I want many people to be forewarned about this. There are currently big unsubtle notices on the Village Pump, the Content Noticeboard, the Administrators' Noticeboard, the 'Bot owners' Noticeboard, and the Centralized Discussion template. Feel free to notify anyone else that you think this misses.

I've tested the 'bot on Ted Morgan (boxer) (which was an uncontroversial article to test on, since it is a definite copyright violation of this biography). You can see the edit here. That's what's going to happen, and that's what it's going to look like. I might tweak the edit summary text a bit.

The updated 'bot, with a revised edit summary, and with some additional capabilities, has made two more test edits. Uncle G (talk) 14:35, 11 September 2010 (UTC)

Operator: Uncle G (talk) 16:11, 8 September 2010 (UTC)

Discussion

If you have an opinion on the task, or a better way to do it, please contribute to Wikipedia:Administrators' noticeboard/Incidents/CCI. That's where everyone else is having the discussion. They won't be paying much attention here. ☺ Uncle G (talk) 16:11, 8 September 2010 (UTC)

I have a few unrelated technical questions:

Is this going to be done with code successfully used in the past, or is this new code? If the latter, I'd want to throw a few test pages at it just to double check that things won't blow up.
What exactly are the proposed rate limits? If your bot can handle maxlag, I could certainly support a WP:IAR of the limits in WP:BOTPOL in favor of "as fast as maxlag allows" for this particular task. If you would want to do that, it should of course be discussed at Wikipedia:Administrators' noticeboard/Incidents/CCI.
Note that even if the bot has the bot flag, it is now possible for the bot to not flag its individual edits as bot. Even when not applied to edits, the bot flag still gives some advantages to the bot account that may be useful. Can your bot do this? If you want to test it, you can ask for the flag at WP:BN and do some edits in a userspace sandbox.

Anomie ⚔ 17:22, 8 September 2010 (UTC)

Heh! This is code so old that it predates both api.php and maxlag. (Successful past use includes raking various sandboxes, archiving my talk page, and of course moving that VFD mountain.) Right at the moment I'm working on checking it through and updating it to use api.php where appropriate (and where necessary — index.php functionality has changed since I last used some of the programs.). If you think that I wasn't going to throw a few test pages at it before throwing ten thousand pages at it live … ☺

Since the 'bot tools predate it, there's no maxlag. (There's no automated retry logic at all. If an edit fails, it fails.) My very simplistic approach to rate limiting was a hardwired delay of a fixed number of seconds between each operation. If you go back to 2005 in the contributions history, you'll see the delay in operation fairly clearly.

As for the flag: That's a discussion for other people, really. It doesn't affect the operation of any of the things that this 'bot will be doing. There are no queries involved, for instance. (I didn't even have a query-making tool until just recently, when I wrote one to perform an external link query for the GeoCities cleanup discussion.) Uncle G (talk) 22:31, 8 September 2010 (UTC)
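
The rate limiting described above (a hardwired delay between operations, no maxlag handling, and no retry when an edit fails) can be sketched roughly as follows. This is an illustration rather than the 'bot's actual code, which is not shown on this page: the function name, the `edit` callable, and the delay value are all assumptions.

```python
import time
from typing import Callable, Iterable, List, Tuple


def run_with_fixed_delay(
    titles: Iterable[str],
    edit: Callable[[str], None],
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, bool]]:
    """Apply `edit` to each title, pausing a fixed time between operations.

    There is deliberately no retry: a failed edit is recorded and skipped.
    `edit` is a hypothetical callable standing in for whatever performs
    the real page save.
    """
    results: List[Tuple[str, bool]] = []
    for index, title in enumerate(titles):
        if index > 0:
            # Hardwired pause before every operation after the first.
            sleep(delay_seconds)
        try:
            edit(title)
            results.append((title, True))
        except Exception:
            # "If an edit fails, it fails": note it and move on.
            results.append((title, False))
    return results
```

Recording failures instead of retrying keeps the loop simple, and the failed titles can be reviewed by hand afterwards. A maxlag-aware client would instead pass the `maxlag` parameter and back off when the server reports replication lag.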

Let's try some more interesting tests. Please feed it the list at [1]. The idea, of course, is to see whether the bot will blindly follow or blank redirects or recreate deleted pages should some of the later entries in the real list have been messed with by uninvolved humans by the time the bot gets to them. Anomie⚔ 12:18, 9 September 2010 (UTC)
- Redirects aren't really going to be an issue in the first place. The article list to be used contains only real articles with content. I'll try the second test, though. The 'bot's been told to pass the "nocreate" parameter, so I'm interested in seeing whether that actually works. Uncle G (talk) 12:32, 9 September 2010 (UTC)
  - There's interesting! It doesn't. I wonder why. It's doing what's in the doco. I'm going to have a play. Uncle G (talk) 12:44, 9 September 2010 (UTC)
  - Found the problem. The doco is wrong. The 'bot is now successfully unable to create User:Anomie/Sandbox10. Uncle G (talk) 12:57, 9 September 2010 (UTC)
    - Redirects could be an issue if someone comes along and redirects one of the copyvio articles after the list is finalized and handed to the bot. Anomie⚔ 14:48, 9 September 2010 (UTC)
      - Not really. The 'bot won't follow the redirect, and it's the edit history in the original place that would have the copyright violation, if any, in it. Furthermore, if an article is redirected it might have been merged. Investigation of whether that has happened is definitely warranted in such a situation. Triggering humans to perform such investigation is after all the purpose here.
        The only other situation would be a page move. This is highly unlikely, given both the topics of the articles concerned and the recent creation of the list. As I recall, there's an edit filter that catches redirects turning into articles. (Checks. It's edit filter #342.) So any tagging of renamed articles, in the highly unlikely event of that happening, should be easily spottable by just looking which of the 'bot's edits triggered that edit filter. Uncle G (talk) 15:12, 9 September 2010 (UTC)
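
For readers unfamiliar with the parameter being tested in this thread, here is a rough sketch of an api.php edit request that refuses to create a missing page. The `nocreate` parameter of `action=edit` is the one the thread names; everything else (the function name, the summary text, passing the flag as `"1"`, and the placeholder token) is an assumption for illustration. The thread does not say what the documentation error was or how the 'bot was changed, so this is not a reconstruction of that fix.

```python
from typing import Dict


def build_blanking_request(
    title: str, notice_text: str, summary: str, token: str
) -> Dict[str, str]:
    """Build POST parameters for replacing a page's text without creating it.

    Hypothetical helper: it only assembles parameters and sends nothing.
    """
    return {
        "action": "edit",
        "format": "json",
        "title": title,       # the listed title itself
        "text": notice_text,  # replacement wikitext (the blanking notice)
        "summary": summary,
        "nocreate": "1",      # assumed encoding; the edit should fail if the page is missing
        "token": token,       # edit token, obtained separately
    }
```

The request does not ask the API to resolve redirects, so under this sketch the edit would apply to the listed title rather than to a redirect target, which matches the behaviour described in the replies above.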
Due to the huge volume of articles that this will leave to be rescued, effective sorting will be vital to avoid overwhelming the rescuers and to give them some sort of priority list to work from. To help in this, can you please NOT delete the categories. They are not copyvios, there is not (AFAIK) any debate over whether the articles are correctly categorised, so leaving the cats shouldn't be a problem, unless the bot would find it hard to do? Interwiki links is another possible thing to leave behind, although they can still of course be retrieved from the history, and having live interwikis isn't useful from a sorting point of view. The-Pope (talk) 03:25, 11 September 2010 (UTC)
- Seconded, I was about to suggest the same thing, after someone else brought it up on the CCI page. Interwiki should also be preserved. Added: interwiki really should be preserved, since if it's not, various interwiki bots will both try to restore it and/or also potentially clobber interwiki info on other wikis as the enwiki info is removed. 75.62.3.153 (talk) 04:32, 11 September 2010 (UTC)
  - See the main task discussion page, where this was originally discussed at greater length. Uncle G (talk) 12:53, 11 September 2010 (UTC)
The proposed bot is making such a substantive change to Wikipedia that it would be nice if the bot's name indicated its function. Any reason a new account can't be created? One of the stated goals of all this is to draw editors' attention to the need for review, and "uncle g's major work 'bot" sounds tongue-in-cheek and easy to ignore (as I tend to ignore "cute botname x" quite regularly). A more formal name would make these rather drastic changes appear more "official" (consensus and all that). I would not ignore "2010 Mass Reversion of Copyright Infringements", but then Uncle G doesn't get the limelight (I'm not being facetious). Riggr Mortis (talk) 05:12, 11 September 2010 (UTC)
- "Uncle G's Mass Reversion of Copyright Infringements" is ok with me (shrug). The edit summary should also explain. 67.119.12.106 (talk) 06:30, 11 September 2010 (UTC)
  - The edit summary does explain. See the example edit. As you can see, I've historically put names for the tasks (VFDTOAFD, IFDTOFFD, and so forth) in the 'bot's edits, and for this task it's no different. There's discussion on the main discussion page for tweaking the summary a bit more. But it does already contain a clear naming of the task, the edit, and a hyperlink to a detailed explanation. I had that right from the start. ☺ Uncle G (talk) 12:53, 11 September 2010 (UTC)
- I name my 'bots after the non-'bot account to make it unambiguously clear that they are my accounts. That is something that I'm very averse to changing. I'm strongly in favour of 'bot owners being clearly identifiable, having myself been in the position of having to work out who owns what in years gone by, and this is the way that I make it wholly clear who owns my 'bots. I refer everyone to Wikipedia:Bot policy on this, too. I am not going to remove this clear link between my 'bots and me over concerns about seeking "limelight" (not that I think you meant it that way). The policy requirements, which are in line with my own views on the matter of clearly identifiable ownership and responsibility, outweigh that.
  And if anyone reading what Riggr Mortis wrote thinks that I want the attention and hassle from outraged editors who, despite the notices everywhere on noticeboards, WikiProjects, and watchlists, still find out about this after it begins, that I know is going to come from 10,000 article blankings … In my ideal world none of this would have happened and I'd be peacefully rescuing Nephew and niece (AfD discussion). (Actually, in a truly ideal world, there'd be no need for rescue.) Make no mistake: this is not the sort of attention or limelight that anyone wants to seek.
  Plus, of course, I don't want to imply any pretensions to "official" status, whatever people might think that to be. This is me, running my 'bot. I'm a 'bot owner, volunteering to do a necessary task with the tools at my disposal, via an ordinary editing account. Yes, we're talking about copyright cleanup and this is a CCI case. But this isn't a Developer running a task or an Office Action. Uncle G (talk) 12:53, 11 September 2010 (UTC)
  - No, I didn't mean to imply that I think you're looking for attention. No offense intended. I tried to indicate this by saying "I'm not being facetious", but that was a poor phrase and probably a complete misuse of the word. In other words, unclear... and I like clarity, which is why I still don't like the bot name, but whatever. (I maintain that this task is large and unique enough to warrant extreme clarity: it makes major edits to articles, the very opposite of most bots. Re clarifying who is doing the work, I think the user page of the bot is the best place. Of course you're entitled to recognition for the work, per the IP below.) My biannual wikipedia opinion has been filed; cheers. Riggr Mortis (talk) 03:10, 12 September 2010 (UTC)
  - You're entitled to the recognition for this work if you want it.
    I see you've got the categories taken care of--nice! Could you do an article with interwikis? I remember Rocío Ríos had a couple.
    A couple of minor category-related issues: 1) I see that some categories in the original articles were transcluded through templates, so the trick of reading the categories from the API and re-injecting them to the wikitext loses a little information. I guess that's ok under the circumstances, but should be mentioned in the instructions. 2) The API has a limit of 50 categories or interwikis per query IIRC, so in the unlikely event that one of these articles is in more than 50 categories or languages, it will take multiple queries to get them all. If that's too much implementation hassle for such an improbable occurrence, maybe you could just log a message (either in the article itself, or on the client side) identifying anytime the category or interwiki API returns the continuation parameter saying more items are available, then handle it afterwards (or stop the bot if it happens more than a few times). That should be tested with a concocted sandbox article before a larger run.
    IMO there should be an initial run of a few hundred articles followed by a few days to let the CCI examiners give feedback, before turning it loose full scale. That includes asking wikiproject members interested in particular categories to see if the existing category intersection tools (wp:CATSCAN is one and I think there are others) meet their requirements. 67.119.12.106 (talk) 00:04, 12 September 2010 (UTC)
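
The continuation concern raised above could be handled along these lines. This is only a sketch: `fetch_page` is a hypothetical callable standing in for one category or interwiki query, returning a batch of items plus a continuation value (or `None` when nothing remains). The real API's continuation format and per-query limits are not modelled here.

```python
from typing import Callable, List, Optional, Tuple

# One batch of results plus an opaque continuation value (None when done).
Batch = Tuple[List[str], Optional[str]]


def collect_all(
    title: str,
    fetch_page: Callable[[str, Optional[str]], Batch],
    max_requests: int = 20,
) -> Tuple[List[str], int]:
    """Gather every item for `title`, following continuation values.

    Returns the items and the number of requests made, so a caller can
    log any article that needed more than one request.
    """
    items: List[str] = []
    continuation: Optional[str] = None
    requests_made = 0
    while requests_made < max_requests:
        batch, continuation = fetch_page(title, continuation)
        requests_made += 1
        items.extend(batch)
        if continuation is None:
            return items, requests_made
    # Give up instead of looping forever on a misbehaving source.
    raise RuntimeError(f"{title}: still incomplete after {max_requests} requests")
```

A request count above one identifies exactly the improbable articles mentioned above, so they can be logged and checked by hand, or the run can be stopped if it happens more than a few times.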

The technical aspects of the bot seem good. Does anyone have a problem with my approving this bot request contingent upon final community consensus at the CCI page (if that hasn't already been reached)? Anomie ⚔ 02:28, 16 September 2010 (UTC)

There is solid consensus at CCI for the operation as a whole, IMO. There's (understandably) less attention there than here to technical details that may turn out to be significant. I guess Uncle G can deal with such issues if they arise. Main thing I'm worried about is user reaction, including possibly from off-wiki. I don't feel like we have much of a plan in place for that. But maybe I worry too much. 67.119.14.196 (talk) 07:52, 16 September 2010 (UTC)
- Actually, some (IMHO not-that-well-informed) objections are starting to come in, probably as a result of the Signpost article. But not that many so far. 67.119.14.196 (talk) 08:45, 16 September 2010 (UTC)

I only have one outstanding technical issue, and that one is mainly for my own purposes. I'm probably going to sort the list of articles into alphabetical order, rather than leave them in the order originally listed (whatever that was). This will make it easier for me to spot and recover from problems such as the 'bot failing partway through processing. The downside is that the edits won't be in the on-wiki order. But on the gripping hand, alphabetical order will probably be just as useful to any third parties cross-checking the edits and watching the contributions history, and such people can quite easily (re-)sort their copies of the list, too. ☺ I think it will be much more straightforward all around to use an order that people observing can readily work out. Uncle G (talk) 13:36, 16 September 2010 (UTC)

Just to clarify: they are listed by size of contributor, longest contribs down. This is a useful structure for CCI, as it means we prioritize the largest problems, but I can see that in this case alphabetical may work better. --Moonriddengirl (talk) 14:05, 16 September 2010 (UTC)
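
The recovery benefit of a sorted list can be illustrated with a small sketch. It assumes the last successfully edited title is known (for example, from the contributions history); the function name is hypothetical, and Python's default string ordering is only an approximation of whatever alphabetical order the 'bot would actually use.

```python
import bisect
from typing import Iterable, List, Optional


def remaining_after(titles: Iterable[str], last_completed: Optional[str]) -> List[str]:
    """Return the titles still to be processed, in sorted order.

    With a sorted, de-duplicated list, everything up to and including
    `last_completed` is done, so a restart only needs that one title.
    """
    ordered = sorted(set(titles))
    if last_completed is None:
        return ordered  # nothing processed yet
    start = bisect.bisect_right(ordered, last_completed)
    return ordered[start:]
```

For example, `remaining_after(["B", "A", "C"], "A")` gives `["B", "C"]`. With the original size-ordered list, resuming would instead require knowing each title's position in that list.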

I notice this edit by the bot didn't insert any category to indicate the CCI problem. Was that an oversight or is putting in those categories deferred til the actual run? 67.119.14.196 (talk) 21:34, 16 September 2010 (UTC)