
Wikipedia:Administrators' noticeboard/Incidents/CCI

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 75.57.241.73 (talk) at 21:59, 8 September 2010 (Implementing bot?). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Review of unblock request and discussion of possible community ban

Darius Dhlomo (talk · contribs) is a prolific editor, with over 163,000 edits since 2005. However, he is currently blocked for multiple copyright violations, and is requesting an unblock. A few different admins have looked at his case, but I believe community consensus is required, since a permanent ban may be warranted. Darius has a long history of ignoring other editors; perusing his talk page history, you'll see that many editors have tried to engage him in discussion about questionable edits over the years, but Darius has never bothered to reply, save for the occasional section or page blanking. Only now, facing an indefinite block, does he appear to show the slightest bit of understanding or remorse over his actions, even though warnings have been given to him for several years. While much of his work is commendable, he does appear to want to work in a vacuum, ignoring not just policy but also conventions and consensus that he doesn't agree with. (Note that only about 1 out of every 900 edits he has ever made has been to a talk page!) I have never believed that "vested editors" should be given more leeway than anyone else.

I seek feedback from the wider community here about what to do with Darius Dhlomo. I do not believe he should be unblocked without a more thorough review of his editing history, and not just based on his current talkpage expression of remorse. — Andrwsc (talk · contribs) 17:38, 4 September 2010 (UTC)[reply]

I'm not sure there is any need for this. He is blocked, and I have just declined his latest unblock request and directed him to consider WP:OFFER instead. Beeblebrox (talk) 17:44, 4 September 2010 (UTC)[reply]
Fair enough, but there were comments such as "I'm verging on unblocking" so I wanted to make sure there was actually consensus to do so (or not) instead of the decision being made by a single admin. — Andrwsc (talk · contribs) 20:41, 4 September 2010 (UTC)[reply]
  • Comment: not familiar with this case, but the way he repeatedly put "damage" in scare quotes in unblock requests makes me disinclined to give him another chance, at least without successful compliance with drastic restrictions (e.g. a 3-month ban on article edits, talk pages only). Rd232 talk 17:45, 4 September 2010 (UTC)[reply]
  • JamesBWatson sums up pretty well at user talk. I wouldn't support an unblock, per Beeblebrox and Rd232 above. --John (talk) 17:50, 4 September 2010 (UTC)[reply]
  • Comment Why in the world is "Community ban" in this? The editor doesn't get it ... there's no need to jump to that level of drama. As an admin who declined unblock once ... and who has actually tried to help him, and even at one point was prepared to support an unblock (but not anymore), as James has said, everything he types just makes it worse. WP:OFFER (talk→ BWilkins ←track) 18:11, 4 September 2010 (UTC)[reply]

Scale of the problem

What's the scale of the copyright problem here? I've identified these so far: Sammy Korir (2006), Joetta Clark (2007), Canyon Ceman (2008; notice removed by this same editor and still a copyright violation right now), and Phil McMullen (athlete) (2010). I am unable to determine why the 'bot thought that Kamil Damašek was a copyright violation. Is this all that there is? Uncle G (talk) 20:29, 4 September 2010 (UTC)[reply]

(de-indenting) Any talk of either an unblock or a community ban is premature until the CCI finishes. At first blush, though, this sure looks pretty grim. Nandesuka (talk) 23:15, 4 September 2010 (UTC)[reply]

  • I've read much of Darius Dhlomo's editing in my time here. This may sound drastic, but I would say that any article creations with more than four sentences of prose are highly suspect. Any edits adding three or more sentences of prose are also suspect. Believe it or not, despite the edit count, I guarantee that you will not find many edits which fall within this description. Most of Darius' larger edits will be adding tables/templates etc but I believe a small minority of these will yield copyright violations. I expect that these violations will be confined to large edits to biographies and event results articles (like those linked above). Sillyfolkboy (talk) (edits)Join WikiProject Athletics! 01:43, 5 September 2010 (UTC)[reply]
  • When one sees a CCI listing section heading that says "articles 9661 through 9664", it's fairly daunting. Any and all help is most welcome. Uncle G (talk) 01:48, 5 September 2010 (UTC)[reply]

Ban

Per the above, which appears to show that this person is a serial copyright violator, I propose a community ban. 160,000+ edits or not, we do not need these headaches. So what now, we will have to dig through all of his contributions looking for things he stole wholesale from others? Kindzmarauli (talk) 03:34, 5 September 2010 (UTC)[reply]

  • Could we get him involved in cleaning up his own copyvios? 67.122.211.178 (talk) 07:23, 5 September 2010 (UTC)[reply]
    • Given that they continually denied that there was any problem at all in the face of diffs, could we actually trust him to clean up his own copyvios? Normally I'm all for any help we can get with copyvios, but I'm not seeing any indication that they actually would (could?) help. VernoWhitney (talk) 12:29, 5 September 2010 (UTC)[reply]
    • 67.122.211.178, I suggest that you try doing so. Go to User talk:Darius Dhlomo and participate in the discussion of that very thing, there. Uncle G (talk) 12:54, 5 September 2010 (UTC)[reply]
      • He may be able to help identify sources, but my initial idea to have him review these to see which had more than a few sentences was obviously naive. I realized he was obscuring the problem, but I did not realize that he would obscure it so far as to assure us that this happened in no more than 15 articles, when this is so obviously not the case. We can't trust him to help identify issues. :/ --Moonriddengirl (talk) 13:49, 5 September 2010 (UTC)[reply]
        • Well he doesn't (looking at his talk page) seem to be able to help with sources all that much either. I was really disappointed with his reply to my questions - instead he seems to be suggesting that it is the fault of the community for not giving him harsh enough warnings and he felt that, despite the ones he got, he was doing a "fine job". It's not as if the notices are unclear about policy :( --Errant [tmorton166] (chat!) 14:03, 5 September 2010 (UTC)[reply]

(outdent) So nobody supports the community ban proposal? Kindzmarauli (talk) 23:31, 6 September 2010 (UTC)[reply]

  • I'd been waiting to see if he was going to be useful in evaluating his articles. Based on his talk page, it doesn't look like it. He has failed to identify several copyvios. At this point, I'm not sure who would be willing to lift his block. --Moonriddengirl (talk) 23:51, 6 September 2010 (UTC)[reply]
  • Right now we have an active indef block, which seems fine for the moment. I don't think the "nobody gave a final warning or shorter block" is a valid excuse for this, but it's something for the admin/CC community to learn from. Users as unresponsive as Darius sometimes require LART to catch on to how serious their problems are. 67.122.211.178 (talk) 03:22, 7 September 2010 (UTC)[reply]

Mass deletion: Give up and start over

Darius Dhlomo CCI case:

This has been raised above, inline, but I wanted to call it out for discussion here. As near as we can tell, everything this person has written that is longer than a few sentences is a copyvio. This means that the articles he has created -- barring the ones that are simple lists of data -- are, quite simply, poisoned foundations upon which we're letting others build.

I propose that we delete every article this user has created, with an exception carved out for data-only articles like lists of winners. This will no doubt be upsetting to those who have worked on those articles since, but I don't see any other way to fairly respect the rights of the original content creators and abide by our own policies. It's very likely that there are others willing to step in and create replacement articles afresh, and I'd rather encourage that than continue to build atop a weak foundation. The task being asked of the CCI - verify the copyright status of over six thousand freaking articles - is, quite simply, beyond what anyone should be asked to do. It is a Sisyphean task. So, if you'll permit me another Hellenic analogy, let's cut the Gordian knot and start with a clean slate.

Comments? Nandesuka (talk) 14:18, 5 September 2010 (UTC)[reply]

  • It's not a Sisyphean task. It's a Herculean task. It's the Augean Stables, to be precise. Uncle G (talk) 14:56, 5 September 2010 (UTC)[reply]
  • I am loath to do this under ordinary circumstances, but these circumstances are not ordinary. In addition to the tens of thousands of articles at this CCI, we have dozens of other CCIs, some over a year old, with additional tens of thousands of articles that need review...and where copying is not this blatant. In this circumstance, I'd support mass deletion or at least reduction of articles to a one sentence stub. (Please note: that's what we did with the last comparable CCI ([[1]]), and it still took us a year.) --Moonriddengirl (talk) 14:21, 5 September 2010 (UTC)[reply]
    • (I should note that the tens of thousands of articles to which I refer are not only the ones he's created, but the ones to which he's substantially contributed: hiding reverts and minor edits, that's 23,197 to be precise. --Moonriddengirl (talk) 14:29, 5 September 2010 (UTC))[reply]
      • 23,197 edits or 23,197 individual articles? Uncle G (talk) 14:37, 5 September 2010 (UTC)[reply]
        • Individual articles. Mind-boggling, I know, but to quote exactly: "23197 articles from timestamp 2005-11-09 06:15:32 UTC to timestamp 2010-08-30 22:03:25 UTC." I don't know how many edits that represents, but the first on his full contrib list shows 19 non-minor edits to a single article alone. --Moonriddengirl (talk) 14:46, 5 September 2010 (UTC)[reply]
          • When I started going over the biographies some hours ago I started to get a grasp of the scale of the problem here, and I came to much the same conclusion based upon the evidence before me that you, Mkativerata, and Sillyfolkboy all apparently already have: Any flowing prose in an article created by this person was written by someone else. It was either written by a subsequent Wikipedia editor or plagiarized from somebody else's writing by Darius Dhlomo. I thought that the problem wasn't going to get larger.

            However, that article count manages to do exactly that. My perspective on that is that it is of a similar scale as reviewing my contributions (under just this account, not my 'bots or before I had an account). I have, as Uncle G (talk · contribs), touched fewer pages, across all namespaces taken together, than that. Uncle G (talk) 15:23, 5 September 2010 (UTC)[reply]

  • I'd prefer a different approach, of finding some way in which we can rapidly trim the current CCI listing. Then we can review what is left to see whether we still have an unmanageable problem. Is there some set of criteria that we can mechanistically apply to rapidly eliminate the hundreds of 1-paragraph pretty much data-only stubs that this person has made? There are quite a few of them, and eliminating them I suspect would reduce the size of the problem significantly. Moonriddengirl, what is your view on the possible copyright infringement status of articles such as … spins wheel … Jennifer Whittle for example? Uncle G (talk) 14:37, 5 September 2010 (UTC)[reply]
    • Minimal creativity, minimal content. I would regard that as a safe stub. If those couple of sentences were highly idiosyncratic, I'd probably look for a source. :) (Compelled to come back and clarify: I'm not saying that could not be a copyvio; it could, if it copies from another sources and especially if it is one of dozens of articles he's copied from that same source, which would clearly not be a de minimis situation. This is a risk assessment question.) In terms of other alternatives, there is an image-based CCI on which I'm working that is not this scale where most of the images are free of copyright problems. I am mass sorting these to separate out the ones that need review. If we had somebody of great patience who could separate these articles according to "likely to be a problem" and "not at all likely to be a problem", that would help. I had planned to ask the contributor to do that himself, but, as I said above, I'm no longer sure we could trust him with that task. --Moonriddengirl (talk) 14:49, 5 September 2010 (UTC)[reply]
  • Concur this should be reserved for special circumstances and this clearly qualifies. Looking at the scale of the problem I don't see how a volunteer effort could clean all of that up. Extreme measures for extreme actions so I support letting the bots loose to undo what he has wrought. If some stubs are lost in the process they can always be recreated if they are notable by others. --WGFinley (talk) 15:07, 5 September 2010 (UTC)[reply]
  • This is an enormous number of articles; do we really think that these are entirely (or almost entirely) copyright violations? I don't know how representative a single selection could be out of thousands of articles, but Swimming at the 1997 Summer Universiade, for example, doesn't seem to be a copyvio (I couldn't find anything for it on google except sites which copied the wikipedia article). GiftigerWunsch [TALK] 15:11, 5 September 2010 (UTC)[reply]
    • I'm actually wondering how he hasn't had hundreds of warnings from the Coren searchbot by now... GiftigerWunsch [TALK] 15:15, 5 September 2010 (UTC)[reply]
      • No, I don't think they're entirely or almost entirely copyright violations. I think many of them are harmless charts and tables. I think, though, that the number of articles that are copyright violations will probably number in the hundreds. High hundreds or low hundreds? I don't know. --Moonriddengirl (talk) 15:33, 5 September 2010 (UTC)[reply]
      • This very probably is a weakness in the 'bot and in the Google Web approach. Take Mikiyasu Tanaka, for example. Picking some phrases at random from the article (e.g. "Tanaka was sent abroad by the Japan Olympic Committee to study volleyball") and giving them to Google Web doesn't turn up the FIVB profile that it was copied from. But it is a copy, nonetheless. The sentences are in a different order. But they are the same sentences, the only changes being things like exchange of proper nouns for pronouns and the like. (In the original, it is "he was sent abroad by the Japan Olympic Committee to study volleyball".) Uncle G (talk) 15:36, 5 September 2010 (UTC)[reply]
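The evasion pattern Uncle G describes (the same sentences, reordered, with pronouns swapped for proper nouns) defeats phrase-based web search, but it is catchable by comparing two texts sentence by sentence rather than as whole strings. The following is a minimal illustrative sketch using Python's standard-library difflib; the thresholds are arbitrary assumptions, and this is not a tested detector:

```python
import difflib
import re

def sentences(text):
    """Split text into rough sentences, normalized to lower case."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [re.sub(r'\s+', ' ', p).lower() for p in parts if p]

def best_match_ratio(sentence, source_sentences):
    """Highest similarity between one sentence and any source sentence."""
    return max(
        (difflib.SequenceMatcher(None, sentence, s).ratio()
         for s in source_sentences),
        default=0.0,
    )

def likely_copy(article_text, source_text, threshold=0.8):
    """Flag the article if at least half of its sentences closely match
    some source sentence, regardless of sentence order."""
    src = sentences(source_text)
    arts = sentences(article_text)
    if not arts:
        return False
    hits = sum(1 for s in arts if best_match_ratio(s, src) >= threshold)
    return hits / len(arts) >= 0.5
```

Because matching is per-sentence, reordering the sentences or substituting "Tanaka" for "he" barely lowers the similarity score, which is exactly the case web search misses.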
  • Giftiger, try this [2] for Swimming at the 1997 Summer Universiade. 81.145.247.158 (talk) 15:40, 5 September 2010 (UTC)[reply]
  • Rather than a straightforward mass deletion, may I suggest an element of triage? Some of the articles that this editor created will have been edited by others, and some will be more or less notable than others. If we identify and delete those that are tagged as orphans, unreferenced or tagged for notability would that leave us something more manageable? ϢereSpielChequers 16:04, 5 September 2010 (UTC)[reply]
    • Almost certainly not. I've reviewed a few hundred of the biographies now. Yes, it's less than 5% of the problem, but I was selecting at random, from the list before it was sorted, so I have little suspicion that my sample is biased. Notability is almost never an issue on which these subjects have been challenged or tagged. These are not exactly minor sporting figures and events. Likewise, orphan status would be problematic. Many of these articles are on navigation templates for sports teams, regular sporting competitions, and the like, and are unlikely to be orphans. (Quite a few cross-reference one another, too.) Nor, indeed, is lack of any citations a recurrent issue. Darius Dhlomo has linked almost all of xyr creations to on-line sports databases and the like. As criteria for filtering out the problematic articles, from what I've seen I suspect these won't be useful at all.

      I suggested that we find some filtering criteria, above. I haven't yet come up with any, and Moonriddengirl quite rightly notes, above, that it might not be safe from a copyright perspective to even do that. Even the 1-paragraph stubs might be a mass copying exercise, from some source that we are unaware of. All of us who have reviewed the article set so far seem to have come to the same conclusion, that Darius Dhlomo simply doesn't write original prose, at all, anywhere, even if it's only a couple of sentences to make up a small paragraph. Pick a couple of hundred for yourself, check them for copyright violations, and see what conclusions you draw.

      If you find from doing so some triage criteria that actually work in practice, that would be good news, of course. ☺ Uncle G (talk) 16:29, 5 September 2010 (UTC)[reply]

  • I'm going to support whatever Moonriddengirl thinks is best. I'm convinced we are going to have to take drastic measures here. If she thinks triage would work, fine, but on the other hand, I don't want to give already over-burdened editors doing herculean work in the copyright field even more to do. Dougweller (talk) 16:13, 5 September 2010 (UTC)[reply]
    • I don't know what bots can do. There are some good suggestions in this thread for narrowing down the list by presumptively deleting those that are least likely to impact others in the project, but I'm afraid that short of mass deletion the only way to process most of this is going to involve a human being (or two or ten) looking at each article. I would definitely support at this point simply wiping out creative text supplied by this contributor. But it's still going to take a ton of man hours just to review them all. --Moonriddengirl (talk) 21:19, 5 September 2010 (UTC)[reply]
      • I think some simple criteria can be defined and then a script can check all the articles against the criteria without humans having to look at them. Defining the criteria would take a little bit of work. Example criterion: find all articles that don't contain text added by humans other than Darius Dhlomo. The "text added by humans" part means ignore edits made by known maintenance bots (interwiki etc), edits to metadata only (like categories), or edits with certain strings in the edit summary indicating various script-assisted edits unlikely to add new human-written text to the article. Deleting those articles might shrink the problem by enough to make manual triage feasible for the remaining ones. 67.122.211.178 (talk) 22:23, 5 September 2010 (UTC)[reply]
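The criterion sketched above can be expressed as a simple filter over an article's revision history. Everything concrete below is an assumption for illustration: the bot names, the edit-summary markers, and the (user, minor-flag, bytes-added, summary) record shape; a real implementation would pull revision metadata from the MediaWiki API:

```python
# Assumed, illustrative lists; a real run would use the actual bot
# roster and whatever edit-summary conventions apply.
MAINTENANCE_BOTS = {"SmackBot", "Yobot"}
SCRIPT_MARKERS = ("AWB", "interwiki", "DEFAULTSORT", "HotCat")

def has_other_human_text(revisions, suspect="Darius Dhlomo"):
    """True if anyone besides the suspect (and known bots, minor edits,
    or script-assisted edits) appears to have added substantive text.

    Each revision is a (user, is_minor, bytes_added, summary) tuple,
    an assumed shape for this sketch.
    """
    for user, is_minor, bytes_added, summary in revisions:
        if user == suspect or user in MAINTENANCE_BOTS or is_minor:
            continue
        if any(marker in summary for marker in SCRIPT_MARKERS):
            continue
        if bytes_added > 0:
            return True
    return False
```

Articles for which this returns False would be the presumptive-deletion candidates; anything returning True still needs a human look, since another editor's text is at stake.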
  • How about running a script that identifies all of these articles with no more than 500 characters (or some other number) contributed by editors other than Darius Dhlomo. That might be most of the affected ones and they can then be deleted, making the problem a lot smaller. 67.122.211.178 (talk) 18:02, 5 September 2010 (UTC)[reply]
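That byte-count threshold is equally easy to state as code once per-revision byte deltas are available. The (user, bytes-added) record shape and the 500-byte default are assumptions for illustration only:

```python
def other_contribution_bytes(revisions, suspect="Darius Dhlomo"):
    """Total bytes added by everyone except the suspect.

    Negative deltas (removals) are ignored, since removing text does
    not add anyone else's creative content.
    """
    return sum(max(bytes_added, 0)
               for user, bytes_added in revisions
               if user != suspect)

def deletion_candidate(revisions, limit=500):
    """True if other editors contributed no more than `limit` bytes,
    i.e. the article is essentially the suspect's work alone."""
    return other_contribution_bytes(revisions) <= limit
```

The limit is a tuning knob: raising it shrinks the manual-review pile at the cost of deleting a few articles with small good-faith additions by others.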
  • As per my comment above, I support mass deletion here. As I read the evidence from Uncle G so far (thank you), it's unlikely we can find any safe triage parameters. Although there's no great rush so I'm more than happy to wait for suggestions.--Mkativerata (talk) 19:18, 5 September 2010 (UTC)[reply]
  • Comment: a skim of some smaller contribs suggests quite a few edits are just editing categories, DEFAULTSORT and the like. Can these be excluded from the listing using some automation? As for creations, I would suggest nuking the lot (perhaps leaving a log which someone like Article Rescue Squadron can then use, if they feel like taking responsibility for checking individual entries. In this case, for "nuking", read "userfying" or "incubating".) Rd232 talk 00:15, 6 September 2010 (UTC)[reply]
    • I don't think userfying or incubating copyvio articles gets rid of the copyright problems. It probably makes the problems worse. 67.122.211.178 (talk) 05:42, 6 September 2010 (UTC)[reply]
      • Properly userfied (or any incubated) articles are hidden from search engines, and so effectively not really published any more. So a bot can do this for all questionable pages immediately, and then a little more time can be taken to see if there's anything that can be salvaged. Then delete the remaining userfied/incubated pages. Incubated pages are deleted after a time anyway (1 month?). Of course they would need to be tucked away in a subsection or something of the Incubator, to avoid swamping everything else in it. Rd232 talk 09:10, 6 September 2010 (UTC)[reply]
  • Suggestion Just have a bot (you can use AWB) mark all of them as CSD: G12 for copyvio. They can then be deleted very easily and quickly. Presumably whatever admin gets to it will check the copyvio status. --Selket Talk 00:38, 6 September 2010 (UTC)[reply]
    • That would be throwing the baby out with the bathwater. I've been looking at random examples, & either I don't understand the criteria for copyright violations (although the example above for Swimming at the 1997 Summer Universiade was pretty obvious when I found the right revision), or I'm not picking the right examples. Anyone but the most painstaking Admin, when faced with all of those edits tagged as CSD will only examine so many at the beginning before either giving up -- or simply untagging the rest. (And if I understood the proper way to clear those which I don't think are copyvios, I'd offer a hand thinning out this list.) -- llywrch (talk) 02:58, 6 September 2010 (UTC)[reply]
    • Selket, we're talking about ten thousand articles, or maybe even 20,000+. It would take admins years to work through that many. 67.122.211.178 (talk) 05:42, 6 September 2010 (UTC)[reply]
    • Let's also not forget that {{db-g12}} has a mandatory url parameter; a reviewing admin should simply remove the template if it doesn't have a url parameter. How would we automatically figure out what url(s) the content was copied from? GiftigerWunsch [TALK] 08:27, 6 September 2010 (UTC)[reply]
  • Support deletion of them all. It's better to lose some good contributions than to keep so many copyright violations, and it's nearly impossible to check them one by one. Delete them all and ban the user. Fram (talk) 08:21, 6 September 2010 (UTC)[reply]
  • Support Mass deletion. There are no criteria that can reasonably exclude the right (or wrong) kind of content here. While the worry of losing hundreds of "our" articles is understandable, in reality we have to remember that articles built on other people's content aren't ours to begin with. MLauba (Talk) 08:46, 6 September 2010 (UTC)[reply]
  • Support mass article deletion. Over 20,000 articles is just too many to go through, presuming that most of them are copyright violations. I really don't see a way we can check them manually one by one. Bejinhan talks 10:24, 6 September 2010 (UTC)[reply]
  • Mass deletion is a bad idea, people. That would get our article on 1943 robotically deleted. What we need here, if we truly are going down this route (which I'm sure we're all very hesitant about), is a special-case speedy deletion criterion that we can apply, such as "created by Darius Dhlomo, no substantive content edits other than by Darius Dhlomo, and containing actual running prose commentary rather than just raw uncopyrightable names, numbers, and dates". We need community authorization for Moonriddengirl and other administrators to perform speedy deletions under that specialized criterion. Uncle G (talk) 10:54, 6 September 2010 (UTC)[reply]
    • It's not a speedy category, but we sort of already have authorization for that, though I don't trot it out under ordinary circumstances. Per policy at Wikipedia:Copyright violations: "If contributors have been shown to have a history of extensive copyright violation, it may be assumed without further evidence that all of their major contributions are copyright violations, and they may be removed indiscriminately." Ordinarily, when I encounter an article at CCI that seems to be a copyvio but I can't prove it, I use the copyvio template on the face and Template:CCId to give interested contributors a week to look at them and offer input. (Keeping in mind that by the time we get to CCI, we have verification of multiple violations of copyright policy; this template and this approach are not supported by policy where there is not a proven history of extensive copyright violation from a contributor.) The problem with this approach here is that all of these articles would be listed by bot at WP:CP, which would totally break the board. If we created a different template for the face that would not be placed at CP, but instead categorized by date, it could be manageable to have a bot tag at least the articles he's created. It still requires human review, which would be time consuming, but we could then delete or stub the ones to which he's added substantial text, remove the tag from those without. It's kind of a cross between the delete them all approach (which, having worked CCIs for some time now I understand in this case) and the legitimate desire not to lose more than we have to. --Moonriddengirl (talk) 11:47, 6 September 2010 (UTC)[reply]
      • If you want a category for the created articles to progressively de-populate, I could get Uncle G's major work 'bot to append Category:Articles created by Darius Dhlomo (or some template including it) to everything on VernoWhitney's original list. Would a category with just under ten thousand articles in it be useful? There would be no date or size sorting. Uncle G (talk) 12:58, 6 September 2010 (UTC)[reply]
        • Yes, that would be helpful. I was just working up a template based on {{copyviocore}} and {{CCId}} with a similar notion: User:Moonriddengirl/CCIdf. Not sure if we should blank these articles as we do with {{copyvio}} (if nothing else, it makes it clear that there's a time limit) or take an approach more like {{PROD}}, but since I used {{copyviocore}} it's created from the presumption of blanking. Would something like that be helpful? It's still going to be a ton of work, but it would make the job easier if we delay admin processing for seven days. At that point, we can G6 anything that meets the criteria: extensive creative content added by Darius Dhlomo that cannot be removed without leaving an unusable article (similar to G12). It also allows interested contributors an opportunity to get rid of all creative content added by Darius Dhlomo. But we'd need to do something to flag if the template is removed out of process; in copyright cleanup, that happens quite regularly. People don't want the article deleted, and they don't much seem to care if it's a copyvio or not. --Moonriddengirl (talk) 13:13, 6 September 2010 (UTC)[reply]
          • I'm rather taken with Rd232's idea of — well, to be frank — sharing the pain. I foresee three things needed:
            • A template for the 'bot to blank the article with (Community discussion needed: Should the 'bot blank the article?) I've got something a bit shorter than User:Moonriddengirl/CCIdf in mind. I'll be bold if you've no objection. I think that we might need two templates, one for the blanking and one for a deletion nomination.
            • An explanation of the 'bot's task, to be used in all 'bot edit summaries so that people seeing their watchlists light up with a thousand blanked articles have somewhere immediately to go for an explanation.
            • Instructions for editors on what to do now, to be linked to from the template notice.
          • Additional points: The template notice must be carefully worded. I don't think it fair to have Darius Dhlomo's name come up all over Google. The instructions must be clear that this is a complex task that can end in a multiplicity of outcomes (from {{copyvio}} to simple removal of the template). The instructions must also be clear that editors restoring content shoulder the responsibility for doing so. All of the notices and instructions must be hashed out before the 'bot begins. And we need community attention given to the fact that a 'bot is about to mass-blank some ten thousand articles. (I've updated centralized discussions to bring more attention here. I've also put a notice on the 'bot owners' noticeboard.)

            I cannot help you with noticing template removals. But there are other 'bot owners who probably can. There are 'bots that note Proposed Deletion challenges. Uncle G (talk) 14:09, 6 September 2010 (UTC)[reply]

            • No objections from me. :) I'm all about getting the work done, however we can best do it. You can change my mock-up directly or build your own, whatever works. --Moonriddengirl (talk) 14:34, 6 September 2010 (UTC)[reply]
              • I just took a look at User:Moonriddengirl/CCIdf, but I think it's a little impractical, and should be closer to what I suggested below: it should be removed by anyone who feels that they have addressed the copyright concerns. If it remains after a week, it should be deleted as IAR in a similar way to a PROD. Placing a template like this such that admins have to do all the work would mean this problem is likely never to be solved. Once the majority of the articles are deleted, because no one has challenged their deletion as copyright vios, we can manually check anything that remains to confirm that the copyright concerns were properly addressed by the users who removed the template. GiftigerWunsch [TALK] 14:39, 6 September 2010 (UTC)[reply]
              • I don't think the articles should be blanked, either; that's likely to just impede evaluation of the articles. GiftigerWunsch [TALK] 14:41, 6 September 2010 (UTC)[reply]
              • Note: I have just created an alternative draft version of the template in my userspace; comments are welcome. GiftigerWunsch [TALK] 14:48, 6 September 2010 (UTC)[reply]
    • ? Darius didn't create 1943. Is it being suggested to delete all articles Darius has ever touched? I thought we were deleting all Darius creations in order to cut the size of the list of contributions needing review. Maybe do that, and blank/special tag all articles he's touched (excluding identifiable cases of minor changes only, like categories)? Rd232 talk 12:06, 6 September 2010 (UTC)[reply]
      • 1943 is on the list of articles that we have. It's on page 24. (Yes, I've been to page 24. I've even tagged 1943 as not a copyright violation.) Mass deletion of everything on the list gets 1943 and many other such articles deleted. We don't need mass deletion, and mass deletion wouldn't be right. What we need is (a) community confirmation that it's an acceptable loss to the project to lose articles such as Paul Easter and Yohann Bernard, (b) community confirmation that we don't trust any running prose content by this editor not to have been copied from somewhere, and (c) community confirmation that it's an acceptable risk to the project to retain articles such as Matías Médici and Franklin Chacón. Uncle G (talk) 12:37, 6 September 2010 (UTC)[reply]
        • I don't think anyone has proposed mass-deleting every article Darius contributed to. Just the ones he created, and (in some versions) just the ones he created that meet certain other criteria (like absence of substantial contributions from other users). So bringing up 1943 is a red herring. 67.122.211.178 (talk) 17:55, 6 September 2010 (UTC)[reply]
  • Alternative: perhaps there is a slightly less destructive way to go about this. Instead of outright deleting every article the user created, why not PROD them all as being potential copyvios, directing to this discussion, with the added condition that users removing the PROD are asserting that they have looked for copyvios and dealt with any they've found? This could be explained in the prod message. The majority of the articles will probably be left for the PROD to expire, in some cases users who have contributed a decent amount to the articles will deal with the copyvios and remove the prods, and then we're just left with a hopefully much smaller number of articles where the prod has been removed without the copyvios being solved; whichever articles survive can then be checked by those who are willing to help solve this case. Any thoughts? GiftigerWunsch [TALK] 12:23, 6 September 2010 (UTC)[reply]
  • The sheer number of copyright violations uncovered here makes it necessary to take drastic action—even if that means deleting thousands of articles. As Moonriddengirl says, there is already a process in place to do this, and I trust the people involved with this to handle the deletions in an appropriate way. GiftigerWunsch's suggestion in the above post is reasonable; {{CCId}} (which Moonriddengirl mentioned above) can be used for that purpose. Ucucha 12:34, 6 September 2010 (UTC)[reply]
    • Whoops, I hadn't read Moonriddengirl's suggestion above; I guess great minds think alike ;) GiftigerWunsch [TALK] 12:42, 6 September 2010 (UTC)[reply]
      • I guess so. :) I'm posing some more ideas on this a bit higher up. --Moonriddengirl (talk) 13:13, 6 September 2010 (UTC)[reply]
        • I think the best thing to do is have a bot deal with every affected article. Initially, this would be page blanking, and replacing with a Darius-related explanatory note template. The copyright problem then goes away immediately, and more time can be taken to rescue articles, with effort drawn from lots of people not normally active on copyright issues. Then, after say 30 days, mass-delete (or possibly mass-incubate, if to a special subsection to avoid swamping the WP:INCUBATOR) any still tagged. Use a bot to notify major contributors of the action, and thus spread the work involved very very widely. Rd232 talk 13:22, 6 September 2010 (UTC)[reply]

Mass blanking of ten thousand articles by a 'bot

Just to make this clear, with its own section heading, here's the idea:

  • A 'bot goes through everything on VernoWhitney's original list of some ten thousand articles, which are the articles that Darius Dhlomo created. It blanks the article and replaces it with a template notice.
    • Community discussion required: Should the 'bot blank the article? This is a copyright issue.
  • I volunteer Uncle G's major work 'bot for the task.
    • Community discussion required: This 'bot does not have the 'bot flag. The 'bot flag will stop people's watchlists lighting up with thousands of blanked articles. I'm happy to have the 'bot flagged for the duration of the task. But do we want people not to notice?
  • The edit summaries of the 'bot link to an explanation of the 'bot's task, which gives people something to look at straightaway to see what's going on and why.
  • The notice itself links to instructions for editors, with a clear procedure for assessing copyright infringement.
  • The notice also categorizes all articles into a (hidden) category Category:Articles created by Darius Dhlomo.
  • Editors assess the status of the article and act appropriately.
    • Community discussion required: Do we put a time limit on this? Do we incubate any articles left blanked after 30 days, per Rd232 above? Or do we just leave them blanked long-term for people to address at leisure?
  • We provide a streamlined version of {{copyvio}} that is dedicated to this task and that doesn't have all of the overhead of the normal procedure. An article tagged with the special process can go straight to deletion assessment by an administrator without the additional listing overheads and suchlike. But administrators can only delete such articles if they were previously tagged as part of the cleanup effort in the first place. (Vandals don't get to abuse the template in the obvious way.)
    • Community discussion required: Could we just re-use the existing speedy deletion notices instead?

None of this is happening immediately. The relevant notices and templates need to be set up before we even think of starting such a 'bot task. This is a proposal, condensed from the above. There are questions yet to be answered. Note that it addresses just under ten thousand articles. There are just over thirteen thousand articles in the cleanup list. If all goes according to plan, this will let the CCI folks reduce the list to just the three thousand or so articles touched by Darius Dhlomo but not created by xem.

The major advantage of this over the mass deletion idea is that it shares the task around ordinary editors, rather than concentrating it in the hands of a handful of administrators. Everyone has the tools to action the next step after a blanking.

Please discuss. Uncle G (talk) 14:53, 6 September 2010 (UTC)[reply]
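For concreteness, the core text transformation such a 'bot would perform can be sketched in a few lines of Python. This is only a sketch: the notice template name is a made-up placeholder, and keeping existing category and interwiki lines (rather than blanking them too) is one possible behaviour, not a settled decision.

```python
import re

# Placeholder names: the real notice template would come from the cleanup
# pages linked in the proposal above; the category is the one it proposes.
NOTICE = "{{subst:CCI-Dhlomo-blanked}}"
HIDDEN_CAT = "[[Category:Articles created by Darius Dhlomo]]"

# Category and interwiki lines ([[Category:...]], [[fr:...]]) carry no
# creative text, so they could optionally survive the blanking.
CAT_OR_IW = re.compile(
    r"^\[\[(?:Category|[a-z]{2,3}(?:-[a-z]+)?):[^\]]+\]\]\s*$", re.M)

def blanked_wikitext(old_text):
    """Return the replacement wikitext for one article: the notice, any
    kept category/interwiki lines, and the hidden tracking category.
    The original prose survives only in the page history."""
    kept = [m.strip() for m in CAT_OR_IW.findall(old_text)]
    return "\n".join([NOTICE, ""] + kept + [HIDDEN_CAT]) + "\n"
```

A wrapper using a framework such as pywikibot would then loop over VernoWhitney's list, assign `page.text = blanked_wikitext(page.text)`, and save with an edit summary linking to the task explanation page.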

  • Support: this seems more sensible than mass deletion. May I assume that the categories, tags and external links will be left unblanked, as they are unlikely to constitute a copyvio? Also, would it be possible to run this using a bot without the bot flag? Like many users I ignore bot edits on my watchlist, so I would be unaware of any articles that I'm interested in being blanked. Also, is there any chance of a second bot run going through these articles and comparing the last version edited by the copyright violator with the version before it is blanked, as there is no point losing such articles? ϢereSpielChequers 15:40, 6 September 2010 (UTC)[reply]
    • There is some merit to the idea of saving the categories/tags (and sigh, maybe the extlinks too), though that info can be saved elsewhere even if the articles are deleted. 67.122.211.178 (talk) 18:10, 6 September 2010 (UTC)[reply]
    • It's either blanking everything or blanking nothing and prepending/appending the template if you want my 'bot to do it. It's only geared for the simplest of content work. I could write something to do more complex editing, but I don't have such ready to hand.

      There's nothing inhibiting a 'bot from running that doesn't have the flag. The flag allows everyone to exclude 'bot-marked edits from various lists, like watchlists and recent changes. Usually 'bots don't make edits that are interesting to recent changes patrollers or people with watchlists. The question is whether that's true in this case. Please discuss.

      As for the last, note that a blanked article is not the end point, but a mid-way point in the process. The whole idea of blanking, rather than deletion, is that we don't lose edit history unnecessarily, and that anyone with the ordinary edit tool can recover content if there turns out to be no copyright violation. Uncle G (talk) 16:38, 6 September 2010 (UTC)[reply]

      • IMO the bot should be flagged (RC patrol has enough to do without dealing with this) but should log all its actions on some special pages that everyone can review. That includes describing its analysis of pages that it then decides not to edit, so the log pages would be more informative than the bot's contrib history. The bot should probably run under a specially made new account for this purpose too. 67.122.211.178 (talk) 22:16, 6 September 2010 (UTC)[reply]
  • Comment Do you think the bot could identify and tag separately those articles with > 500 edits, or something in that neighborhood? At that point there is likely to be little residual copyvio, so it needs to be looked at differently. Also, would the coren copyviofinderbotthingy (phew) bot be able to check a category, and would it be any use? 69.236.190.48 (talk) 16:08, 6 September 2010 (UTC)[reply]
    • If VernoWhitney is capable of coming up with separate lists of such articles to work from, I can certainly tag them differently for each list. I don't know what CorenSearchBot is capable of in respect to scanning old revisions of existing pages in categories. Uncle G (talk) 16:38, 6 September 2010 (UTC)[reply]
    • I don't believe that articles with >500 edits necessarily have little residual copyvio. Look at the history of Between Silk and Cyanide, which I just had to mostly-blank because somebody inserted a copyvio in 2006, even though dozens of other editors worked on the article after that. The additional editing expanded the article a lot and smeared the copyvio all over the article, so it was no longer revertable in one lump. There is a comment on the talk page explaining further. I do think articles with substantial text added by human editors (bots edits don't count) other than Darius should be flagged. 67.122.211.178 (talk) 18:10, 6 September 2010 (UTC)[reply]
      • Well, I think that articles with 500 edits generally are large enough they would need human review the most of any of the articles. NativeForeigner Talk/Contribs 18:56, 6 September 2010 (UTC)[reply]
        • Yes. Perhaps the bot could blank those in two edits. It would first detect all the text that had been added by Darius (as opposed to other editors) and mark that text with a font or color change and save it. Then it would blank the whole article and save again. People wanting to restore the article could look at the marked-up version in the history and use that to help separate Darius text from non-Darius text. 67.122.211.178 (talk) 22:29, 6 September 2010 (UTC)[reply]
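The two-edit idea could lean on Python's standard difflib for the mark-up pass. This is only a sketch: the bold "DD:" marker is invented, and real attribution would have to walk every revision pair in the article's history rather than compare just two versions.

```python
import difflib

def mark_added_lines(before, after, marker="'''DD:''' "):
    """Return `after` with lines absent from `before` prefixed by a
    marker, crudely highlighting one editor's additions between two
    revisions."""
    out = []
    for line in difflib.ndiff(before.splitlines(), after.splitlines()):
        if line.startswith("+ "):
            out.append(marker + line[2:])   # added line: mark it
        elif line.startswith("  "):
            out.append(line[2:])            # unchanged line: keep as-is
        # "- " (deleted) and "? " (intraline hint) lines are dropped
    return "\n".join(out)
```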

Prod-like proposal

I thought I'd move my proposal here as it's getting lost in the discussion and I feel it would be beneficial. I propose that all articles that Darius has created be tagged with a template similar to a PROD (I've created a draft here which all are free to edit and comment on), such that the article will be automatically deleted after 7 days (or perhaps longer, depending on consensus). Like a prod, anyone can remove the template, but unlike a prod, in doing so they are asserting that they understand copyright violation, and have thoroughly checked the article and fixed any copyvios. This would be clearly explained on the template.

Any articles which are not checked or which cannot be saved will still have the template after the time has expired, and will be deleted per WP:IAR. Those articles which survive can then be double-checked to confirm that they are not copyvios.

Hopefully during this process, a large number of editors will have noticed that an article they've contributed to is being sorta-prodded, and will help to remove copyvios and then remove the template. Any articles where no one noticed the sorta-prod deletion are acceptable losses, being deleted per usual PROD rules anyway. GiftigerWunsch [TALK] 15:02, 6 September 2010 (UTC)[reply]


  • This makes much more sense than a mass deletion. I support this. elektrikSHOOS 16:57, 6 September 2010 (UTC)[reply]
    • By this process, the articles will be automatically deleted unless somebody objects. By the process above, human review is needed, but articles that are not infringements will be salvaged. --Moonriddengirl (talk) 16:59, 6 September 2010 (UTC)[reply]
      • On the other hand, manually reviewing thousands of articles requires an enormous amount of time and resources; alerting those who have contributed to the articles, so that they can help sort out the copyright issues before the articles are deleted after a fixed period, means that the job will be distributed among many people, and hopefully achieved in less time. GiftigerWunsch [TALK] 17:50, 6 September 2010 (UTC)[reply]
        • On the gripping hand, when the 7 day Proposed Deletion period expires on ten thousand articles all at the same time, we're right back in the same position that we started from. ☺ Uncle G (talk) 18:41, 6 September 2010 (UTC)[reply]

There's not that much difference between this and the above proposal, except in the matter of imposing a time limit, and defaulting to automatic deletion. Defaulting to automatic deletion will get screams of outrage from people who find out after the fact, I predict with a fair degree of confidence. (That's in part why I've pointed to this discussion on as many noticeboards as I have. I want to reduce the number of people who find out after the fact.) "Why wasn't I warned before you went off and deleted thousands of articles relevant to my WikiProject? I'm going to abuse an administrator for this!", they'll say. Leaving articles blanked, with just a warning notice on them, to be addressed at somewhat greater leisure, avoids that drama before it starts. It also addresses the concern that people have — that we all have — about not losing articles that aren't copyright violations at all. Uncle G (talk) 18:41, 6 September 2010 (UTC)[reply]

The above proposal is now more concrete. I've boldly updated User:Moonriddengirl/CCIdf and written Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation (for edit summaries) and Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help (for the notice). Please review, discuss, and boldly improve. Uncle G (talk) 18:41, 6 September 2010 (UTC)[reply]

I'm somewhat concerned, though, that "at somewhat greater leisure", considering that we're talking about thousands of articles, is going to mean months, or longer. If the articles are all going to be blanked until such time as an administrator (or at least other editors) manage to check them for copyvios and restore them, what's the difference? If the deletions are found out after the fact, interested parties could request undeletion so that the article could be rechecked for copyvios, the references could be salvaged, the article could be rewritten, or whatever else. GiftigerWunsch [TALK] 19:25, 6 September 2010 (UTC)[reply]
Thousands don't need to take months if everybody pitches in. I haven't seen you at CCI yet ;) fetch·comms 02:12, 7 September 2010 (UTC)[reply]
Fetchcomms, rather than looking at 1000's of articles for a few seconds each, I wonder if you could help investigate Rocío Ríos (see section below). That boilerplate text came from somewhere. If we can't establish where, then we can't rely on whatever processes you're proposing to use on the other 10,000 (or 20,000 or 40,000 depending on how you count) affected articles. 67.122.211.178 (talk) 09:13, 7 September 2010 (UTC)[reply]
I like this suggestion. Will discuss more at Nuclear option below. --Moonriddengirl (talk) 12:14, 7 September 2010 (UTC)[reply]

When voting is not evil

I'm suggesting something that is "out of the box" to help solve this problem. The outlines of this suggestion are as follows:

  • We have a real, majority-rules vote on whether to mass-delete all of the articles DD created. Or to adopt Uncle G's bot proposal above.
  • A username's vote is counted only if that username has reviewed 10 or more of the articles listed at Wikipedia:Contributor copyright investigations/Darius Dhlomo or one of the associated pages.
  • One can vote more than once (i.e., review 20 articles, you get 2 votes), under different usernames, etc. One doesn't need to be an Admin.
  • The vote is a simple yes/no. Either we do it or we don't.
  • At the end of a week or 10 days, the votes are counted.

What I like about this idea is that even if the result is to mass-delete these articles, some will have been examined & determined not to be copyvios. (Based on previous discussions on AN/I, this could lead to as many as 100 people participating, which would mean at least 1000 articles examined.) Hopefully, this would give us further information about a more precise filter for where the copyvios are & aren't. Thoughts? (And while the conversation continues, I'll be working thru the list of possible copyvios; I know I can't save all of the non-copyvios, but I know I can save some.) -- llywrch (talk) 17:04, 6 September 2010 (UTC)[reply]

  • This doesn't sound like a good idea to me. I'm even skeptical that any of us can review an article and determine the absence of copyvio. Even if the article has no text (e.g. just a table), maybe that table was copied from somewhere. And anyway, voting is always evil, and the likelihood of getting 10000 articles manually reviewed reliably is very small. If you've got a criterion like "all info is in a table and there are no strings of more than 5 consecutive english words" we could run a script that finds and lists such articles. That might be quite a lot.

    Anyway, reviewing some tiny fraction (10 or 20) of the articles shouldn't give a special say over the rest of them. If you want extra authority over the whole collection of articles, you have to review all of them. Under your voting scheme, voters should also accept (and be assigned) responsibility for any copyvios that later turn up in articles they have declared to be free of copyvios. Otherwise your suggested voting system gives influence-seekers an incentive to "review" articles as quickly as they can, and potentially miss a lot of bad stuff. 67.122.211.178 (talk) 18:12, 6 September 2010 (UTC)[reply]

  • If a human can't properly determine what is a copyvio, then how can one tell a bot how to do it? And the idea of giving "a special say" isn't that: it's to encourage people to actually tackle the problem of cleaning up this mess, rather than the usual process of talking about the problem. (Out of over a dozen posters to this lengthy thread -- of whom three want a mass deletion -- less than half have even left evidence that they reviewed any of the entries at the CCI.) People here appear a lot more eager to tell us what the solution is & expect someone else to do it, than to actually help fix the problem. And fixing the problem isn't hard, just tedious, & would be handled quickly enough if enough people spent their time there & not at this DramazBoard. -- llywrch (talk) 21:43, 6 September 2010 (UTC)[reply]
  • The difficulty of determining what is a copyvio is at the root of the mass deletion proposal. The proposal is basically that if the article was created by Darius, there is an a priori likelihood that it is a copyvio whether we can find the original source or not, so we should delete it in the expectation that someone else will eventually recreate it if the topic is important. As for special say, no, sorry, if you review 20 articles, that's less than 0.1% of the articles affected, and you can't say "well look, I've made a significant dent in this problem" because 0.1% is insignificant. If you review 5000 of the articles, then your argument may have a bit more validity. That's why most of us (I think) are giving high credence to the views of Moonriddengirl, because of the enormous amount of time she's spent dealing with this sort of problem. Looking at 20 or 50 or 100 of these Darius spewings doesn't hold a candle compared to that. 67.122.211.178 (talk) 22:01, 6 September 2010 (UTC)[reply]
  • Yeah, I've looked through about 40 pages he created, and I have to say that, while most are not vios right out of the box, some are, while some are added later. There's no way to figure out any pattern unless you get through at least a couple hundred. I disagree that we should just delete assuming that they're all vios, which they obviously are not. Manual is hard, but not impossible if everyone helps. These articles are of notable people, and shouldn't be deleted on the assumption that someone will eventually create a non-cv version of them. fetch·comms 04:28, 7 September 2010 (UTC)[reply]
  • Not impossible if everyone helps -- but why should people spend their time that way? [S]houldn't be deleted on the assumption that someone will eventually create a non-cv version of them. I don't know what the issue is, we have 3.3 million articles that all got created somehow, and we're talking about a bunch of almost content-free stubs once the presumed copyvios are removed. Maybe it's just me but I don't see much point in getting attached to such articles. Nobody is proposing to salt them from recreation. We got along without them up to when they were created (some fairly recently) and (for most of them) nobody else thought they were interesting enough to edit substantially. I could see some value to keeping the names/references/categories in a list someplace. We already have the names in the CCI report, if that helps. 67.122.211.178 (talk) 08:10, 7 September 2010 (UTC)[reply]
  • Well, if we just reword all of the stubs and delete the (few) longer articles he's written or expanded with prose (not just standard tables), while keeping them blanked until a rewrite (as the option below outlines), I see no reason why they all need to be removed. Hiding the cv while preserving the content history sounds like a fair compromise. fetch·comms 00:14, 8 September 2010 (UTC)[reply]

Triage criteria

Let's talk a little bit about triage criteria. All these points are open to discussion, but initially for this purpose I'll assume:

  • "Triage" is a process designed to be carried out by a bot or script that examines (in some way) all 20000+ articles that Darius has touched, and labels those that meet given criteria that we're trying to specify here. The script should run with no human intervention and not much human review of the final output. Of course we'd first run it on smaller sample sets and examine the results carefully to tune the criteria. Triage should divide all those articles into several possible categories, such as:
    • Articles needing no special attention (Darius's only edits didn't add any significant content)
    • Articles presumed copyvio and which should be deleted or blanked without additional attention (all significant content in the article came from Darius)
    • Articles presumed containing copyvio but which should get careful attention anyway (e.g. article contains significant content from both Darius and others)
    • Articles that the bot isn't sure how to classify, but that a human can probably tell with a quick look.

By "manual edits" I mean edits to an article made manually by human editors. "Human" is specified because edits by bots shouldn't count for this purpose. (A spot check of the Darius-created articles indicates that the majority of the edits in them are probably bot edits). Script-assisted human edits (routine maintenance scripts) mostly shouldn't count either. "Text" means any sequence of more than 5 consecutive words in the article body (not in category or interwiki tags). Here is a simple proposal for criteria and labels:

  • Articles that contain no text added by Darius (just tables with names and numbers) => no attention needed
  • Articles that contain text added by Darius and no text added by others => delete
  • Articles containing text added by both Darius and by others => if text is more than 80% Darius, then delete, else flag for attention
  • Articles with 2 or more manual edits from editors other than Darius => these may be of interest, flag these in a sample set and study for further ideas.

Any thoughts?

67.122.211.178 (talk) 19:11, 6 September 2010 (UTC)[reply]
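To make the criteria above concrete, here is a sketch in Python. It assumes the text added by Darius and the text added by others have already been extracted into two strings — that extraction, from the revision history, is the genuinely hard part and is not shown. The more-than-five-consecutive-words test is the "text" definition given above; using character counts as the proxy for the 80% rule is an assumption.

```python
import re

# A run of more than five consecutive words (i.e. six or more), per the
# working definition of "text" above.
WORD_RUN = re.compile(r"(?:[\w'-]+ +){5}[\w'-]+")

def has_prose(wikitext):
    """True if any single line or table cell contains a run of more than
    five consecutive words; words in separate cells don't count as
    consecutive, so we split on table and line separators first."""
    for chunk in re.split(r"\|\||\n|\|", wikitext):
        if WORD_RUN.search(chunk):
            return True
    return False

def triage(darius_text, other_text):
    """Label an article from the additions of each group of editors,
    following the proposed criteria."""
    if not has_prose(darius_text):
        return "no attention needed"
    if not has_prose(other_text):
        return "delete"
    d, o = len(darius_text), len(other_text)   # crude proxy for text share
    return "delete" if d / (d + o) > 0.80 else "flag for attention"
```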

I like it, but is it technically feasible to separate these articles? If so, go for it. fetch·comms 19:54, 6 September 2010 (UTC)[reply]
Yeah, most of the above should be doable. I may try coding something later today. 67.122.211.178 (talk) 20:17, 6 September 2010 (UTC)[reply]
This sounds really interesting; I'll look forward to seeing what you can come up with. I would, though, want to find some way to exclude those articles that have been cleared already through the CCI. We've got some really good volunteer work going on there. :) --Moonriddengirl (talk) 23:16, 6 September 2010 (UTC)[reply]
For the moment, I'm using User:CorenSearchBot/manual as a double-check; it's a bit buggy and not very reliable (I caught two vios it missed), but might help some users. It definitely needs to be reviewed by a human, but for articles that are already two-sentence stubs, there's really no way it can be a cv if the bot passes the second sentence (as the first is usually changed to be MOS-compliant). If anyone can find a better cv-checker process, however, please tell everyone. The Earwig's tool only does one page at a time and he told me that he doesn't have time to let it process multiple ones at once. fetch·comms 00:20, 7 September 2010 (UTC)[reply]
I thought we'd already concluded that a non-finding from the searchbot doesn't really tell us much. We also don't know about possible copying from materials like printed almanacs and magazines, that have never been online so won't show up in any search engines. We're left treating all Darius text additions as vios whether we locate a source or not. Are we still going by that approach? Are there enough different views on this that we should open a discussion section about it? 67.122.211.178 (talk) 01:13, 7 September 2010 (UTC)[reply]
There are a lot of stubs he made that can't be vios. I mean, "X (born [date] in [place]) was a [occupation]" is a standard first sentence; if that was a copyvio, then it'd be pure coincidence. Many of the vios I saw are that he creates a page (most back in 2006), then comes back in 2008 to add the cv in. Now, this isn't the case for every article, as he also creates cvs at the beginning, but deleting everything he created doesn't help. I also have seen several instances of what he has copied being changed over the years so that it is no longer a cv at this point. The only way to do this right is not to take the easy way out and delete all the pages, but rather to separate the more-likely copyvios (excluding category/template/table-only changes and going through the rest). For the issue of print sources, it may be better to stubbify articles to which he has added more than a couple sentences but do not appear to have been taken from Internet sources. fetch·comms 02:07, 7 September 2010 (UTC)[reply]
How do we know that the people in those articles even exist? See trap street. If Darius got a sports almanac and entered info about some fictitious player who the almanac writers invented, that's a copyvio even if no words were copied. I just can't work up much motivation to try to retain articles that nobody other than Darius contributed to. Most are quite unencyclopedic, sort of a phone book about athletic events. 67.122.211.178 (talk) 04:18, 7 September 2010 (UTC)[reply]
I'm not much of a deletionist/inclusionist kind of guy, but I don't think that it's fair to just delete all these potentially useful articles of notable people (to some degree). The issue of fictitious persons in his sources can't really be helped, I guess; we could basically delete every article without a source under that premise. I know that going through the list manually is not desirable, but I'd rather check everything and salvage what we can. fetch·comms 04:24, 7 September 2010 (UTC)[reply]
We use an WP:AGF approach to normal contributions, but in Darius's case there's such a rampant pattern of vios that we may be better off treating every one of his contributions as tainted. I asked on his talkpage if he copied from any print sources, but he hasn't responded yet. The articles that don't contain (presumably copied) text inserted by Darius are mostly uninformative stubs, not all that useful compared to just using a search engine. We do in fact now have a policy (being implemented in stages that are still under way) of deleting all unsourced BLP's. 67.122.211.178 (talk) 06:03, 7 September 2010 (UTC)[reply]

Nuclear option

Above I suggested blanking articles and instigating a mass checking effort. But at this point I reckon triage here should involve excluding edits which cannot be copyvios, particularly by virtue of being too small, or merely changing categories etc. Everything else should be presumed copyvio of one form or another, possibly from print sources (and therefore hard to impossible to identify). All the evidence (and Darius' inability so far to point to things which are not copyvios is damning) suggests to me that Darius simply does not write substantive prose. My feeling is he's one of those people who (possibly English not first language?) isn't confident writing, and so virtually always copies with some minor modification. This would explain his ignoring the warnings - he felt simply unable to contribute without doing it in this copyright-violating way. In logical consequence, all prose he's ever written which remains in articles should be deleted as being a copyvio of something or other. This seems to be a situation where it is unreasonable to say "let's see which of these are copyvios, and remove it if proven". "nuke the entire site from orbit. It's the only way to be sure." I'm not even basing my view on the amount of work involved in checking: I'm basing it on the unacceptable likelihood of large numbers of copyvios not being identified, and so retained - especially if the checking is done by people not familiar with copyright checking. So i) bot-blank and tag all affected articles, excluding whatever can be excluded; ii) the tag requires all Darius prose to be removed for the article to be restored (this has the tremendous advantage of simplicity); iii) allow a long time to handle blanked articles, say a year, then delete any that are left. Rd232 talk 12:00, 7 September 2010 (UTC)[reply]

I'm beginning to find the sprawling proposals very hard to follow here. :D I agree with you that what cannot be copyvio should be excluded. We have traditionally excluded edits below 100b as likely to be de minimis at worst. It's not perfect, but it's workable. (Workable matters. This is just one CCI. We have several dozen still open, some of which are over a year old.) In this case, I agree that we need to consider that all creative content added by this user is a copyright problem that needs to be removed. I like the tag modification as made by Uncle G (see section above): User:Moonriddengirl/CCIdf. The combination of that tag, Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation and Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help would invite all interested contributors to help out. The only thing we might wish to reconsider is how they can recognize problems. Given that there is a risk of offline sourcing (so far, none found, but some of the sources that have been found didn't show up in a Google search engine), we should presumptively remove or rewrite all of his creative content. Perhaps those who want to help out should be invited just to identify what creative text he added and remove or rewrite it. --Moonriddengirl (talk) 12:23, 7 September 2010 (UTC)[reply]
My bad. I guess what I'm really doing here is arguing that the tag directions to editors finding the blanked page should not tell people to check whether there is a copyvio, but simply to remove (or rewrite) any Darius prose, because any such prose almost certainly is a copyvio. And it's an error-prone waste of time trying to prove a negative. Beyond that, this is basically the "Mass blanking of ten thousand articles" proposal. Rd232 talk 12:57, 7 September 2010 (UTC)[reply]
I agree. I think these two proposals merge well together. --Moonriddengirl (talk) 13:08, 7 September 2010 (UTC)[reply]
A lot of what Darius wrote were two-line stubs that probably aren't vios (and if they were, it'd only be the second sentence because the first would have had to be changed to match the MOS) as he uses fairly plain wording everywhere: "X is an [sport] player. Xe won the [medal] with Y country in the Z Olympics. Xyr personal record was [time] at [place].", etc. So, I agree that everything he wrote has the potential to be a vio, but for many of the stubs, just doing a 30-second reword of the only possibly offending sentence or two should be enough to remove any lingering doubt. Until we can run through all those, just keeping it blanked ought to do fine. As long as a query can eliminate the minor diffs (categories, templates, tables, etc. changed only), then we have a lot less to worry about. fetch·comms 00:11, 8 September 2010 (UTC)[reply]

Simple proposal

This has three parts:

  1. Skip the articles where he changes less than 200b. He's either adding a row to tables, adding categories, or adding templates.
  2. Everyone stop worrying about this AN/I thread and go check some articles. If there are 23,000 articles on the list, 100 users can go through 230 articles each and that will be that. 230 is not a lot, considering I went through 20 in about three minutes last night, which were ones where about 100kb was changed. If we just skip those, and start at the end, we can eliminate a good portion of the articles as cv-free, and deal with the likely cv ones.
  3. Blank all articles he created, and make a separate list of those to which he added more than 1,000b, which need individual examination.

Otherwise, the triage idea above seems good. fetch·comms 19:54, 6 September 2010 (UTC)[reply]
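The size triage in part 1 of the proposal could be sketched in a few lines of Python. The (title, bytes-added) pair format below is an assumption for illustration only; the real CCI listing would first need to be parsed into this shape.

```python
# Illustrative triage sketch (not the CCI tool): split a listing of
# (article title, bytes added) pairs around a size cutoff. Edits below
# the cutoff are likely just categories, templates, or table rows.
def triage(edits, cutoff=200):
    skip = [title for title, added in edits if added < cutoff]
    check = [title for title, added in edits if added >= cutoff]
    return skip, check

# Hypothetical example data, not taken from the actual CCI listing.
edits = [("Some medal table", 96), ("Rocío Ríos", 1450), ("Some stub", 180)]
skip, check = triage(edits)
# skip holds the two small edits; check holds the one large edit
```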

200kb is the size of the entire ANI page. Do you mean 200 bytes? I think that is too many (1 word = approx 6 bytes). 67.122.211.178 (talk) 20:16, 6 September 2010 (UTC)[reply]
My bad. I meant bytes. Fixed accordingly. 200b is around adding a few categories, a medal chart template, or a couple of rows to a table. fetch·comms 20:47, 6 September 2010 (UTC)[reply]
1 word = 6 bytes so 200b = 30+ words, which can be a pasted sentence. I would not want to accept anything that had more than 4 or 5 consecutive words added, where "consecutive" means e.g. not in separate table cells. 67.122.211.178 (talk) 22:09, 6 September 2010 (UTC)[reply]
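The "consecutive words" rule suggested here can be made mechanical. A rough sketch in Python, assuming the added text is available as a string; splitting at pipes and newlines keeps words in separate table cells from counting as one run:

```python
import re

# Flag fragments of added text longer than max_words consecutive words.
# Splitting at "||", "|", and newlines means table cells are counted
# separately, per the "consecutive" caveat above. Illustration only.
def long_runs(added_text, max_words=5):
    fragments = re.split(r"\|\||\||\n", added_text)
    return [f.strip() for f in fragments if len(f.split()) > max_words]

long_runs("| 1995 || Gijón || 2:28:20")  # table row: no fragment over 5 words
long_runs("She set her personal best in the classic distance")  # flagged
```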
Articles to which he has added below 100b are excluded from the listing as minor. If we add them back in, the number of articles we must check jumps from 23,197 to 41,108. --Moonriddengirl (talk) 22:51, 6 September 2010 (UTC)[reply]
What software are you using to find that? I've started fooling around with some code, but might be duplicating existing effort. 67.122.211.178 (talk) 04:10, 7 September 2010 (UTC)[reply]
It's our CCI tool; you can read about it and access it here. I'm afraid I know zilch about how it works...just that it does. :) --Moonriddengirl (talk) 10:40, 7 September 2010 (UTC)[reply]

Should we move this discussion?

Subpaged.

I begin to think we should move this to a newly created project page or set of pages (some such pages have already been created), or an RFC. Otherwise it is going to swamp ANI pretty soon. Part of the discussion should be about technical aspects of proposed bots and scripts, that would be too far in the weeds to clutter ANI with. 67.122.211.178 (talk) 22:48, 6 September 2010 (UTC)[reply]

Yes. It's already 1/16th of the page.--intelati(Call) 22:50, 6 September 2010 (UTC)[reply]

Breathe deeply

I'm somewhat aghast that the nuclear option, blowing up 10,000 articles by bot, is being discussed so cavalierly. Certainly, this is the worst, most extreme option — to be avoided unless absolutely necessary. Having reviewed Darius' talk page — both the current and past versions — I am struck by the fact that he does admit having made serious mistakes but has contended that the number of flagrant copyright violations is relatively small and that he has offered to help find them and liquidate them.

I wonder why the CorenWhatchamacallit Copyright Bot isn't run over each and every article to which Darius has contributed to flag copyright vios? That's what alerted us to the problem to begin with, did it not? Let the bot check for violations — subject everything to review.

I think the punishment meted out to this editor should be severe, but I don't see why the most draconian corrective measure is being discussed before all corrective options have been exhausted. Bot-check the works and blow up everything that comes back positive for copyright violations... Carrite (talk) 06:12, 7 September 2010 (UTC)[reply]

There seems to have been earlier discussion concluding that bot-checking doesn't help that much and we have to assume it's all tainted. If you are suggesting we rethink that notion, it's probably best to open a new section of this page for thoughts and comments. I personally don't feel very attached to articles with no substantial content contributions from anyone other than Darius. If those articles were so important, other people would have edited them too. I agree about not blowing up the ones with contributions from others. 67.122.211.178 (talk) 07:19, 7 September 2010 (UTC)[reply]
This is not in any way, shape or form, punishment: this is dealing with copyright violation on an unusually vast and barely manageable scale. There is a proposal to blank affected articles with an explanatory note (with variations on the theme), which deals with the problem immediately, allowing anyone interested in the article to deal with the problem. If anything, I have a concern that spreading the copyvio checking so widely risks too many trickier cases being missed by people not normally involved in checking such things. Good instructions will mitigate that, but it's still a worry. Rd232 talk 09:05, 7 September 2010 (UTC)[reply]
Carrite, the thing is, 10'000 articles built upon a copyvio aren't ours to take and publish. They aren't even ours to modify: even if the text has been edited out later on to the point there is nothing left of the original, we have created an unauthorized derivative work. Those are not our articles, they're effectively someone else's, and we have no claim on them. There's nothing cavalier about that. MLauba (Talk) 09:06, 7 September 2010 (UTC)[reply]
I don't think the "10,000" number is anywhere close to accurate, unless Darius is lying through his teeth. Do we know that the problem is actually this vast? Nor do I have any problem whatsoever blowing up any article found with copyright violations. The question is this: how big is this problem, really? I would suggest that the punishment help mitigate the crime, that for the next six months Darius be limited to editing, with a new account, articles which he created and only articles which he created... With a view to eliminating copyright violations. His work can silently be checked "over his shoulder." At the end of that period, extreme scrutiny should be applied to remaining articles to see if the problem has been fixed or not, and the community can proceed from there based upon findings made at that time. Darius' previous account name should be locked out and a new account name initiated, with edits starting again from zero and no autoreviewed status for multiple years, in my estimation... Current thinking seems to be obsessed with making the problem instantly go away by mass deletion of the good, the bad, and the ugly via automation in one fell swoop. My suggestion is that the culprit be instructed to get to work for half a year fixing his own mess. Carrite (talk) 11:28, 7 September 2010 (UTC)[reply]
With all due respect, copyright violation is not like spelling mistakes; it's not something to be fixed when we get round to it. And you seem not to have heard me when I said this was not about punishment. Finally, there appears to be a consensus that the problem affects an enormous proportion of Darius' substantive prose edits. This is in no way, shape or form a minor issue. Rd232 talk 11:41, 7 September 2010 (UTC)[reply]
I'm not saying it's a minor issue and I'm not saying it shouldn't be immediately addressed. And I did hear you when you said this was not about punishment — and I argue that it should be about punishment, with the punishment being the immediate fixing of his mess by the culprit, bearing in mind that Rome wasn't built in a day and that it will take time to ferret out everything... Further, I challenge the assertion that any consensus can be drawn about the scope of the problem until it is systematically studied. See the random sampling below. Expand that process, let's look at this problem scientifically before we go nuclear on it. Carrite (talk) 12:04, 7 September 2010 (UTC)[reply]
Have you looked at the actual copyright investigation subpages? Beginning with Wikipedia:Contributor copyright investigations/Darius Dhlomo and moving through (there's a sidebar above that links to them all), articles that have been checked and cleared are marked with a red X, while articles wherein copyright problems have been found are marked with a green tick. The listing of five articles below is a significantly smaller sample than has already been evaluated. --Moonriddengirl (talk) 12:08, 7 September 2010 (UTC)[reply]
I quickly count 32 green checks and 159 red Xs = 16.75% violation rate. Carrite (talk) 12:19, 7 September 2010 (UTC)[reply]
That would be a much higher outcome than the hundreds I've predicted, but it's possible that contributors are zeroing in on more problematic areas. :/ --Moonriddengirl (talk) 12:28, 7 September 2010 (UTC)[reply]
On page 2 I figured out how to let the browser find function do the counting and came up with 7 violations and 100 clean pages = 6.5% violation rate; total now 39/298 = 13.08% violation rate. Carrite (talk) 12:33, 7 September 2010 (UTC)[reply]
On page 3 it's 5 bad, 98 good = 4.85% violation rate. I need to go back and check my first count mechanically and redo the arithmetic... It looks like a violation rate of under 10%... Carrite (talk) 12:38, 7 September 2010 (UTC)[reply]
% violation rate is meaningless as a proportion of all edits - the problem only applies to substantial prose edits. Most edits (including my sampling on page 3) are not substantial prose; they're adding infoboxes, categories, basic data and the like. Rd232 talk 12:53, 7 September 2010 (UTC)[reply]


Darius has no credibility on this issue; he assured us he had done this in no more than 15 articles. We had more than doubled that count before Uncle G stopped counting. Since his block, Darius has told us at his talk page that Fabián Roncero was fine; it isn't. He told us that Núria Camón is fine; it isn't. How are we supposed to trust him to identify his copyright violations, much less acknowledge them and address them? Even though I believe that there will probably be hundreds rather than thousands of articles that are a copyright problem by the time the investigation is done, there are still tens of thousands of articles that need review. Having somebody silently check over his shoulder only works when we know that he (a) can and (b) will accurately assist. --Moonriddengirl (talk) 11:36, 7 September 2010 (UTC)[reply]


(unindenting) I did a recount, manually counting numbers over 100 since the find feature only counts up to 100 on Safari. I found:

  • Page 1 — 37 violations, 168 clean pages
  • Page 2 — 7 violations, 212 clean pages
  • Page 3 — 5 violations, 98 clean pages
  • Page 4 — 0 violations, 12 clean pages = 49/539 = 9.09% violation rate

Most of the violations were on the first page. I'm not sure whether these things are chronological, whether those articles were being critiqued more harshly, or what was going on... The violation rate for the first page seems an anomaly compared to the next two. What is clear is that we are probably not talking about "10,000" copyright violation articles here, but some substantially lower number in the general range of 5-8% of Darius' total contributed pages. Carrite (talk) 13:05, 7 September 2010 (UTC)[reply]
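The per-page tallies above can be checked mechanically; recomputing in Python with the numbers as reported:

```python
# Tallies as reported in this thread: page -> (violations, clean pages).
pages = {1: (37, 168), 2: (7, 212), 3: (5, 98), 4: (0, 12)}

vios = sum(v for v, c in pages.values())       # 49 violations
total = sum(v + c for v, c in pages.values())  # 539 articles checked
rate = 100.0 * vios / total                    # about 9.09%
```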

So, basically, you're saying that you believe my estimated hundreds is low? You may be right. --Moonriddengirl (talk) 13:11, 7 September 2010 (UTC)[reply]
There are 13,542 articles in the queue... If you'll accept my premise that the 9.09% rate is somewhat inflated for some unknown reason (learning curve of the editor or more harsh judgment of inspectors) and that the actual rate falls in the 5-8% range, we are talking about between 677 and 1,083 articles with substantial issues, give or take. "Hundreds" is accurate. Carrite (talk) 13:18, 7 September 2010 (UTC)[reply]
I believe that User:VernoWhitney has indicated that the numbering in the queue is inaccurate. We have had some difficulties with putting together the listing because of its scope and the fact that initially we tried to isolate only articles he had created. There are over 23,000 articles listed by our CCI program excluding reverts and minor edits, which makes 5% something on the order of 1,150. --Moonriddengirl (talk) 13:23, 7 September 2010 (UTC)[reply]
(Oh, I just noted that above you mentioned not being sure if these things are chronological: they are not. They're listed by size of total contributions, beginning with greater. I would expect more problems in the front end. That's usually the way it goes. --Moonriddengirl (talk) 13:25, 7 September 2010 (UTC))[reply]
There are 23,197 total articles, pages 1-10 are articles they created and then the numbering (and order by size) restarts on page 11 for articles they didn't create and just edited. VernoWhitney (talk) 13:36, 7 September 2010 (UTC)[reply]
Thanks. :) More unusual than I knew! --Moonriddengirl (talk) 13:39, 7 September 2010 (UTC)[reply]

(unindenting) Okay, so we're seeing a much higher copyright violation rate with long contributions vs. stubs — is that a fair summary? Numbers 1-1000 in size, maybe something like 1 out of 5 of those are defective, whereas the copyright violation incidence rate falls to what might be considered "normal" levels with shorter contributions (has the question of copyright violation across random WP articles ever been studied? Four or five percent of articles having "problems" would be a pretty reasonable guess, I'd think...).

Anyway, what seems to need to be done is a high-priority manual check of the top 1000 or so original articles as well as the top 1000 or so content contributions to already-established articles, with maybe some sort of cursory bot-checking of the remaining short and stub articles. Is that a reasonable perspective? Carrite (talk) 16:18, 7 September 2010 (UTC)[reply]

Yes, and quite common for CCIs. I'm not sure what you mean by "normal" levels, though. I don't know if anybody's ever done a random copyvio study of Wikipedia articles. I would kind of hope not, as I'd rather see anybody with that kind of time on their hands trying to help clean them up. :) Again, this is one of dozens of CCIs. We've got more articles than I want to count waiting for review. I don't know that it's reasonable to limit review to 2,000 out of the 23,000+ articles that he's done non-minor edits to. By "top" I assume you mean contribution length: a lot depends on the pattern of the CCI subject. He's got a lot of table and list contribs. Those are high in volume, but low in risk. A single paragraph of creative text from him would worry me a whole lot more than his most prolific contrib, 1982 in athletics (track and field) (already cleared). I see that copied content has already been found in article #9514 of the articles he's created. Limiting our checks to the top 1,000 of his articles would stop well short of that. If a bot didn't detect it, the copied content would remain. --Moonriddengirl (talk) 17:25, 7 September 2010 (UTC)[reply]

In-depth study of small random sample

I picked 5 random Darius-created articles (using a random number generator) with the idea of investigating them carefully, to gather knowledge that can be used about the rest:

Any help would be appreciated. I'll post my findings here. 67.122.211.178 (talk) 07:00, 7 September 2010 (UTC)[reply]

I think this is a very good way to start, although a bigger sample (say 50 articles) would be helpful. If the copyright violations are chronic, more extreme measures to correct them are implied, whereas if the copyright violations are occasional and sporadic, less draconian measures will probably suffice. Carrite (talk) 11:57, 7 September 2010 (UTC)[reply]
I hope we can spend at least a person-hour on each of the five, whether we find anything or not. We probably can't do that with 50 articles. This is supposed to be a small set of careful investigations aimed at identifying non-obvious problems and figuring out Darius's methods, not a statistical spot check to guess the overall violation frequency. If you want, I can generate a random set of 50 for spot-checking, but that's a different goal. 75.57.241.73 (talk) 20:37, 7 September 2010 (UTC)[reply]

This article is about a Spanish marathon runner. It is predated by a German wikipedia article (linked by interwiki), de:Rocío Ríos, and a Spanish one, es:Rocío Ríos. It mentions:

A resident of Gijón Ríos set her personal best (2:28:20) in the classic distance on October 15, 1995 in San Sebastián.

That "personal best" phrasing sounded a little bit formulaic so I googled it [3] and found a bunch of other articles created by Darius. One of them, Paula Fudge, was created a few months ago and gave a similar description of Fudge's personal best time, but that info wasn't in either of the references cited in the Paula Fudge article, so I asked on Darius's talkpage where the info originally came from. (Rocío Ríos's personal best time is mentioned in her iaaf.org profile in tabular form). There are a bunch of these biographies whose creation was spread out over time, making me wonder if there is a common source like a sports almanac or something like that. I hope Darius gives an answer. 67.122.211.178 (talk) 07:11, 7 September 2010 (UTC)[reply]

  • Most of Darius' articles use that "standard" wording. What struck me at first was the grammatical difference: some articles use commas correctly, but this one (and only a few others that I looked over) is missing a comma in phrases such as "A resident of Gijón Ríos set her personal best". Could this have been translated from a foreign-language source, or maybe it's his own writing, but he's just not a very good writer? I think that an almanac might be possible; did he create this alphabetically or by any other pattern? Usually, one would create all the stubs at once from a single source. I have a feeling, though, that he just started using boilerplate text from other articles (or that he first wrote) and made the stubs all read alike, but no copyvios.
    • I never considered the translation point until the interwikis were mentioned. See es translation and de translation: the en version seems like a copyedited translation of the es one. The wording is very close (note the "She is a four-time national champion in the 10,000 metres (1992, 1993, 1996, and 1997), and a three-time national champion in the half marathon (1992, 1994, and 1995)." bit and the combination of the "A resident of Gijón Ríos set her personal best (2:28:20)" bits from eswp). I think that many of these articles are trimmed translations based on some sort of boilerplate organization and presentation of key facts like records and victories, which seems especially likely if he is uncomfortable writing the article (even based off a translation) by himself. fetch·comms 00:38, 8 September 2010 (UTC)[reply]
      • The es article doesn't look anything like the en one to my eyes. The comma thing just seems like carelessness. Because of the Hara match above, I think the Rios article (and others like it) came from pages that were once on the IAAF site but currently aren't showing up. The IAAF search function is still (as of last night) not working either. Re interwiki, I'm actually more concerned about Darius's older articles getting translated/copied to other wikipedias, than the other way around. 75.57.241.73 (talk) 02:50, 8 September 2010 (UTC)[reply]

Again uses boilerplate text.[7] Gives a citation to a possible print source, "HistoFINA (Volume II, 1908-2001)". Later volumes of HistoFINA are online[8] but volume II is supposedly out of print ([9] p. 2).

The original author, Jean-Louis Meuret, is deceased (2008).[10]

75.57.241.73 (talk) 19:37, 7 September 2010 (UTC)[reply]

Worldcat lists only one library with the book, Universität Bern (OCLC 603448771). So Darius probably got the info online. 75.57.241.73 (talk) 21:51, 7 September 2010 (UTC)[reply]

Actually, there are several other worldcat records that could use merging[11] but this is still a very uncommon book. 75.57.241.73 (talk) 21:59, 7 September 2010 (UTC)[reply]

Seems fairly common wording to me. Not sure about the foundation of the charts; probably standard WP layout anyway. I don't think there's much chance of finding a cv in any of the "[Sport] at [Event]" articles; there's like two commonly-worded sentences and a list of results. fetch·comms 00:41, 8 September 2010 (UTC)[reply]

Unsourced stub which seems to be copyright clean. Carrite (talk) 11:53, 7 September 2010 (UTC)[reply]

I'm suspicious of the boilerplate phrasing "retired female volleyball player"[12]. Where did the data come from anyway? Will try to find a more recently created one. 75.57.241.73 (talk) 18:31, 7 September 2010 (UTC)[reply]

That seems like "standard" WP wording as wanted by the MOS. There's not really any other way to say it; for a nonretired player, we'd say "X is a female volleyball player from the United States" as well. Very little chance this is a cv, IMO. fetch·comms 00:43, 8 September 2010 (UTC)[reply]
What I want to find is where the data came from. This is something like the IAAF example further up. 75.57.241.73 (talk) 02:52, 8 September 2010 (UTC)[reply]

I ran Google searches on five substantial fragments of the article and they all came back clean to Wikipedia or to sources which seem to have drawn from Wikipedia. I made no effort to investigate the statistics in the sidebar box, but this article seems to pass copyright muster. Carrite (talk) 11:44, 7 September 2010 (UTC)[reply]

I don't particularly care about research "methods," the important thing is to identify and eliminate copyright violations. It seems that long blocks of prose are the greatest risk and from the two short pieces I looked at, there's no apparent issue. Darius absolutely did NOT rip off everything, it's just a question of quantifying the probable number of problem articles (700 to 1100 a reasonable guess range), figuring out how to find them expeditiously, and liquidating the problem. Carrite (talk) 02:25, 8 September 2010 (UTC)[reply]
I don't think we can "liquidate" the problem until we understand it. I don't think we can understand it without an analysis like this. 75.57.241.73 (talk) 02:43, 8 September 2010 (UTC)[reply]

Sample size

My back-of-the-envelope estimate, based upon the number of copyright violations found versus the number of articles that I looked at, was that around 10% of articles, just over a thousand, will turn out to be copyright violations. A sample size of five isn't nearly enough. Have ten articles to look at. (Even that's not enough.) Uncle G (talk) 15:57, 7 September 2010 (UTC)[reply]

Were those selected uniformly at random from the whole set of Darius-created articles? If yes, it's surprising that they're all biographies. Anyway the point of selecting the 5 articles wasn't to find vios per se, but to examine them carefully to see if anything could be learned about Darius's methods. So I'd rather keep examining the original 5 for a while before expanding the set. 75.57.241.73 (talk) 18:23, 7 September 2010 (UTC)[reply]
Steve Spence
Leonard Nitz
  • Green tick: cv of [13]. All the copied content was added word-for-word by Darius as his second edit; the original stub seems to have gotten all the info from the cv source. I have stubbified the article for now, as the original stuff doesn't seem to be a vio. Darius was clearly getting his info from a source, writing a quick stub, and pasting in the rest a few days later, without listing the offending link as a source. fetch·comms 00:59, 8 September 2010 (UTC)[reply]
Steffen Radochla
Mark Gorski
Gerrit de Vries (cyclist)
Lauren Hewitt
Japhet Kosgei
Lee Naylor (athlete)
George Mofokeng (athlete)
Gert Thys

Technical question

Couple of things; one: has any of the triage stuff been listed (i.e., eliminating edits where he only touched categories, etc.)? Two: can someone make a quick list of all articles he made (just page 5 of the CCI for now) where he made at least one edit after initial creation that added more than 500b to an article (ignoring category-only edits, etc., preferably)? This may help to see if he often created a stub first, then pasted in a couple of paragraphs later. If this is not technically feasible, I'll just keep looking manually. fetch·comms 01:18, 8 September 2010 (UTC)[reply]

1) slightly complicated but doable, I've just been juggling other things. 2) easier, I'll see if I can bang it out. 75.57.241.73 (talk) 02:19, 8 September 2010 (UTC)[reply]
It looks like the CCI report already has this info (the additions are broken out as separate edits). What is it that you're asking for that's not already there? Note the threshold is probably more like 100b than 500b. Even 3-word snippets like "retired female swimmer" are enough to pick up some vios. 75.57.241.73 (talk) 02:24, 8 September 2010 (UTC)[reply]
I want a list of the pages where at least one of the later additions was over 500b (removing the lesser ones). I know it's not perfect; just searching for a possible pattern, seeing how much he may have lifted at once, and then narrow it down from there. If possible, just run through the existing CCI page and get a list of the articles with a diff that says more than 500. fetch·comms 02:47, 8 September 2010 (UTC)[reply]
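If the CCI report can be parsed into per-article edit sizes, the filter requested here is short. The input shape and example data below are assumed purely for illustration:

```python
# Keep articles where some edit *after* the creation edit added more than
# `threshold` bytes (the stub-then-paste pattern). `history` maps article
# title -> bytes added per successive edit; hypothetical example data.
def stub_then_paste(history, threshold=500):
    return [title for title, sizes in history.items()
            if any(added > threshold for added in sizes[1:])]

history = {
    "Example stub A": [350, 1800],   # small stub, big later addition
    "Example stub B": [2200],        # everything in the first edit
    "Example stub C": [400, 120],    # only minor follow-ups
}
# stub_then_paste(history) keeps only "Example stub A"
```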
Lifting a three word phrase like "retired female swimmer" isn't technically a copyright violation, although it might be a pointer to a section of lifted text. Carrite (talk) 02:29, 8 September 2010 (UTC)[reply]
Right, finding the same phrase in dozens of articles can point to a common source. 75.57.241.73 (talk) 02:30, 8 September 2010 (UTC)[reply]
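A scripted version of the shared-phrase idea would count word n-grams across article texts; a phrase found in several articles is a candidate pointer to a common source (or to harmless boilerplate, so hits still need a human look). Purely illustrative:

```python
from collections import Counter

def shared_ngrams(texts, n=3, min_articles=2):
    """Return word n-grams occurring in at least min_articles texts."""
    seen = Counter()
    for text in texts:
        words = text.lower().split()
        # a set, so each text counts a given n-gram at most once
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        seen.update(grams)
    return sorted(" ".join(g) for g, c in seen.items() if c >= min_articles)

texts = ["Ann is a retired female swimmer from Spain",
         "Bea is a retired female volleyball player"]
shared_ngrams(texts)  # includes "a retired female" and "is a retired"
```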
Fetchcomms, it looks like articles with that pattern (contribution > 500b on other than the first edit) on page 5 are easy to spot by eyeballing. Do you want something more than that? 75.57.241.73 (talk) 03:06, 8 September 2010 (UTC)[reply]
Meh, alright. I'm probably just getting lazy :P. Working on them now... until I sleep. fetch·comms 03:18, 8 September 2010 (UTC)[reply]
Source to watch out for: I found two copyvios so far (he slightly changed it by merging some sentences and reordering them) from http://www.hockey.org.au/index.php[15] and [16]. Both were in the original creation, it seems, so for field hockey articles, that seems like a source he may have used repeatedly. Is there a way to list all the articles he created that are part of Category:Australian field hockey players, so we can check against this site? fetch·comms 03:42, 8 September 2010 (UTC)[reply]
Yeah, give me a few minutes. 75.57.241.73 (talk) 04:10, 8 September 2010 (UTC)[reply]
They are below, feel free to uncollapse the list or move it. I can add links to the 1st rev of each article if that is useful. 75.57.241.73 (talk) 05:04, 8 September 2010 (UTC)[reply]

Fetchcomms, please read Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help. This has already been written down. Uncle G (talk) 11:09, 8 September 2010 (UTC)[reply]

  • I assume you mean his strategy (I didn't find any links there). That page is very helpful, though hopefully we can add more info as we progress. fetch·comms 12:50, 8 September 2010 (UTC)[reply]
    • Yes, the strategy. I've already invited everyone to be bold in adding to and improving that page. If you want a list of the created articles, simply go back in the edit history of the first two CCI list pages to before the list was sorted and revamped.

      I've reviewed a few hundred of the biographies. The common creation strategy was for the whole text to be in the first edit. But sometimes there are a few later edits fixing copy and paste errors. As I wrote above, a productive triage approach is to first go back to the latest revision by Darius Dhlomo before someone else (aside from a 'bot) touched the article, and read that. Check the foundation content first, in other words. Uncle G (talk) 13:18, 8 September 2010 (UTC)[reply]

More stratagems that you can incorporate into Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help (be bold!):

  • Articles that cite the Beach Volleyball Database have uniformly proven to be taken from the biographies there. The BVB pages aren't datestamped, unfortunately. But Moonriddengirl and I did some checking with the Wayback Machine to check the relative dates.
  • If an article cites a "profile" somewhere, it's quite productive to check that profile first. Unfortunately, some profiles pointed to have been removed from the WWW in the intervening years. The Wayback Machine is of some help, here.
  • As discussed above, if an article cites an IAAF profile, there's a likelihood that the prose came from another article somewhere else on the IAAF site. (I originally skipped a lot of articles that cited IAAF profiles, because I wasn't aware of the other pages.)

Uncle G (talk) 13:32, 8 September 2010 (UTC)[reply]

Field hockey players

Articles from Category:Australian field hockey players created by Darius, per Fetchcomms's request.

collapsed list of 92 players

75.57.241.73 (talk) 04:40, 8 September 2010 (UTC)[reply]

Temporary list of known used sources for Australian field hockey players

I'm just using this for myself right now, but a central list of sources he copies from frequently could be useful in identifying some vios, as Google does not show all of these on the first page or two. fetch·comms 15:44, 8 September 2010 (UTC)[reply]

Have we alerted the appropriate Wikiprojects and enlisted their help?

As it stands, a casual survey of the articles in question seems to limit the fields to mostly athletics and specific sports within them. Have the appropriate Wikiprojects been contacted and enlisted to try to help? I could see a bot option that is being suggested below to be very effective if there are dedicated members of the affected projects getting involved to help clean up stuff, using a coordination page to drop admin help requests when needed. --MASEM (t) 16:03, 8 September 2010 (UTC)[reply]

Notice of the CCI has been given to WikiProject Athletics and WikiProject Olympics. Some projects are very responsive to these. This CCI's been a little unusual; I don't know if the people who've expressed interest in helping have been pointed to this discussion. It seems like it would be good to link to this discussion from the CCI page; I'll do that now. --Moonriddengirl (talk) 16:09, 8 September 2010 (UTC)[reply]

Implementing bot?

At this point, I propose that we go ahead with the following, based on discussions above:

The advantages of this: we pull the content from publication immediately, and we invite the wider community to help with cleanup. This could be the most efficient means of addressing a CCI ever, and it may not linger for more than a year as some of our others have done. There is a substantial risk that some of these tags will simply be removed by users who don't care about copyright. I see this routinely at WP:CP. We try to address this at WP:CCI by requiring that only those who have themselves had no copyright issues assist, but this isn't foolproof.

Still to be determined: what then? At what point do we go through the ones still tagged?

Thoughts? --Moonriddengirl (talk) 12:55, 8 September 2010 (UTC)[reply]

Go for it. After that, just keep checking manually, I guess. fetch·comms 13:25, 8 September 2010 (UTC)[reply]
  • I realize I'm in the minority by now, so I have to say this--not to get into further debate about it but just to indicate that there are still some of us who feel this way--but I still favor the mass deletion approach over any of these schemes for sucking up massive amounts of community effort cleaning up Darius's mess (plus exposing everyone who touches any of those articles to potential legal liability). The articles aren't for the most part really articles at all. They're more like database dumps that Darius vacuumed from various places into WP article space. None of them are written from secondary sources as our articles are supposed to be. Yes I know some of them are about legitimately notable people. I just don't feel any sense of tragedy that there might exist a notable person someplace in the world who is temporarily not the subject of a WP article, at least til somebody else gets around to writing a real one with real sourcing.
  • That said, I wonder if we could deploy some additional automation to help with cleanup. Is there some kind of script around that integrates all the diffs from when an article was created, in order to highlight all the text in the last revision that was originally put in by a particular editor (i.e. Darius)? I can probably write one, but it would surprise me if it hasn't been done already. On the other hand it wouldn't be perfectly accurate. (Update: someone at refdesk mentions User:Cacycle/wikEdDiff which is not what I had in mind, but looks interesting anyway).
  • Do you want any additional filtering or processing of the 23000 articles? Like maybe for articles created by people other than Darius, instead of blanking the whole article, the bot could revert to just before Darius's first edit to it. We'd write a different template for articles that got reverted but not blanked. Also I can still attempt some of the triage stuff discussed above, like noticing category-only edits with a script (I just have RL things to do as well). 75.57.241.73 (talk) 13:37, 8 September 2010 (UTC)[reply]
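One way to approximate the "which surviving text did Darius add" script described above: walk the revisions oldest to newest, diff each against the previous one with Python's standard difflib, attribute each added line to that revision's author, and keep only attributions that survive into the latest revision. This is a sketch only; it is line-based and imperfect (moved or lightly edited lines get re-attributed, as the commenter anticipated), and fetching the revision texts from the API is omitted:

```python
import difflib

def surviving_text_by(revisions, editor):
    """revisions: list of (author, full_text) pairs, oldest first.
    Returns the lines of the final revision first introduced by
    `editor` and still present. Approximate: line-based diffing
    re-attributes moved or modified lines."""
    owner = {}        # line index in current revision -> original author
    prev_lines = []
    for author, text in revisions:
        lines = text.splitlines()
        sm = difflib.SequenceMatcher(a=prev_lines, b=lines, autojunk=False)
        new_owner = {}
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":
                # unchanged lines keep their original attribution
                for di, dj in zip(range(i1, i2), range(j1, j2)):
                    new_owner[dj] = owner.get(di, author)
            else:
                # inserted or replaced lines belong to this revision's author
                for dj in range(j1, j2):
                    new_owner[dj] = author
        owner, prev_lines = new_owner, lines
    return [prev_lines[i] for i, a in sorted(owner.items()) if a == editor]
```

Run over an article's full history, this would flag exactly the passages that need a copyvio check, rather than the whole article.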
  • I completely understand favoring the mass deletion approach. I was leaning that way myself when we started. But so many people have been putting their time into this already, and I do not wish to devalue their efforts. Too, this approach has some exciting prospects for future cases like this. Getting assistance at CCI is a challenge; most of them involve thousands of articles, though few are this scale. If we find that this approach actually works, then it may be useful for other similar CCIs down the road...a way to encourage involvement from those members of the community who actually do view these articles. If this leads to finding a new, viable system for these, we might not have dozens of open CCIs with probably hundreds of thousands of articles cumulatively waiting for view. (oi)
  • I have no idea what automation can do. I'm technologically in the school of "challenged by using my remote control." I don't know of any script that integrates the diffs or how we might process it to automatically revert back to the pre-Darius version, but if those things are possible, they might be good approaches. I already have a notice for talk pages about rolling back CCI articles: User:Moonriddengirl/CCIr. I only use it when there is evidence of copying, but it could be easily modified to this situation.
  • Do you write scripts? There are several ideas I have for copyright cleanup tools that I would love to see in the works. If you do and you're up for it, come by my talk page. :D (Note, though, that I am technologically clueless. I never know if my ideas are in the realm of "easily accomplished" or "needs a magic wand.") --Moonriddengirl (talk) 14:17, 8 September 2010 (UTC)[reply]
  • The bit about scripts is still very useful, as we should probably focus first on the articles he created, which have a greater likelihood of diffs containing vios rather than just extra categories. fetch·comms 14:55, 8 September 2010 (UTC)[reply]

75.57.241.73, blanking the articles now does not preclude deleting them later, if we find that in six months we still have ten thousand blanked articles. But going straight to deletion immediately precludes any other approach. (You would have to get someone else to volunteer to do that, in any event. None of my 'bot accounts have sysop privileges, and I'm not going to do 'bot edits with this account.)

I reiterate my request for everyone to please boldly fix anything in Wikipedia talk:Contributor copyright investigations/Darius Dhlomo/How to help, Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation, and User:Moonriddengirl/CCIdf — which latter I suggest reside somewhere like Wikipedia:Contributor copyright investigations/Darius Dhlomo/Article notice or in the Template: namespace (if we aren't unhappy about Wikipedia mirrors showing the same notice). If we're going to do this, I want those all thoroughly reviewed beforehand. Uncle G (talk) 15:25, 8 September 2010 (UTC)[reply]

What should we do about translations of these copyvios in other languages? Is there a way to get a list of all interwikis created after Darius's versions here? If the en versions are determined to be vios, the other language ones need to go as well. fetch·comms 15:39, 8 September 2010 (UTC)[reply]
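The first step of that check can be automated with the MediaWiki API: the langlinks property lists every interwiki an article carries. A sketch of building the query URL (determining when each translation was created, and whether it postdates Darius's text, would need a follow-up revisions query on each target wiki):

```python
from urllib.parse import urlencode

def langlinks_query_url(title, lang="en"):
    """Build a MediaWiki API URL listing the interwiki (language)
    links of `title` -- the starting point for finding translations
    that may carry the same copied text."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "500",
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)
```

A script could feed the 23,000 titles through this, collect the targets, and post per-wiki lists for local editors to review.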

FWIW, from the few articles I've looked at, I've seen the following patterns:

  • Articles of the type "[insert name of country] at the [insert year] [insert competition]" are quite frequently little more than a bit of boilerplate at the top, a formatted list in the middle, & the expected stuff at the end. Probably best examined by hand in case he added any further text -- which is likely a copyvio. (And the subject name does vary a bit.)
  • Same for individual sports at the Olympics, Pan-American games, etc. Same treatment.
  • Biographical articles that are not minimal stubs seem to routinely have copyvio material in them. If we can determine the cut-off size for these -- where the visible text is more than "X [insert birth & death information] is a [insert country] [insert athletic specialty]. He/she was active [insert length of career]" -- those could either be safely deleted or stubbified.
  • My experience confirms Uncle G's estimate that around 10% of his articles are copyright violations; the rest are simply stubs. Deleting the 90% of acceptable -- & likely useful -- articles just to purge this poisonous share is overkill -- unless one believes all stubs are potential maintenance problems & should be deleted. (Not arguing for or against that opinion, but I suspect it is a motivation of some of those who favor mass deletion.) -- llywrch (talk) 16:02, 8 September 2010 (UTC)[reply]
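The cut-off-size triage llywrch describes could be scripted: strip the structured parts of the wikitext (templates, tables, categories, headings) and measure the prose that remains. A rough sketch, with the 300-character cutoff as an arbitrary placeholder to be tuned against hand-reviewed examples:

```python
import re

def triage(wikitext, prose_cutoff=300):
    """Classify an article as a likely-safe minimal stub or as needing
    a copyvio check, based on how much free prose it contains.
    The cutoff is a placeholder, not a calibrated value."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)       # templates (non-nested)
    text = re.sub(r"\{\|.*?\|\}", "", text, flags=re.S)  # wikitables
    text = re.sub(r"\[\[Category:[^\]]*\]\]", "", text)  # categories
    text = re.sub(r"==+[^=]+==+", "", text)              # section headings
    prose = " ".join(text.split())
    return "needs review" if len(prose) > prose_cutoff else "likely minimal stub"
```

This would let the bot (or reviewers) skip the boilerplate "X at the Y Games" lists and concentrate on the biographies with substantial running text, where the copyvios have actually been found.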
  • Do we need a bot to do this? According to WP:Administrator "The English Wikipedia has 1,755 administrators as of September 8, 2010." I know some are retired/inactive, but if each administrator looked over about 13 articles (23,000 ÷ 1,755), they would all be reviewed within a day. If the article is a copyright violation they can delete it. This would save all of the copyright violation free articles. Of course this would take quite a bit of organization, but it is just an idea. --Alpha Quadrant (talk) 17:27, 8 September 2010 (UTC)[reply]
  • If they would, we wouldn't. But this has been publicized at both admin noticeboards as well as plenty of points around Wikipedia, and so far we've got nowhere near full admin participation. I suspect we're not going to get even a tenth of them involved. --Moonriddengirl (talk) 17:34, 8 September 2010 (UTC)[reply]
  • If every admin wrote four FAs a year, we'd be well on our way to a better encyclopedia, but there's no way a thousand people will be coaxed into even reviewing one article. Whoever can help, please help. Otherwise, we can't ask any more of users who are already busy in RL. I've personally postponed some writing goals to work with this CCI because I take copyright very seriously, and I realize this is an understaffed area. But I can't speak for others' priorities. fetch·comms 18:11, 8 September 2010 (UTC)[reply]