Wikipedia:Administrators' noticeboard/Incidents/CCI/Archive 1
Original discussion from Wikipedia:Contributor copyright investigations
Review of unblock request and discussion of possible community ban
Darius Dhlomo (talk · contribs) is a prolific editor, with over 163,000 edits since 2005. However, he is currently blocked for multiple copyright violations, and is requesting an unblock. A few different admins have looked at his case, but I believe a community consensus is required, as I believe a permanent ban is possibly warranted. Darius has a long history of ignoring other editors; perusing his talk page history you'll see that many editors have tried to engage him in discussion about some questionable edits over the years, but Darius has never bothered to reply, save for the occasional section or page blanking. Only now, upon facing an indefinite block, does he appear to show the slightest bit of understanding or remorse over his actions, although the warnings have been given to him for several years. While much of his work is commendable, he does appear to want to work in a vacuum, ignoring not just policy but also conventions and consensus that he doesn't agree with. (Note that only about 1 out of every 900 edits he has ever made has been to talk pages!) I have never believed that "vested editors" should be given more leeway than anyone else.
I seek feedback from the wider community here about what to do with Darius Dhlomo. I do not believe he should be unblocked without a more thorough review of his editing history, and not just based on his current talkpage expression of remorse. — Andrwsc (talk · contribs) 17:38, 4 September 2010 (UTC)
- I'm not sure there is any need for this. He is blocked, and I have just declined his latest unblock request and directed him to consider WP:OFFER instead. Beeblebrox (talk) 17:44, 4 September 2010 (UTC)
- Fair enough, but there were comments such as "I'm verging on unblocking" so I wanted to make sure there was actually consensus to do so (or not) instead of the decision being made by a single admin. — Andrwsc (talk · contribs) 20:41, 4 September 2010 (UTC)
- Comment: not familiar with this case, but the way he repeatedly put "damage" in scare quotes in unblock requests makes me disinclined to give him another chance, at least without successful compliance with drastic restrictions (eg 3-month ban on article edits - talk pages only). Rd232 talk 17:45, 4 September 2010 (UTC)
- JamesBWatson sums up pretty well at user talk. I wouldn't support an unblock, per Beeblebrox and Rd232 above. --John (talk) 17:50, 4 September 2010 (UTC)
- Comment Why in the world is "Community ban" in this? The editor doesn't get it ... there's no need to jump to that level of drama. As an admin who declined unblock once ... and have actually tried to help them, and even at one point was prepared to support an unblock (but not anymore), as James has said, everything he types just makes it worse. WP:OFFER (talk→ BWilkins ←track) 18:11, 4 September 2010 (UTC)
- Let me add: thanks to NW for removing autoreviewer from them. How dumb of us not to remove it when the first copyright issues arose years ago then never stopped. (talk→ BWilkins ←track) 19:40, 4 September 2010 (UTC)
- Comment At present I haven't got much to add to what I have written on the user's talk page. However, I will just say that I don't think the user should be unblocked, but that I don't see any reason for a community ban. Wikipedia:Standard offer has been presented, and I would leave it at that for now. JamesBWatson (talk) 18:30, 4 September 2010 (UTC)
- See no need for a community ban, he's already rightfully indefinitely blocked, properly referred to WP:OFFER and should be blocked from editing his talk page if he keeps asking to be unblocked. --WGFinley (talk) 15:12, 5 September 2010 (UTC)
Scale of the problem
What's the scale of the copyright problem here? I've identified these so far: Sammy Korir (2006) Joetta Clark (2007) Canyon Ceman (2008, notice removed by this same editor and still a copyright violation right now), and Phil McMullen (athlete) (2010). I am unable to determine why the 'bot thought that Kamil Damašek was a copyright violation. Is this all that there is? Uncle G (talk) 20:29, 4 September 2010 (UTC)
- We don't know yet. Within the past five weeks, a contributor has removed content from Julie Isphordin (2010), Jerome Drayton (2006), Ron Tabb (2010), Steve Spence (2007), Nina Kuscsik (2009), and Maurcie Damilano (2010). There's a long way to go before we know the scale here. --Moonriddengirl (talk) 21:21, 4 September 2010 (UTC)
- Randomly jumping around his contribs, I've just found four non-vios and one pretty egregious one, from 2006, content still in the article as of today: Jill Sterkel. We've got a CCI going here at Wikipedia:Contributor copyright investigations/Darius Dhlomo, but we haven't completed listing yet. Contribs of this scale challenge our system. --Moonriddengirl (talk) 21:36, 4 September 2010 (UTC)
- Well just this evening I have started going through the CCI and because I haven't done this before I thought I would start on the ones that seem to be mainly a recitation of results. I have already flagged one definite copyvio and have just come across 3 more articles this, this and this which I am fairly sure are copyvios too. That's four more probables from a fairly small sample. Boissière (talk) 21:50, 4 September 2010 (UTC)
- I had my doubts about those edits. I'm very aware of the IAAF's style and the odd full caps name of NOOL was a giveaway. Copied sources for those three articles are here, here and here. Stuff like this is often beyond the realms of search engines, but its location is second nature for perennial time-wasters such as myself. Sillyfolkboy (talk) (edits)Join WikiProject Athletics! 00:57, 5 September 2010 (UTC)
- Good going, both of you! (Can we have you full time? :D) So there can be no doubt of backwards copying, I've checked archives of each of those sources. He did indeed paste the content onto Wikipedia. He assures us at his talk page that he's done this in "No more than fifteen" articles. This is 15 right here. It would be nice to think that will be all, but I am not optimistic. --Moonriddengirl (talk) 01:08, 5 September 2010 (UTC)
- Make that 16. I've just found Ben Plucknett, which was swiped wholesale from the New York Times. Uncle G (talk) 01:48, 5 September 2010 (UTC)
- 17: Eileen Coparropa swiped wholesale from The Panama News. Uncle G (talk) 01:58, 5 September 2010 (UTC)
- 18: Ragnhild Hveger nicked from an ISHOF profile. Uncle G (talk) 02:11, 5 September 2010 (UTC)
- 19: Marina van der Merwe plagiarized from a Coaching Association of Canada press release, of all things. Uncle G (talk) 02:15, 5 September 2010 (UTC)
- 20: Edward Liddie lifted wholesale from the 2005 World Judo Championship press pack. Uncle G (talk) 02:22, 5 September 2010 (UTC)
- 21: Alison Forman half-inched from a 2000 ABC News profile for the 2000 Olympics. Uncle G (talk) 02:26, 5 September 2010 (UTC)
- 22: Francisco Hervás nabbed from 2003 FIVB coach profile. Uncle G (talk) 02:34, 5 September 2010 (UTC)
- 23: Bert Goedkoop hooked from another 2003 FIVB coach profile. Uncle G (talk) 02:38, 5 September 2010 (UTC)
- 24: Jeff Stork ripped from a profile in The Washington Post. Uncle G (talk) 02:41, 5 September 2010 (UTC)
- 25: Pete McArdle janked from a 1985 obituary in the New York Times. Uncle G (talk) 02:48, 5 September 2010 (UTC)
- 26: Vera Nikolić clipped from a profile by Belgrade Marathon Ltd. Uncle G (talk) 03:04, 5 September 2010 (UTC)
- 27: Lynda Blutreich copped from a profile by the Carolina Tar Heels hall of fame. Uncle G (talk) 03:10, 5 September 2010 (UTC)
- 28: Lionel Cox popped from a WWW republication of a 1984 book by Graeme Atkinson. Uncle G (talk) 03:26, 5 September 2010 (UTC)
- 29: Mike Fibbens ganked from a July 2000 Associated Press article. Uncle G (talk) 03:51, 5 September 2010 (UTC)
- 30: Jason Grimes filched from a profile by Maryland Track & Field. Uncle G (talk) 12:00, 5 September 2010 (UTC)
- 31: Caren Kemner swiped from a profile in The Washington Post. Uncle G (talk) 12:09, 5 September 2010 (UTC)
- 32: Rita Crockett blagged from a profile by CBS Sports. Uncle G (talk) 12:58, 5 September 2010 (UTC)
- 33: Philippe Blain misappropriated from another 2003 FIVB coach profile. Uncle G (talk) 13:43, 5 September 2010 (UTC)
- 34: Mikiyasu Tanaka purloined from another 2003 FIVB coach profile. Uncle G (talk) 14:01, 5 September 2010 (UTC)
- 35: Vassiliki Arvaniti pilfered from a Beach Volleyball Database biography. Uncle G (talk) 16:51, 5 September 2010 (UTC)
(de-indenting) Any talk of either an unblock or a community ban is premature until the CCI finishes. At first blush, though, this sure looks pretty grim. Nandesuka (talk) 23:15, 4 September 2010 (UTC)
- I've read much of Darius Dhlomo's editing in my time here. This may sound drastic, but I would say that any article creations with more than four sentences of prose are highly suspect. Any edits adding three or more sentences of prose are also suspect. Believe it or not, despite the edit count, I guarantee that you will not find many edits which fall within this description. Most of Darius' larger edits will be adding tables/templates etc but I believe a small minority of these will yield copyright violations. I expect that these violations will be confined to large edits to biographies and event results articles (like those linked above). Sillyfolkboy (talk) (edits)Join WikiProject Athletics! 01:43, 5 September 2010 (UTC)
- It doesn't sound drastic at all, unfortunately. I've just gone through about 40 biographies in the CCI list, and there's a definite pattern emerging. Uncle G (talk) 01:48, 5 September 2010 (UTC)
- When one sees a CCI listing section heading that says "articles 9661 through 9664", it's fairly daunting. Any and all help is most welcome. Uncle G (talk) 01:48, 5 September 2010 (UTC)
- Person has created almost 10000 articles and almost all are cruft. Why not launch a bot to delete them all. 67.122.211.178 (talk) 08:38, 5 September 2010 (UTC)
- They may seem like cruft to people not interested in the topic, but I can tell you that there are plenty of people who think it's worth writing about former European champions and world record holders such as Vera Nikolić. Sillyfolkboy (talk) (edits)Join WikiProject Athletics! 09:58, 5 September 2010 (UTC)
- If they were so important other people would have written about them. How about deleting all the ones that have less than 500 characters of content added by other editors. That would still get most of them, making the remaining problem a lot smaller. Any that are worth having will get recreated by someone else sooner or later. 67.122.211.178 (talk) 17:27, 5 September 2010 (UTC)
- "If they were so important other people would have written about them" is a completely bogus argument. GregorB (talk) 20:33, 10 September 2010 (UTC)
- I agree, this type of thinking simply does not apply to the topic areas people like Darius, Gregor and I work in. No one wrote a jot about the 1986 Goodwill Games (an event which was both large and American/Russia-centric) until June this year – this is just a hint of the number of other notable events which remain poorly covered (or not at all!) SFB/talk 21:37, 10 September 2010 (UTC)
- To be honest I tend to agree with the bot idea, or at least deleting on sight anything more than a few sentences. Frankly I think the community shouldn't use resources trying to save or even think twice about these articles when the problem is of such a scale... --Mkativerata (talk) 10:24, 5 September 2010 (UTC)
- Before I get started, let me verify: Results are not copyright violations? The events freely disseminate this information and once copied are public information. The short statement of facts at the top of the article: where it happened, when etc. are also not violations. Sarcasto (talk) 17:50, 16 September 2010 (UTC)
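Sillyfolkboy's rule of thumb earlier in this section (article creations with more than four sentences of prose are suspect) could be turned into a rough first-pass filter. The following is only a sketch under stated assumptions: the function names, the list of wikitext prefixes treated as "not prose", and the naive sentence splitter are all hypothetical and were not part of the discussion. It sorts articles for human review and cannot decide whether anything is a copyright violation.

```python
import re

# Hypothetical threshold taken from the rule of thumb in the discussion:
# creations with more than four sentences of prose are treated as suspect.
MAX_SAFE_SENTENCES = 4

# Assumption: lines starting with these markers are tables, templates,
# lists, headings or category links rather than flowing prose.
NON_PROSE_PREFIXES = ("{", "}", "|", "!", "*", "#", "=", "[[Category:")


def count_prose_sentences(wikitext: str) -> int:
    """Roughly count the sentences found on the prose lines of a wikitext string."""
    count = 0
    for line in wikitext.splitlines():
        line = line.strip()
        if not line or line.startswith(NON_PROSE_PREFIXES):
            continue
        # Naive split: sentence-ending punctuation followed by whitespace.
        # Abbreviations such as "U.S." will be over-counted.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", line) if s]
        count += len(sentences)
    return count


def is_suspect(wikitext: str) -> bool:
    """Flag text whose prose exceeds the rule-of-thumb sentence count."""
    return count_prose_sentences(wikitext) > MAX_SAFE_SENTENCES
```

A results-only stub with one introductory sentence and a table would pass this filter, which matches the risk-assessment framing used in the discussion: it narrows the list but proves nothing about any single article.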
Ban
Per the above, which appears to show that this person is a serial copyright violator, I propose a community ban. 160,000+ edits or not, we do not need these headaches. So what now, we will have to dig through all of his contributions looking for things he stole wholesale from others? Kindzmarauli (talk) 03:34, 5 September 2010 (UTC)
- Could we get him involved in cleaning up his own copyvios? 67.122.211.178 (talk) 07:23, 5 September 2010 (UTC)
- Given that they continually denied that there was any problem at all in the face of diffs, could we actually trust him to clean up his own copyvios? Normally I'm all for any help we can get with copyvios, but I'm not seeing any indication that they actually would (could?) help. VernoWhitney (talk) 12:29, 5 September 2010 (UTC)
- 67.122.211.178, I suggest that you try doing so. Go to User talk:Darius Dhlomo and participate in the discussion of that very thing, there. Uncle G (talk) 12:54, 5 September 2010 (UTC)
- He may be able to help identify sources, but my initial idea to have him review these to see which had more than a few sentences was obviously naive. I realized he was obscuring the problem, but I did not realize that he would obscure it so far as to assure us that this happened in no more than 15 articles, when this is so obviously not the case. We can't trust him to help identify issues. :/ --Moonriddengirl (talk) 13:49, 5 September 2010 (UTC)
- Well he doesn't (looking at his talk page) seem to be able to help with sources all that much either. I was really disappointed with his reply to my questions - instead he seems to be suggesting that it is the fault of the community for not giving him harsh enough warnings and he felt that, despite the ones he got, he was doing a "fine job". It's not as if the notices are unclear about policy :( --Errant [tmorton166] (chat!) 14:03, 5 September 2010 (UTC)
(outdent) So nobody supports the community ban proposal? Kindzmarauli (talk) 23:31, 6 September 2010 (UTC)
- I'd been waiting to see if he was going to be useful in evaluating his articles. Based on his talk page, it doesn't look like it. He has failed to identify several copyvios. At this point, I'm not sure who would be willing to lift his block. --Moonriddengirl (talk) 23:51, 6 September 2010 (UTC)
- Right now we have an active indef block, which seems fine for the moment. I don't think the "nobody gave a final warning or shorter block" is a valid excuse for this, but it's something for the admin/CC community to learn from. Users as unresponsive as Darius sometimes require LART to catch on to how serious their problems are. 67.122.211.178 (talk) 03:22, 7 September 2010 (UTC)
- Late to the party but I endorse a community ban but with a phased-in readmission to the community and only a partial ban from discussion pages. Full privileges should not be restored for at least 12 months and should be preceded by a significant period of supervised/mentored editing so he can show he is serious about creating original content. In any case, the only contributions he should be allowed to make in the next 12 months should be non-article space contributions that are relevant to this case or to the cleanup effort, such as lists of references which can help others write non-infringing content. In other words, his "block" should be lifted so he can edit discussion pages but his "ban" on editing article pages should remain for at least 12 months. I also recommend he be on some type of probation/parole for the first 12 months after his supervised editing period ends. In short, it will be a minimum of 2 1/2 years or so before he's back in good standing: 12 months of edits to relevant discussions only, 6+ months of supervised article editing, and 12 months of "we are watching you" parole. davidwr/(talk)/(contribs)/(e-mail) 14:04, 14 September 2010 (UTC)
Mass deletion: Give up and start over
- Contributor copyright investigations/Darius Dhlomo
- Contributor copyright investigations/Darius Dhlomo/Created articles list
- Contributor copyright investigations/Darius Dhlomo/How to help
- Contributor copyright investigations/Darius Dhlomo/Notice
- Contributor copyright investigations/Darius Dhlomo/Task explanation
- Contributor copyright investigations/Darius Dhlomo 1
- Contributor copyright investigations/Darius Dhlomo 10
- Contributor copyright investigations/Darius Dhlomo 11
- Contributor copyright investigations/Darius Dhlomo 12
- Contributor copyright investigations/Darius Dhlomo 13
- Contributor copyright investigations/Darius Dhlomo 14
- Contributor copyright investigations/Darius Dhlomo 15
- Contributor copyright investigations/Darius Dhlomo 16
- Contributor copyright investigations/Darius Dhlomo 17
- Contributor copyright investigations/Darius Dhlomo 18
- Contributor copyright investigations/Darius Dhlomo 19
- Contributor copyright investigations/Darius Dhlomo 2
- Contributor copyright investigations/Darius Dhlomo 20
- Contributor copyright investigations/Darius Dhlomo 21
- Contributor copyright investigations/Darius Dhlomo 22
- Contributor copyright investigations/Darius Dhlomo 23
- Contributor copyright investigations/Darius Dhlomo 24
- Contributor copyright investigations/Darius Dhlomo 3
- Contributor copyright investigations/Darius Dhlomo 4
- Contributor copyright investigations/Darius Dhlomo 5
- Contributor copyright investigations/Darius Dhlomo 6
- Contributor copyright investigations/Darius Dhlomo 7
- Contributor copyright investigations/Darius Dhlomo 8
- Contributor copyright investigations/Darius Dhlomo 9
This has been raised above, inline, but I wanted to call it out for discussion here. As near as we can tell, everything this person has written that is longer than a few sentences is a copyvio. This means that the well of articles he has created -- barring the ones that are simple lists of data -- are, quite simply, poisoned foundations upon which we're letting others build.
I propose that we delete every article this user has created, with an exception carved out for data-only articles like lists of winners. This will no doubt be upsetting to those who have worked on those articles since, but I don't see any other way to fairly respect the rights of the original content creators and abide by our own policies. It's very likely that there are others willing to step in and create replacement articles afresh, and I'd rather encourage that than continue to build atop a weak foundation. The task being asked of the CCI - verify the copyright status of over six thousand freaking articles - is, quite simply, beyond what anyone should be asked to do. It is a Sisyphean task. So, if you'll permit me another Hellenic analogy, let's cut the Gordian knot and start with a clean slate.
Comments? Nandesuka (talk) 14:18, 5 September 2010 (UTC)
- It's not a Sisyphean task. It's a Herculean task. It's the Augean Stables, to be precise. Uncle G (talk) 14:56, 5 September 2010 (UTC)
- I am loath to do this under ordinary circumstances, but these circumstances are not ordinary. In addition to the tens of thousands of articles at this CCI, we have dozens of other CCIs, some over a year old, with additional tens of thousands of articles that need review...and where copying is not this blatant. In this circumstance, I'd support mass deletion or at least reduction of articles to a one sentence stub. (Please note: that's what we did with the last comparable CCI, and it still took us a year.) --Moonriddengirl (talk) 14:21, 5 September 2010 (UTC)
- (I should note that the tens of thousands of articles to which I refer are not only the ones he's created, but the ones to which he's substantially contributed: hiding reverts and minor edits, that's 23,197 to be precise. --Moonriddengirl (talk) 14:29, 5 September 2010 (UTC))
- 23,197 edits or 23,197 individual articles? Uncle G (talk) 14:37, 5 September 2010 (UTC)
- Individual articles. Mind-boggling, I know, but to quote exactly: "23197 articles from timestamp 2005-11-09 06:15:32 UTC to timestamp 2010-08-30 22:03:25 UTC." I don't know how many edits that represents, but the first on his full contrib list shows 19 non-minor edits to a single article alone. --Moonriddengirl (talk) 14:46, 5 September 2010 (UTC)
- When I started going over the biographies some hours ago I started to get a grasp of the scale of the problem here, and I came to much the same conclusion based upon the evidence before me that you, Mkativerata, and Sillyfolkboy all apparently already have: Any flowing prose in an article created by this person was written by someone else. It was either written by a subsequent Wikipedia editor or plagiarized from somebody else's writing by Darius Dhlomo. I thought that the problem wasn't going to get larger.
However, that article count manages to do exactly that. My perspective on that is that it is of a similar scale as reviewing my contributions (under just this account, not my 'bots or before I had an account). I have, as Uncle G (talk · contribs), touched fewer pages, across all namespaces taken together, than that. Uncle G (talk) 15:23, 5 September 2010 (UTC)
- I'd prefer a different approach, of finding some way in which we can rapidly trim the current CCI listing. Then we can review what is left to see whether we still have an unmanageable problem. Is there some set of criteria that we can mechanistically apply to rapidly eliminate the hundreds of 1-paragraph pretty much data-only stubs that this person has made? There are quite a few of them, and eliminating them I suspect would reduce the size of the problem significantly. Moonriddengirl, what is your view on the possible copyright infringement status of articles such as … spins wheel … Jennifer Whittle for example? Uncle G (talk) 14:37, 5 September 2010 (UTC)
- Minimal creativity, minimal content. I would regard that as a safe stub. If those couple of sentences were highly idiosyncratic, I'd probably look for a source. :) (Compelled to come back and clarify: I'm not saying that could not be a copyvio; it could, if it copies from another source and especially if it is one of dozens of articles he's copied from that same source, which would clearly not be a de minimis situation. This is a risk assessment question.) In terms of other alternatives, there is an image-based CCI on which I'm working that is not this scale where most of the images are free of copyright problems. I am mass sorting these to separate out the ones that need review. If we had somebody of great patience who could separate these articles according to "likely to be a problem" and "not at all likely to be a problem", that would help. I had planned to ask the contributor to do that himself, but, as I said above, I'm no longer sure we could trust him with that task. --Moonriddengirl (talk) 14:49, 5 September 2010 (UTC)
- I agree that this is about risk evaluation, rather than certainty. Uncle G (talk) 15:23, 5 September 2010 (UTC)
- Concur this should be reserved for special circumstances and this clearly qualifies. Looking at the scale of the problem I don't see how a volunteer effort could clean all of that up. Extreme measures for extreme actions so I support letting the bots loose to undo what he has wrought. If some stubs are lost in the process they can always be recreated if they are notable by others. --WGFinley (talk) 15:07, 5 September 2010 (UTC)
- This is an enormous number of articles; do we really think that these are entirely (or almost entirely) copyright violations? I don't know how representative a single selection could be out of thousands of articles, but Swimming at the 1997 Summer Universiade, for example, doesn't seem to be a copyvio (I couldn't find anything for it on google except sites which copied the wikipedia article). GiftigerWunsch [TALK] 15:11, 5 September 2010 (UTC)
- I'm actually wondering how he hasn't had hundreds of warnings from the Coren searchbot by now... GiftigerWunsch [TALK] 15:15, 5 September 2010 (UTC)
- No, I don't think they're entirely or almost entirely copyright violations. I think many of them are harmless charts and tables. I think, though, that the number of articles that are copyright violations will probably number in the hundreds. High hundreds or low hundreds? I don't know. --Moonriddengirl (talk) 15:33, 5 September 2010 (UTC)
- This very probably is a weakness in the 'bot and in the Google Web approach. Take Mikiyasu Tanaka, for example. Picking some phrases at random from the article (e.g. "Tanaka was sent abroad by the Japan Olympic Committee to study volleyball") and giving them to Google Web doesn't turn up the FIVB profile that it was copied from. But it is a copy, nonetheless. The sentences are in a different order. But they are the same sentences, the only changes being things like exchange of proper nouns for pronouns and the like. (In the original, it is "he was sent abroad by the Japan Olympic Committee to study volleyball".) Uncle G (talk) 15:36, 5 September 2010 (UTC)
- Giftiger, try this for Swimming at the 1997 Summer Universiade. 81.145.247.158 (talk) 15:40, 5 September 2010 (UTC)
- The fact that I couldn't find that quickly and that it's a single article out of thousands, does not bode well for our chances of being able to manually fix all these copyright problems... GiftigerWunsch [TALK] 15:43, 5 September 2010 (UTC)
- Ahem, it's listed as one of the references in the article. 81.145.247.158 (talk) 15:46, 5 September 2010 (UTC)
- It would be nice if we could rely on that, but we can't always. :) --Moonriddengirl (talk) 15:47, 5 September 2010 (UTC)
- There was no clue in the article where Edward Liddie was lifted from, for example. Uncle G (talk) 15:53, 5 September 2010 (UTC)
- Rather than a straightforward mass deletion, may I suggest an element of triage? Some of the articles that this editor created will have been edited by others, and some will be more or less notable than others. If we identify and delete those that are tagged as orphans, unreferenced or tagged for notability would that leave us something more manageable? ϢereSpielChequers 16:04, 5 September 2010 (UTC)
- Almost certainly not. I've reviewed a few hundred of the biographies now. Yes, it's less than 5% of the problem, but I was selecting at random, from the list before it was sorted, so I have little suspicion that my sample is biased. Notability is almost never an issue on which these subjects have been challenged or tagged. These are not exactly minor sporting figures and events. Likewise, orphan status would be problematic. Many of these articles are on navigation templates for sports teams, regular sporting competitions, and the like, and are unlikely to be orphans. (Quite a few cross-reference one another, too.) Nor, indeed, is lack of any citations a recurrent issue. Darius Dhlomo has linked almost all of xyr creations to on-line sports databases and the like. As criteria for filtering out the problematic articles, from what I've seen I suspect these won't be useful at all.
I suggested that we find some filtering criteria, above. I haven't yet come up with any, and Moonriddengirl quite rightly notes, above, that it might not be safe from a copyright perspective to even do that. Even the 1-paragraph stubs might be a mass copying exercise, from some source that we are unaware of. All of us who have reviewed the article set so far seem to have come to the same conclusion, that Darius Dhlomo simply doesn't write original prose, at all, anywhere, even if it's only a couple of sentences to make up a small paragraph. Pick a couple of hundred for yourself, check them for copyright violations, and see what conclusions you draw.
If you find from doing so some triage criteria that actually work in practice, that would be good news, of course. ☺ Uncle G (talk) 16:29, 5 September 2010 (UTC)
- I'm going to support whatever Moonriddengirl thinks is best. I'm convinced we are going to have to take drastic measures here. If she thinks triage would work, fine, but on the other hand, I don't want to give already over-burdened editors doing herculean work in the copyright field even more to do. Dougweller (talk) 16:13, 5 September 2010 (UTC)
- I don't know what bots can do. There are some good suggestions in this thread for narrowing down the list by presumptively deleting those that are least likely to impact others in the project, but I'm afraid that short of mass deletion the only way to process most of this is going to involve a human being (or two or ten) looking at each article. I would definitely support at this point simply wiping out creative text supplied by this contributor. But it's still going to take a ton of man hours just to review them all. --Moonriddengirl (talk) 21:19, 5 September 2010 (UTC)
- I think some simple criteria can be defined and then a script can check all the articles against the criteria without humans having to look at them. Defining the criteria would take a little bit of work. Example criterion: find all articles that don't contain text added by humans other than Darius Dhlomo. The "text added by humans" part means ignore edits made by known maintenance bots (interwiki etc), edits to metadata only (like categories), or edits with certain strings in the edit summary indicating various script-assisted edits unlikely to add new human-written text to the article. Deleting those articles might shrink the problem by enough to make manual triage feasible for the remaining ones. 67.122.211.178 (talk) 22:23, 5 September 2010 (UTC)
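The example criterion above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Revision` record, its field names, and the `SCRIPT_MARKERS` strings are hypothetical and do not correspond to any existing bot or MediaWiki interface, and a real run would need a vetted list of maintenance bots and edit-summary patterns.

```python
from dataclasses import dataclass

@dataclass
class Revision:
    # Hypothetical record of one edit; field names are assumptions for this sketch.
    user: str
    is_bot: bool
    summary: str
    text: str

# Assumed examples of edit-summary strings that indicate script-assisted edits.
SCRIPT_MARKERS = ("AWB", "HotCat")

def prose_lines(text):
    """Return the non-empty lines of the page, ignoring category and DEFAULTSORT metadata."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("[[Category:") or stripped.startswith("{{DEFAULTSORT:"):
            continue
        kept.append(stripped)
    return kept

def has_other_human_text(revisions, target_user):
    """True if any editor other than target_user added non-metadata lines by hand."""
    previous = set()
    for rev in revisions:  # oldest revision first
        current = prose_lines(rev.text)
        added = [line for line in current if line not in previous]
        previous = set(current)
        if not added:
            continue  # metadata-only edit or pure removal
        if rev.user == target_user or rev.is_bot:
            continue
        if any(marker in rev.summary for marker in SCRIPT_MARKERS):
            continue
        return True
    return False
```

Articles for which `has_other_human_text` returns `False` would be the ones matching the criterion. The line-level comparison is deliberately crude: a manual copy-edit by another editor counts as added text, which errs on the side of keeping an article for human review.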
- How about running a script that identifies all of these articles with no more than 500 characters (or some other number) contributed by editors other than Darius Dhlomo. That might be most of the affected ones and they can then be deleted, making the problem a lot smaller. 67.122.211.178 (talk) 18:02, 5 September 2010 (UTC)
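A minimal sketch of the threshold idea, assuming each article's history is already available as an oldest-first list of `(user, text)` pairs. The function names and the `histories` mapping are hypothetical, and counting net growth per edit is only one possible way of measuring "characters contributed"; it ignores rewrites that keep the page the same length.

```python
def chars_added_by_others(revisions, target_user):
    """Sum the net character growth of edits made by anyone except target_user."""
    total = 0
    previous_length = 0
    for user, text in revisions:  # oldest revision first
        delta = len(text) - previous_length
        previous_length = len(text)
        if user != target_user and delta > 0:
            total += delta
    return total

def deletion_candidates(histories, target_user, threshold=500):
    """Return titles where other editors added no more than `threshold` characters."""
    return sorted(
        title
        for title, revisions in histories.items()
        if chars_added_by_others(revisions, target_user) <= threshold
    )
```

The 500-character default simply mirrors the figure floated above; the resulting list would still be a list of candidates for review, not a finished deletion list.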
- As per my comment above, I support mass deletion here. As I read the evidence from Uncle G so far (thank you), it's unlikely we can find any safe triage parameters. Although there's no great rush so I'm more than happy to wait for suggestions.--Mkativerata (talk) 19:18, 5 September 2010 (UTC)
- Comment: a skim of some smaller contribs suggest quite a few edits are just editing categories, DEFAULTSORT and the like. Can these be excluded from the listing using some automation? As for creations, I would suggest nuking the lot (perhaps leaving a log which then someone like Article Rescue Squadron can use, if they feel like taking responsibility for checking individual entries. In this case, for "nuking", read "userfying" or "incubating".) Rd232 talk 00:15, 6 September 2010 (UTC)
- I don't think userfying or incubating copyvio articles gets rid of the copyright problems. It probably makes the problems worse. 67.122.211.178 (talk) 05:42, 6 September 2010 (UTC)
- Properly userfied (or any incubated) articles are hidden from search engines, and so effectively not really published any more. So a bot can do this for all questionable pages immediately, and then a little more time can be taken to see if there's anything that can be salvaged. Then delete the remaining userfied/incubated pages. Incubated pages are deleted after a time anyway (1 month?). Of course they would need to be tucked away in a subsection or something of the Incubator, to avoid swamping everything else in it. Rd232 talk 09:10, 6 September 2010 (UTC)
- That is a good point. I think you are right that moving the articles to incubation can be better than leaving them in article space. 75.57.241.73 (talk) 23:18, 8 September 2010 (UTC)
- Suggestion Just have a bot (you can use AWB) mark all of them as CSD: G12 for copyvio. They can then be deleted very easily and quickly. Presumably whatever admin gets to it will check the copyvio status. --Selket Talk 00:38, 6 September 2010 (UTC)
- That would be throwing the baby out with the bathwater. I've been looking at random examples, & either I don't understand the criteria for copyright violations (although the example above for Swimming at the 1997 Summer Universiade was pretty obvious when I found the right revision), or I'm not picking the right examples. Anyone but the most painstaking Admin, when faced with all of those edits tagged as CSD will only examine so many at the beginning before either giving up -- or simply untagging the rest. (And if I understood the proper way to clear those which I don't think are copyvios, I'd offer a hand thinning out this list.) -- llywrch (talk) 02:58, 6 September 2010 (UTC)
- Selket, we're talking about ten thousand articles, or maybe even 20,000+. It would take admins years to work through that many. 67.122.211.178 (talk) 05:42, 6 September 2010 (UTC)
- Let's also not forget that {{db-g12}} has a mandatory url parameter; a reviewing admin should simply remove the template if it doesn't have a url parameter. How would we automatically figure out what the url(s) being copyrighted are? GiftigerWunsch [TALK] 08:27, 6 September 2010 (UTC)
- Support deletion of them all. It's better to lose some good contributions than to keep so many copyright violations, and it's nearly impossible to check them one by one. Delete them all and ban the user. Fram (talk) 08:21, 6 September 2010 (UTC)
- Support Mass deletion. There are no criteria that can reasonably exclude the right (or wrong) kind of content here. While the worry of losing hundreds of "our" articles is understandable, in reality we have to remember that articles built on other people's content aren't ours to begin with. MLauba (Talk) 08:46, 6 September 2010 (UTC)
- Support mass article deletion. Over 20,000 articles is just too many to go through, presuming that most of them are copyright violations. I really don't see a way where we can check them manually one by one. Bejinhan talks 10:24, 6 September 2010 (UTC)
- Mass deletion is a bad idea, people. That would get our article on 1943 robotically deleted. What we need here, if we truly are going down this route (which I'm sure we're all very hesitant about), is a special-case speedy deletion criterion that we can apply, such as "created by Darius Dhlomo, no substantive content edits other than by Darius Dhlomo, and containing actual running prose commentary rather than just raw uncopyrightable names, numbers, and dates". We need community authorization for Moonriddengirl and other administrators to perform speedy deletions under that specialized criterion. Uncle G (talk) 10:54, 6 September 2010 (UTC)
- It's not a speedy category, but we sort of already have authorization for that, though I don't trot it out under ordinary circumstances. Per policy at Wikipedia:Copyright violations: "If contributors have been shown to have a history of extensive copyright violation, it may be assumed without further evidence that all of their major contributions are copyright violations, and they may be removed indiscriminately." Ordinarily, when I encounter an article at CCI that seems to be a copyvio but I can't prove it, I use the copyvio template on the face and Template:CCId to give interested contributors a week to look at them and offer input. (Keeping in mind that by the time we get to CCI, we have verification of multiple violations of copyright policy; this template and this approach are not supported by policy where there is not a proven history of extensive copyright violation from a contributor.) The problem with this approach here is that all of these articles would be listed by bot at WP:CP, which would totally break the board. If we created a different template for the face that would not be placed at CP, but instead categorized by date, it could be manageable to have a bot tag at least the articles he's created. It still requires human review, which would be time consuming, but we could then delete or stub the ones to which he's added substantial text, remove the tag from those without. It's kind of a cross between the delete them all approach (which, having worked CCIs for some time now I understand in this case) and the legitimate desire not to lose more than we have to. --Moonriddengirl (talk) 11:47, 6 September 2010 (UTC)
- If you want a category for the created articles to progressively de-populate, I could get Uncle G's major work 'bot to append Category:Articles created by Darius Dhlomo (or some template including it) to everything on VernoWhitney's original list. Would a category with just under ten thousand articles in it be useful? There would be no date or size sorting. Uncle G (talk) 12:58, 6 September 2010 (UTC)
- Yes, that would be helpful. I was just working up a template based on {{copyviocore}} and {{CCId}} with a similar notion: User:Moonriddengirl/CCIdf. Not sure if we should blank these articles as we do with {{copyvio}} (if nothing else, it makes it clear that there's a time limit) or take an approach more like {{PROD}}, but since I used {{copyviocore}} it's created from the presumption of blanking. Would something like that be helpful? It's still going to be a ton of work, but it would make the job easier if we delay admin processing for seven days. At that point, we can G6 anything that meets the criteria: extensive creative content added by Darius Dhlomo that cannot be removed without leaving an unusable article (similar to G12). It also allows interested contributors an opportunity to get rid of all creative content added by Darius Dhlomo. But we'd need to do something to flag if the template is removed out of process; in copyright cleanup, that happens quite regularly. People don't want the article deleted, and they don't much seem to care if it's a copyvio or not. --Moonriddengirl (talk) 13:13, 6 September 2010 (UTC)
- I'm rather taken with Rd232's idea of, well, to be frank, sharing the pain. I foresee three things needed:
- A template for the 'bot to blank the article with (Community discussion needed: Should the 'bot blank the article?) I've got something a bit shorter than User:Moonriddengirl/CCIdf in mind. I'll be bold if you've no objection. I think that we might need two templates, one for the blanking and one for a deletion nomination.
- An explanation of the 'bot's task, to be used in all 'bot edit summaries so that people seeing their watchlists light up with a thousand blanked articles have somewhere immediately to go for an explanation.
- Instructions for editors on what to do now, to be linked to from the template notice.
- Additional points: The template notice must be carefully worded. I don't think it fair to have Darius Dhlomo's name come up all over Google. The instructions must be clear that this is a complex task that can end in a multiplicity of outcomes (from {{copyvio}} to simple removal of the template). The instructions must also be clear that editors restoring content shoulder the responsibility for doing so. All of the notices and instructions must be hashed out before the 'bot begins. And we need community attention given to the fact that a 'bot is about to mass-blank some ten thousand articles. (I've updated centralized discussions, and to bring more attention here. I've also put a notice on the 'bot owners' noticeboard.)
I cannot help you with noticing template removals. But there are other 'bot owners who probably can. There are 'bots that note Proposed Deletion challenges. Uncle G (talk) 14:09, 6 September 2010 (UTC)
- No objections from me. :) I'm all about getting the work done, however we can best do it. You can change my mock-up directly or build your own, whatever works. --Moonriddengirl (talk) 14:34, 6 September 2010 (UTC)
- I just took a look at User:Moonriddengirl/CCIdf, but I think it's a little impractical, and should be closer to what I suggested below: it should be removed by anyone who feels that they have addressed the copyright concerns. If it remains after a week, it should be deleted as IAR in a similar way to a PROD. Placing a template like this such that admins have to do all the work, would mean this problem is likely to never be solved. Once the majority of the articles are deleted, because no one has challenged their deletion as copyright vios, we can manually check anything that remains to confirm that the users who removed the template because they didn't find / managed to address the copyvios, has been properly addressed. GiftigerWunsch [TALK] 14:39, 6 September 2010 (UTC)
- I don't think the articles should be blanked, either; that's likely to just impede evaluation of the articles. GiftigerWunsch [TALK] 14:41, 6 September 2010 (UTC)
- Note: I have just created an alternative draft version of the template in my userspace; comments are welcome. GiftigerWunsch [TALK] 14:48, 6 September 2010 (UTC)
- ? Darius didn't create 1943. Is it being suggested to delete all articles Darius has ever touched? I thought we were deleting all Darius creations in order to cut the size of the list of contributions needing review. Maybe do that, and blank/special tag all articles he's touched (excluding identifiable cases of minor changes only, like categories)? Rd232 talk 12:06, 6 September 2010 (UTC)
- 1943 is on the list of articles that we have. It's on page 24. (Yes, I've been to page 24. I've even tagged 1943 as not a copyright violation.) Mass deletion of everything on the list gets 1943 and many other such articles deleted. We don't need mass deletion, and mass deletion wouldn't be right. What we need is (a) community confirmation that it's an acceptable loss to the project to lose articles such as Paul Easter and Yohann Bernard, (b) community confirmation that we don't trust any running prose content by this editor not to have been copied from somewhere, and (c) community confirmation that it's an acceptable risk to the project to retain articles such as Matías Médici and Franklin Chacón. Uncle G (talk) 12:37, 6 September 2010 (UTC)
- I don't think anyone has proposed mass-deleting every article Darius contributed to. Just the ones he created, and (in some versions) just the ones he created that meet certain other criteria (like absence of substantial contributions from other users). So bringing up 1943 is a red herring. 67.122.211.178 (talk) 17:55, 6 September 2010 (UTC)
- Alternative: perhaps there is a slightly less destructive way to go about this. Instead of outright deleting every article the user created, why not PROD them all as being potential copyvios, directing to this discussion, with the added condition that users removing the PROD are asserting that they have looked for copyvios and dealt with any they've found? This could be explained in the prod message. The majority of the articles will probably be left for the PROD to expire, in some cases users who have contributed a decent amount to the articles will deal with the copyvios and remove the prods, and then we're just left with a hopefully much smaller number of articles where the prod has been removed without the copyvios being solved; whichever articles survive can then be checked by those who are willing to help solve this case. Any thoughts? GiftigerWunsch [TALK] 12:23, 6 September 2010 (UTC)
- It may also be preferable to use a custom template instead of a PROD, so that editors don't get confused that additional conditions have been applied to the PROD, and then articles which still have the template after a set time (7 days? 10?) could be deleted as IAR. GiftigerWunsch [TALK] 12:25, 6 September 2010 (UTC)
- This is a good idea.
Also, echoing Rd232, my thought was in fact that we delete (or, per Giftiger, custom-prod) every article Darius created, not every article he touched. Nandesuka (talk) 12:30, 6 September 2010 (UTC)
- Not quite. Darius Dhlomo created Albertina Dias, but most of its content was written by Sillyfolkboy. There are plenty like that. Uncle G (talk) 12:58, 6 September 2010 (UTC)
- Yes. Please be aware that I completely rewrite articles like this from scratch on a fairly regular basis. If anyone sees my name in the history of an athlete biography, please message me first if you're planning on deleting a big chunk of my work! SFB/talk 22:41, 10 September 2010 (UTC)
- The sheer number of copyright violations uncovered here makes it necessary to take drastic action—even if that means deleting thousands of articles. As Moonriddengirl says, there is already a process in place to do this, and I trust the people involved with this to handle the deletions in an appropriate way. GiftigerWunsch's suggestion in the above post is reasonable; {{CCId}} (which Moonriddengirl mentioned above) can be used for that purpose. Ucucha 12:34, 6 September 2010 (UTC)
- Whoops, I hadn't read Moonriddengirl's suggestion above; I guess great minds think alike ;) GiftigerWunsch [TALK] 12:42, 6 September 2010 (UTC)
- I guess so. :) I'm posing some more ideas on this a bit higher up. --Moonriddengirl (talk) 13:13, 6 September 2010 (UTC)
- I think the best thing to do is have a bot deal with every affected article. Initially, this would be page blanking, and replacing with a Darius-related explanatory note template. The copyright problem then goes away immediately, and more time can be taken to rescue articles, with effort drawn from lots of people not normally active on copyright issues. Then, after say 30 days, mass-delete (or possibly mass-incubate, if to a special subsection to avoid swamping the WP:INCUBATOR) any still tagged. Use a bot to notify major contributors of the action, and thus spread the work involved very very widely. Rd232 talk 13:22, 6 September 2010 (UTC)
- I know it's a PITA (Pain in the a**) but the simplest solution(s) would be to revert all edits and start over, even if it means just leaving stubs. That basically means using a bot, given the number of edits, and turning most of the affected articles into stubs. A stub can be improved. Modifying copyrighted data would definitely compromise any editor who did so even unknowingly. Kadathdreamques (talk) 23:51, 13 September 2010 (UTC)
- Support mass deletion. We need to err on the side of safety and I don't think the community has the capacity to do this by hand in a consistent manner. Leaving hundreds of blank articles isn't good for reader experience. Kaldari (talk) 17:54, 16 September 2010 (UTC)
Mass blanking of ten thousand articles by a 'bot
Just to make this clear, with its own section heading, here's the idea:
- A 'bot goes through everything on VernoWhitney's original list of some ten thousand articles, which are the articles that Darius Dhlomo created. It blanks the article and replaces it with a template notice.
- Community discussion required: Should the 'bot blank the article? This is a copyright issue.
- I volunteer Uncle G's major work 'bot for the task.
- Community discussion required: This 'bot does not have the 'bot flag. The 'bot flag will stop people's watchlists lighting up with thousands of blanked articles. I'm happy to have the 'bot flagged for the duration of the task. But do we want people not to notice?
- The edit summaries of the 'bot link to an explanation of the 'bot's task, which gives people something to look at straightaway to see what's going on and why.
- The notice itself links to instructions for editors, with a clear procedure for assessing copyright infringement.
- The notice also categorizes all articles into a (hidden) category Category:Articles created by Darius Dhlomo.
- Community discussion required: Category:Mass infringement copyright cleanup or some such instead?
- Editors assess the status of the article and act appropriately.
- Community discussion required: Do we put a time limit on this? Do we incubate any articles left blanked after 30 days, per Rd232 above? Or do we just leave them blanked long-term for people to address at leisure?
- We provide a streamlined version of {{copyvio}} that is dedicated to this task and that doesn't have all of the overhead of the normal procedure. An article tagged with the special process can go straight to deletion assessment by an administrator without the additional listing overheads and suchlike. But administrators can only delete such articles if they were previously tagged as part of the cleanup effort in the first place. (Vandals don't get to abuse the template in the obvious way.)
- Community discussion required: Could we just re-use the existing speedy deletion notices instead?
None of this is happening immediately. The relevant notices and templates need to be set up before we even think of starting such a 'bot task. This is a proposal, condensed from the above. There are questions yet to be answered. Note that it addresses just under ten thousand articles. There are just over thirteen thousand articles in the cleanup list. If all goes according to plan, this will let the CCI folks reduce the list to just the three thousand or so articles touched by Darius Dhlomo but not created by xem.
The major advantage of this over the mass deletion idea is that it shares the task around ordinary editors, rather than concentrating it in the hands of a handful of administrators. Everyone has the tools to action the next step after a blanking.
Please discuss. Uncle G (talk) 14:53, 6 September 2010 (UTC)
- Support this seems more sensible than mass deletion, may I assume that the categories, tags and external links will be left unblanked as they are unlikely to constitute a copyvio? Also would it be possible to run this using a bot without the bot flag? Like many users I ignore bot edits on my watchlist so I would be unaware of any articles that I'm interested in being blanked. Also is there any chance of a second bot run going through these articles and comparing the last version edited by the copyviolater with the version before it is blanked, as there is no point losing such articles. ϢereSpielChequers 15:40, 6 September 2010 (UTC)
- There is some merit to the idea of saving the categories/tags (and sigh, maybe the extlinks too), though that info can be saved elsewhere even if the articles are deleted. 67.122.211.178 (talk) 18:10, 6 September 2010 (UTC)
- It's either blanking everything or blanking nothing and prepending/appending the template if you want my 'bot to do it. It's only geared for the simplest of content work. I could write something to do more complex editing, but I don't have such ready to hand.
There's nothing inhibiting a 'bot from running that doesn't have the flag. The flag allows everyone to exclude 'bot-marked edits from various lists, like watchlists and recent changes. Usually 'bots don't make edits that are interesting to recent changes patrollers or people with watchlists. The question is whether that's true in this case. Please discuss.
As for the last, note that a blanked article is not the end point, but a mid-way point in the process. The whole idea of blanking, rather than deletion, is that we don't lose edit history unnecessarily, and that anyone with the ordinary edit tool can recover content if there turns out to be no copyright violation. Uncle G (talk) 16:38, 6 September 2010 (UTC)
- IMO the bot should be flagged (RC patrol has enough to do without dealing with this) but should log all its actions on some special pages that everyone can review. That includes describing its analysis of pages that it then decides not to edit, so the log pages would be more informative than the bot's contrib history. The bot should probably run under a specially made new account for this purpose too. 67.122.211.178 (talk) 22:16, 6 September 2010 (UTC)
- Comment Do you think the bot could identify and tag separately those articles with > 500 edits, or something in that neighborhood? At that point there is likely to be little residual copyvio, so it needs to be looked at differently. Also, would the coren copyviofinderbotthingy (phew) bot be able to check a category, and would it be any use? 69.236.190.48 (talk) 16:08, 6 September 2010 (UTC)
- If VernoWhitney is capable of coming up with separate lists of such articles to work from, I can certainly tag them differently for each list. I don't know what CorenSearchBot is capable of in respect to scanning old revisions of existing pages in categories. Uncle G (talk) 16:38, 6 September 2010 (UTC)
- I don't believe that articles with >500 edits necessarily have little residual copyvio. Look at the history of Between Silk and Cyanide, which I just had to mostly-blank because somebody inserted a copyvio in 2006, even though dozens of other editors worked on the article after that. The additional editing expanded the article a lot and smeared the copyvio all over the article, so it was no longer revertable in one lump. There is a comment on the talk page explaining further. I do think articles with substantial text added by human editors (bot edits don't count) other than Darius should be flagged. 67.122.211.178 (talk) 18:10, 6 September 2010 (UTC)
- Well, I think that articles with 500 edits generally are large enough they would need human review the most of any of the articles. NativeForeigner Talk/Contribs 18:56, 6 September 2010 (UTC)
- Yes. Perhaps the bot could blank those in two edits. It would first detect all the text that had been added by Darius (as opposed to other editors) and mark that text with a font or color change and save it. Then it would blank the whole article and save again. People wanting to restore the article could look at the marked-up version in the history and use that to help separate Darius text from non-Darius text. 67.122.211.178 (talk) 22:29, 6 September 2010 (UTC)
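The "detect all the text that had been added by Darius" step could be approximated by diffing successive revisions word by word. The following is only a rough sketch under stated assumptions: `attribute` is a hypothetical helper, revisions are assumed to be available as plain `(author, text)` pairs in chronological order, and word-level diffing with Python's `difflib` is a stand-in for whatever a real bot would use.

```python
from difflib import SequenceMatcher

def attribute(revisions):
    """Return (word, author) pairs for the latest revision.

    revisions: chronological list of (author, text) tuples (assumed input shape).
    A word is credited to the author of the first revision in which the diff
    reports it as inserted or replaced; unchanged words keep their earlier owner.
    """
    words, owners = [], []
    for author, text in revisions:
        new_words = text.split()
        new_owners = []
        matcher = SequenceMatcher(a=words, b=new_words, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Carried over unchanged: keep the existing attribution.
                new_owners.extend(owners[i1:i2])
            else:
                # insert/replace add j2 - j1 new words; delete adds none.
                new_owners.extend([author] * (j2 - j1))
        words, owners = new_words, new_owners
    return list(zip(words, owners))

history = [
    ("Darius", "He won gold"),
    ("Other", "He won gold in Athens"),
]
marked = attribute(history)
```

Words moved or lightly reworded by later editors would be credited to those editors, so a marked-up revision produced this way is a hint for human reviewers, not a determination of what is or is not a copyright violation.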
Prod-like proposal
I thought I'd move my proposal here as it's getting lost in the discussion and I feel it would be beneficial. I propose that all articles that Darius has created be tagged with a template similar to a PROD (I've created a draft here which all are free to edit and comment on), such that the article will be automatically deleted after 7 days (or perhaps longer, depending on consensus). Like a prod, anyone can remove the template, but unlike a prod, in doing so they are asserting that they understand copyright violation, and have thoroughly checked the article and fixed any copyvios. This would be clearly explained on the template.
Any articles which are not checked or which cannot be saved, will still have the template after the time has expired, and will be deleted per WP:IAR. Those articles which survive can then be double-checked to confirm that they are not copyvios.
Hopefully during this process, a large number of editors will have noticed that an article they've contributed to is being sorta-prodded, and will help to remove copyvios and then remove the template. Any articles where no one noticed the sorta-prod deletion are acceptable losses, being deleted per usual PROD rules anyway. GiftigerWunsch [TALK] 15:02, 6 September 2010 (UTC)
- This makes much more sense than a mass deletion. I support this. elektrikSHOOS 16:57, 6 September 2010 (UTC)
- By this process, the articles will be automatically deleted unless somebody objects. By the process above, human review is needed, but articles that are not infringements will be salvaged. --Moonriddengirl (talk) 16:59, 6 September 2010 (UTC)
- On the other hand, manually reviewing thousands of articles requires an enormous amount of time and resources, and alerting those who have contributed to the articles to help sort out the copyright issues by otherwise having them deleted after a fixed period, means that the job will be distributed among many people, and hopefully achieved in less time. GiftigerWunsch [TALK] 17:50, 6 September 2010 (UTC)
- On the gripping hand, when the 7 day Proposed Deletion period expires on ten thousand articles all at the same time, we're right back in the same position that we started from. ☺ Uncle G (talk) 18:41, 6 September 2010 (UTC)
There's not that much difference between this and the above proposal, except in the matter of imposing a time limit, and defaulting to automatic deletion. Defaulting to automatic deletion will get screams of outrage from people who find out after the fact, I predict with a fair degree of confidence. (That's in part why I've pointed to this discussion on as many noticeboards as I have. I want to reduce the number of people who find out after the fact.) "Why wasn't I warned before you went off and deleted thousands of articles relevant to my WikiProject? I'm going to abuse an administrator for this!", they'll say. Leaving articles blanked, with just a warning notice on them, to be addressed at somewhat greater leisure, avoids that drama before it starts. It also addresses the concern that people have, that we all have, about not losing articles that aren't copyright violations at all. Uncle G (talk) 18:41, 6 September 2010 (UTC)
The above proposal is now more concrete. I've boldly updated User:Moonriddengirl/CCIdf and written Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation (for edit summaries) and Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help (for the notice). Please review, discuss, and boldly improve. Uncle G (talk) 18:41, 6 September 2010 (UTC)
- I'm somewhat concerned, though, that "at somewhat greater leisure", considering that we're talking about thousands of articles, is going to mean months, or longer. If the articles are all going to be blanked until such time as an administrator (or at least other editors) manage to check them for copyvios and restore them, what's the difference? If the deletions are found out after the fact, interested parties could request undeletion so that the article could be rechecked for copyvios, the references could be salvaged, the article could be written, or whatever else. GiftigerWunsch [TALK] 19:25, 6 September 2010 (UTC)
- Thousands don't need to take months if everybody pitches in. I haven't seen you at CCI yet ;) —fetch·comms 02:12, 7 September 2010 (UTC)
- Fetchcomms, rather than looking at 1000's of articles for a few seconds each, I wonder if you could help investigate Rocío Ríos (see section below). That boilerplate text came from somewhere. If we can't establish where, then we can't rely on whatever processes you're proposing to use on the other 10,000 (or 20,000 or 40,000 depending on how you count) affected articles. 67.122.211.178 (talk) 09:13, 7 September 2010 (UTC)
- I like this suggestion. Will discuss more at Nuclear option below. --Moonriddengirl (talk) 12:14, 7 September 2010 (UTC)
When voting is not evil
I'm suggesting something that is "out of the box" to help solve this problem. The outlines of this suggestion are as follows:
- We have a real, majority-rules vote on whether to mass-delete all of the articles DD created. Or to adopt Uncle G's bot proposal above.
- A username's vote is counted only if that username has reviewed 10 or more of the articles listed at Wikipedia:Contributor copyright investigations/Darius Dhlomo or one of the associated pages.
- One can vote more than once (i.e., review 20 articles, you get 2 votes), under different usernames, etc. One doesn't need to be an Admin.
- The vote is a simple yes/no. Either we do it or we don't.
- At the end of a week or 10 days, the votes are counted.
What I like about this idea is even if the result is to mass-delete these articles, some will have been examined & determined to not be copyvios. (Based on previous discussions on AN/I, this could lead to as many as 100 people participating, which would mean at least 1000 articles examined.) Hopefully, this would give us further information about a more precise filter for where the copyvios are & aren't. Thoughts? (And while the conversation continues, I'll be working thru the list of possible copyvios; I know I can't save all of the non-copyvios, but I know I can save some.) -- llywrch (talk) 17:04, 6 September 2010 (UTC)
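The counting rule outlined above can be sketched in a few lines. This is illustrative only: `tally`, `ARTICLES_PER_VOTE` and the ballot tuples are hypothetical names, and the assumption that votes scale as one per full block of 10 reviewed articles generalizes from the single example given (20 articles, 2 votes).

```python
ARTICLES_PER_VOTE = 10  # threshold taken from the proposal

def tally(ballots):
    """Count weighted yes/no votes.

    ballots: iterable of (username, choice, reviewed_count), where choice is
    "yes" or "no". A username with fewer than 10 reviewed articles contributes
    nothing, since integer division yields zero votes.
    """
    totals = {"yes": 0, "no": 0}
    for _username, choice, reviewed in ballots:
        totals[choice] += reviewed // ARTICLES_PER_VOTE
    return totals

example = tally([
    ("ReviewerA", "yes", 25),  # 2 votes
    ("ReviewerB", "no", 9),    # below the threshold, 0 votes
    ("ReviewerC", "no", 10),   # 1 vote
])
```

With 100 participants each reviewing the minimum of 10, the same arithmetic gives the figure of at least 1000 articles examined.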
- This doesn't sound like a good idea to me. I'm even skeptical that any of us can review an article and determine the absence of copyvio. Even if the article has no text (e.g. just a table), maybe that table was copied from somewhere. And anyway, voting is always evil, and the likelihood of getting 10000 articles manually reviewed reliably is very small. If you've got a criterion like "all info is in a table and there are no strings of more than 5 consecutive english words" we could run a script that finds and lists such articles. That might be quite a lot.
Anyway, reviewing some tiny fraction (10 or 20) of the articles shouldn't give a special say over the rest of them. If you want extra authority over the whole collection of articles, you have to review all of them. Under your voting scheme, voters should also accept (and be assigned) responsibility for any copyvios that later turn up in articles you have declared to be free of copyvios. Otherwise your suggested voting system gives influence-seekers incentive to "review" articles as quickly as they can, and potentially miss a lot of bad stuff. 67.122.211.178 (talk) 18:12, 6 September 2010 (UTC)
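A script for the criterion mentioned above ("all info is in a table and there are no strings of more than 5 consecutive english words") might look roughly like this. It is a minimal sketch, not a tested bot: the function names are hypothetical, table detection assumes non-nested `{| ... |}` wikitext tables, and a "word run" is approximated as letters-only words on one line separated by spaces, with any digit or punctuation ending the run.

```python
import re

MAX_RUN = 5  # longest run of consecutive words tolerated outside tables

def strip_tables(wikitext):
    """Remove {| ... |} table blocks (crude: does not handle nested tables)."""
    return re.sub(r"\{\|.*?\|\}", " ", wikitext, flags=re.DOTALL)

def longest_word_run(text):
    """Length of the longest run of space-separated alphabetic words."""
    # Drop link brackets and bold/italic quote markup so they do not split runs.
    text = re.sub(r"\[\[|\]\]|'{2,}", "", text)
    # Anything other than letters, apostrophes, spaces or tabs ends a run.
    chunks = re.split(r"[^A-Za-z' \t]+", text)
    return max(len(chunk.split()) for chunk in chunks)

def is_table_only(wikitext):
    """True if no word run longer than MAX_RUN remains once tables are removed."""
    return longest_word_run(strip_tables(wikitext)) <= MAX_RUN

stub = "{|\n|-\n| 2004 || Olympic Games || Athens, Greece || 5th\n|}\n[[Category:Living people]]"
prose = "She won the gold medal at the 2003 Pan American Games in Santo Domingo."
```

Here `is_table_only(stub)` is true and `is_table_only(prose)` is false. Passing the filter says nothing about whether the table itself was copied from somewhere, which is the objection raised in the comment above.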
- If a human can't properly determine what is a copyvio, then how can one tell a bot how to do it? And the idea of giving "a special say" isn't that: it's to encourage people to actually tackle the problem of cleaning up this mess, rather than the usual process of talking about the problem. (Out of over a dozen posters to this lengthy thread -- of whom three want a mass deletion -- less than half have even left evidence that they reviewed any of the entries at the CCI.) People here appear a lot more eager to tell us what the solution is & expect someone else to do it, than to actually help fix the problem. And fixing the problem isn't hard, just tedious, & would be handled quickly enough if enough people spent their time there & not at this DramazBoard. -- llywrch (talk) 21:43, 6 September 2010 (UTC)
- The difficulty of determining what is a copyvio is at the root of the mass deletion proposal. The proposal is basically that if the article was created by Darius, there is an a priori likelihood that it is a copyvio whether we can find the original source or not, so we should delete it in the expectation that someone else will eventually recreate it if the topic is important. As for special say, no, sorry, if you review 20 articles, that's less than 0.1% of the articles affected, and you can't say "well look, I've made a significant dent in this problem" because 0.1% is insignificant. If you review 5000 of the articles, then your argument may have a bit more validity. That's why most of us (I think) are giving high credence to the views of Moonriddengirl, because of the enormous amount of time she's spent dealing with this sort of problem. Looking at 20 or 50 or 100 of these Darius spewings doesn't hold a candle compared to that. 67.122.211.178 (talk) 22:01, 6 September 2010 (UTC)
- Yeah, I've looked through about 40 pages he created, and I have to say that, while most are not vios right out of the box, some are, while some are added later. There's no way to figure out any pattern unless you get through at least a couple hundred. I disagree that we should just delete assuming that they're all vios, which they obviously are not. Manual is hard, but not impossible if everyone helps. These articles are of notable people, and shouldn't be deleted on the assumption that someone will eventually create a non-cv version of them. —fetch·comms 04:28, 7 September 2010 (UTC)
- Not impossible if everyone helps -- but why should people spend their time that way? [S]houldn't be deleted on the assumption that someone will eventually create a non-cv version of them. I don't know what the issue is, we have 3.3 million articles that all got created somehow, and we're talking about a bunch of almost content-free stubs once the presumed copyvios are removed. Maybe it's just me but I don't see much point in getting attached to such articles. Nobody is proposing to salt them from recreation. We got along without them up to when they were created (some fairly recently) and (for most of them) nobody else thought they were interesting enough to edit substantially. I could see some value to keeping the names/references/categories in a list someplace. We already have the names in the CCI report, if that helps. 67.122.211.178 (talk) 08:10, 7 September 2010 (UTC)
- Well, if we just reword all of the stubs and delete the (few) longer articles he's written or expanded with prose (not just standard tables), while keeping them blanked until a rewrite (as the option below outlines), I see no reason why they all need to be removed. Hiding the cv while preserving the content history sounds like a fair compromise. —fetch·comms 00:14, 8 September 2010 (UTC)