Wikipedia talk:Contributor copyright investigations/Darius Dhlomo

Presumptive removal?

Should any prose that smells of copyvio be presumptively removed? I have already found one definite and three possibles in a fairly small sample, and I think that, given the potential scale of the problem, presumptive removal would speed things up a little. Boissière (talk) 21:56, 4 September 2010 (UTC)

Yes, they should be presumptively removed. With the massive scale of this one there's really no other way to handle it, particularly since all of the articles currently listed are the ones they actually created. VernoWhitney (talk) 23:25, 4 September 2010 (UTC)

Can we look at Darius's edits by size of the edit instead? As I stated in the opening, the shorter articles he's created (below 2.5 KB creation size) are practically a green light for original work. From a legal perspective, no one will bother contesting a couple of sentences describing basic, key information on a subject. Also, I would guess that the copyright problems will lie solely in biographies and not in the likes of X at Games, etc. Sillyfolkboy (talk) (edits) Join WikiProject Athletics! 00:47, 5 September 2010 (UTC)

Yeah, I got it in my head that it would be easier to split out created articles from the other articles they've edited but not created, which is why it ended up like this. I'm running it through my bot right now so tomorrow I should be able to update the pages with created articles sorted by edit size and then other edited articles also sorted by edit size. VernoWhitney (talk) 03:11, 5 September 2010 (UTC)

Need help?

I just saw this report on ANI and thought I'd see if you'd like some help. I've never gotten involved here so I'm unsure as to how this works, procedurally speaking. Should I claim an article in the list somehow? I'm guessing the x graphic means no copyright issues found. What happens if I do find something plagiarized? How does it get tagged, and is there somewhere else that it would be reported? Sorry for so many questions, but I want to make sure I'm going about it properly before I jump right in, so I don't end up creating even more work for someone. e. ripley\talk 04:36, 5 September 2010 (UTC)

Yes, {{n}} means no copyvio found. {{y}} means there's a problem or at least a likely problem. If you find something that looks to be a problem, whether or not you can find a source, you can a) remove the copyvio yourself on the spot or b) replace the page with {{subst:copyvio|1=source}} and follow the instructions on the generated page that tell you how to list it on the Wikipedia:Copyright problems daily subpage for others to follow up on. VernoWhitney (talk) 12:40, 5 September 2010 (UTC)

And what does the red X that some editors have been using indicate? DGG ( talk ) 00:18, 9 September 2010 (UTC)

{{n}} generates the red X, so it means no copyvio found. {{y}} generates the green check mark, which means there's a problem. VernoWhitney (talk) 00:27, 9 September 2010 (UTC)

Refining approach

This evening I have been trying to develop an API program which takes the wikitext of a suspect article and tries to count the amount of prose in it. It does this by dividing the article into sections and counting the words in each section. A section is principally either a normal section between two headings or a cell in a table. The program then reports the word count of the largest section. This way an article consisting mainly of tables should return a low value. Here is what it produces for Articles 61 through 80 (I chose this range because it includes a reported but not yet cleaned copyvio in Athletics at the 1980 Summer Olympics – Men's 3000 metre steeplechase).

Cycling at the 1972 Summer Olympics – Men's individual road race - Max words in a section = 190
National champions Javelin (men) - Max words in a section = 115
Athletics at the 1992 Summer Olympics – Men's 800 metres - Max words in a section = 34
Estonia national football team 1996 - Max words in a section = 59
1999–2000 in Dutch football - Max words in a section = 102
2009 Vuelta a Colombia - Max words in a section = 589
1987 Race Walking Year Ranking - Max words in a section = 47
2008 Women's Pan-American Volleyball Cup Squads - Max words in a section = 27
2004 UCI Road World Championships – Men's road race - Max words in a section = 40
European Sprint Swimming Championships 1994 - Max words in a section = 46
National Marathon champions (men) - Max words in a section = 103
Athletics at the 1980 Summer Olympics – Men's 3000 metre steeplechase - Max words in a section = 212
European Sprint Swimming Championships 1992 - Max words in a section = 49
Water polo at the 1988 Summer Olympics - Max words in a section = 112
Cycling at the 1992 Summer Olympics – Men's individual road race - Max words in a section = 152
Hockey at the 1999 Pan American Games - Max words in a section = 119
Squash at the 2007 Pan American Games - Max words in a section = 54
Athletics at the 1992 Summer Olympics – Men's 1500 metres - Max words in a section = 41
European Sprint Swimming Championships 1993 - Max words in a section = 104
Swimming at the 1995 Pan American Games - Max words in a section = 33

The program needs refinement: in 2009 Vuelta a Colombia it is being fooled by the list of teams near the end, and I need to work out how to spot that. You can see that the copyvio article mentioned has a word count of 212. Is this an approach worth pursuing further? Boissière (talk) 22:51, 5 September 2010 (UTC)
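
For anyone following along, the counting idea described above could be sketched roughly as below. This is not the actual program; the function names (split_sections, max_section_words) are made up for illustration, and the wikitext handling is an assumption: headings are lines wrapped in "=" signs, table cells start with "|" or "!" and may be separated inline by "||" or "!!". It does not strip templates, link targets or cell attributes, so the counts are only approximate.

```python
import re

HEADING = re.compile(r"^=+.*=+$")   # e.g. "== Results =="
WORD = re.compile(r"\w+")

def split_sections(wikitext):
    """Split wikitext into chunks: text between headings, plus one chunk per table cell."""
    chunks = [[]]
    in_table = False
    for line in wikitext.splitlines():
        stripped = line.strip()
        if stripped.startswith("{|"):        # table start: attributes are ignored
            in_table = True
            chunks.append([])
        elif stripped.startswith("|}"):      # table end
            in_table = False
            chunks.append([])
        elif in_table and stripped.startswith("|-"):   # row separator
            chunks.append([])
        elif in_table and stripped.startswith(("|", "!")):
            # one chunk per cell; cells on one line are split on "||" or "!!"
            for cell in re.split(r"\|\||!!", stripped[1:]):
                chunks.append([cell])
        elif not in_table and HEADING.match(stripped):
            chunks.append([])                # a heading starts a new section
        else:
            chunks[-1].append(stripped)      # ordinary text, or continuation of a cell
    return [" ".join(chunk) for chunk in chunks]

def max_section_words(wikitext):
    """Word count of the largest chunk (section or table cell)."""
    return max(len(WORD.findall(chunk)) for chunk in split_sections(wikitext))
```

Because every table cell is its own chunk, a page that is mostly tables gives a low number, while a long bulleted list inside a normal section (like the team list mentioned above) would still be counted as prose.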

Definitely worth doing, as I imagine the large number of biographies will be the most difficult task to tackle. This will narrow them down immensely because so many of Darius's created biographies are just one or two sentences followed by tables. SFB/talk 20:59, 8 September 2010 (UTC)

Thanks for the feedback. I have held off from this for a bit due to all the hoo-ha related to this CCI as well as the problems with the program mentioned above. However, I have tweaked the program a bit to separately give the size of the lead and the maximum size of the other sections of an article. The results of scanning the first 333 articles are given here. I am spurred on by the probability that the articles are going to be blanked, which will cause me a few problems as the program simply reads the latest version. Boissière (talk) 11:36, 10 September 2010 (UTC)
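
A rough sketch of the lead-versus-other-sections split described above, under the assumption that the lead is everything before the first heading. The function name lead_and_max_other is hypothetical, and to keep it short this version splits on headings only (no per-cell handling of tables), so it is an illustration of the idea rather than the program itself.

```python
import re

HEADING = re.compile(r"^=+.*=+$")   # e.g. "== Career =="

def lead_and_max_other(wikitext):
    """Return (words in the lead, largest word count among the later sections)."""
    counts = [0]                     # counts[0] is the lead
    for line in wikitext.splitlines():
        stripped = line.strip()
        if HEADING.match(stripped):
            counts.append(0)         # a heading starts a new section
        else:
            counts[-1] += len(re.findall(r"\w+", stripped))
    return counts[0], max(counts[1:], default=0)
```

Reporting the two numbers separately would let a one-sentence lead followed by a long prose section stand out from a page where all the text is in the lead.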