Wikipedia:Bots/Requests for approval/Cronbot
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section.
Owner | User:Omicronpersei8 |
---|---|
Function | Tor open proxy detection and reporting |
Project | Wikipedia:WikiProject on open proxies |
Relevant pages | User:Cronbot/sb, User:Cronbot/PGP help, Harvard Tor exit node list |
Language | English |
Program | Python/pywikipedia framework |
Mode | Automatic |
Frequency | Always running: currently 60 seconds per page access, and 15 minutes minimum per edit; see User:Cronbot/sb for actual intervals |
Tracking Tor proxies, according to one Wikipedia:OP project member, is "a bloody nightmare". This is because Tor network topology is obscured by design, and while direct connections to suspected "exit nodes" are possible, two main problems arise: the list of exit nodes is constantly changing, and there are many "private nodes", a term I started using to describe nodes not listed on public Tor-tracking databases like the one at Harvard. The bottom line is that Tor nodes are ridiculously numerous, hard to track, and sometimes undetectable without going through the Tor network manually.
Cronbot addresses both of these issues. For public nodes, the bot will simply check Wikipedia's block logs for each IP listed (e.g. at the Harvard database); private nodes will be discovered by timed reconnections to the Tor network itself. Once the bot connects to a viable open proxy, it will try to post that proxy's open status on the User:Cronbot/sb page. We will know it really is the bot posting, because it will sign every post of this nature with a cleartext PGP signature.
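To make the public-node pass concrete, here is a minimal sketch of how such a block-log check might look. It is illustrative only, not Cronbot's actual code: the Special:Ipblocklist search URL matches the one discussed later on this page, but the function name and the substring heuristic are assumptions.

 import urllib.parse
 import urllib.request

 BLOCKLIST_URL = "http://en.wikipedia.org/enwiki/w/index.php"

 def appears_blocked(ip):
     """Ask Special:Ipblocklist whether any block entry exists for this IP."""
     query = urllib.parse.urlencode({
         "title": "Special:Ipblocklist",
         "action": "search",
         "limit": "",
         "ip": ip,
     })
     with urllib.request.urlopen(BLOCKLIST_URL + "?" + query) as resp:
         html = resp.read().decode("utf-8", "replace")
     # Assumption: the results page lists blocks as <li> items and is empty
     # otherwise; a production bot would want a sturdier parse than this.
     return "<li>" in html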
More about Tor nodes as they relate to this project can be found at User:Cronbot/PGP help.
Cronbot will post open "public" and "private" nodes on its userspace project page (currently in progress at User:Cronbot/sb). Every IP listed is implicitly on the editing chopping block, since it would be difficult for project members to rigorously check these nodes themselves beyond eyeballing the relevant public lists; that is, the VCN method is ineffective here. Project members also need not post that a node has been blocked on the page in question: first, the bot will detect blocked nodes and remove them from the list automatically (delisting can be sped up by placing a recheck request in the messages section); second, the entire page is blanked on every refresh save for the transcluded message area at the bottom, as sketched below.
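As an illustration of the refresh cycle just described, one page rebuild might look roughly like the following. The helper names and the transclusion marker are hypothetical; the real bot runs on the pywikipedia framework.

 # Hypothetical sketch of one page refresh: only still-unblocked nodes survive,
 # and everything else is blanked except the transcluded message area.
 MESSAGE_AREA = "{{User:Cronbot/sb/messages}}"  # assumed transclusion marker

 def rebuild_report_page(known_nodes, is_blocked):
     """Return fresh wikitext for the report page from the current node set."""
     lines = ["== Open Tor exit nodes =="]
     for ip in sorted(known_nodes):
         if is_blocked(ip):
             continue  # blocked nodes are silently delisted
         lines.append("* %s" % ip)
     lines.append(MESSAGE_AREA)
     return "\n".join(lines)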
Speaking of transclusion, the current idea is to transclude the bot's userspace project page on Wikipedia:OP. As of now, I have not really brought this idea up to the OP project members, but it would appear we have all been a bit too busy to monitor the project talk page anyway. This is certainly not a final decision, but I feel it is a good idea.
This is a useful bot, but it has issues to consider. There are foreseeable hazards associated with it, and I feel it prudent to address them in full here rather than activate potentially risky behaviors without proper authorization from reviewers. First, there are scores of open Tor exit nodes on the Internet, most of which are currently unblocked from editing Wikipedia. The only concrete count of Tor exit nodes comes from the Harvard list, which is itself hundreds of lines long. Indeed, by the maintainer's calculations, Cronbot would have to run for over 14 hours to complete one pass of the list while checking one open proxy every sixty seconds (at that rate, 14 hours covers only about 840 entries).
Beyond this, one key question remains: since this is a moderately dangerous bot, how can we trust its output? This bot has the potential to cause many editors to be blocked from editing. To be sure, there are specific points of trust to consider for each of the bot's main functions, and each source of trust must be both verifiable and understood to be only as trustworthy as its own authority and dependability allow.
For clarity, here are the main trust issues with this bot in list form (feel free to add). I realize some of these points are givens (and overdone, much like this proposal, perhaps), but please bear with me anyway, as I just want to make sure there are no strong objections to what I'm doing.
Trust issue | Sources of trust | Risk | Cronbot's safeguards |
---|---|---|---|
That a node is in fact an open Tor proxy | The fidelity of the Harvard open proxy list; adequate checking, good intention, and understanding of the bot's operation by project members and administrators; that the bot maintainer is not deceptively posting through the bot; that Wikipedia's block logs are up to date and accurate | Faulty editing blocks | IPs are only reported when they are found not to be indefinitely blocked; the blocking decision is up to an administrator |
That the person reporting to be behind an open, unblocked proxy is in fact the bot | The bot's PGP clearsigned signature; that the bot's private PGP key has not been stolen, and that its native environment is similarly secure; that the IP being reported matches the IP of the reporting editor; adequate checking of the PGP key by project members; that the public PGP key stored on MIT's PGP servers is the "official" one and matches what is on project members' PGP keyrings and the bot's report page; the assumption that PGP "works" | Faulty editing blocks; the risk of inadvertently blocking large IP address pools | A secure private PGP key, as assumed by a faithful community; semi-permanent storage of the public PGP key on MIT's servers and locally by Cronbot |
That the bot is not a server hog | Good selection of run intervals; that the pages retrieved are conservative of bandwidth; that the bot is doing what the maintainer claims; that the source code has been properly evaluated and is indicative of the actual code and bot function | Bandwidth bills; network damage (whatever this may entail); site downtime | Talk page halts; auto-slowing mechanisms built into the pywikipedia framework; fairly consistent "low-bandwidth page" principles; easily reprogrammed intervals as the community desires |
That the bot is otherwise "safe" | That the bot operates within its defined constraints (in this case, that it indeed cannot block editors itself); that the bot is well designed | Unfair blocks; damage to the system; a haywire bot with administrative privileges | Neither I nor my bot are administrators, nor do we plan to be; the community's faith in the maintainer as adhering to principles of good faith, openness, admission of mistakes, and desire to fix problems |
That this is all "worth it" | The assumption that open proxies need to be blocked to protect legitimate edits; the assumption that Tor proxies must follow the same rules as other open proxies; that these methods of proxy checking are tedious and difficult to do manually | The waste of time, bandwidth, effort, and attention | No guarantees other than the truth of future results |
Progress
The bot will probably take 1–2 more weeks to complete.
Points worth mentioning
- The bot does not run in threads, but is made up of several different Python scripts that intercommunicate through lockfiles and can be stopped globally with certain "killswitch" files (a minimal sketch of this mechanism follows this list).
- I will not publish my source code on a userspace page. Instead, I will allow registered users to send an e-mail to the bot, which will cause it to automatically fire back a response with its current source code attached. I feel that this is more conservative of database space, more promising of being up-to-date (as frequent uncontroversial changes will be unavoidable at first), and sufficiently on-demand. It is in no way intended as a means of keeping the source secret.
- I will conform to the "emergency shutoff compliant bots" standard, but may resize the button on the user page, just because it's so big and ugly.
- Again, talk page posts do halt the bot – all of the different scripts that comprise it.
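Here is a minimal sketch of the lockfile/killswitch coordination mentioned in the first point above, assuming hypothetical file names; the real scripts may differ in the details.

 import os
 import sys
 import time

 KILLSWITCH = "cronbot.kill"  # hypothetical: touch this file to stop every script
 LOCKFILE = "cronbot.lock"    # hypothetical: serializes the scripts' shared work

 def run_cycle(work):
     """Run one unit of work, honoring the killswitch and the shared lockfile."""
     if os.path.exists(KILLSWITCH):
         sys.exit("killswitch file present; stopping")
     while True:
         try:
             # O_CREAT | O_EXCL makes lock acquisition atomic
             fd = os.open(LOCKFILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
             break
         except OSError:
             time.sleep(1)  # a sibling script holds the lock
     try:
         work()
     finally:
         os.close(fd)
         os.remove(LOCKFILE)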
Platform miscellanea
- OS: Linux 2.6.16-2-k7 #2 Mon May 22 23:23:54 UTC 2006 i686 GNU/Linux
- Python version: 2.3.5
- GnuPG version: 1.4.1
- Memory: 1 GB DDR; 1011 MB reported by OS
 processor       : 0
 vendor_id       : AuthenticAMD
 cpu family      : 6
 model           : 8
 model name      : AMD Athlon(tm) XP 2200+
 stepping        : 0
 cpu MHz         : 1797.522
 cache size      : 256 KB
 fdiv_bug        : no
 hlt_bug         : no
 f00f_bug        : no
 coma_bug        : no
 fpu             : yes
 fpu_exception   : yes
 cpuid level     : 1
 wp              : yes
 flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow ts
 bogomips        : 3601.88
Updated 10:27, 7 September 2006 (UTC)
Questions for the approvals group
- Should I revise my "timetables"? Currently, public proxies are to be checked at a rate of once per minute. This is a "development" value and may be too fast (or, having re-read Wikipedia:BOT, too slow), but the problem at hand is how to get through such long lists. Would it be preferable for me to do timed refreshes per node, i.e. give every listed exit node its own expiration date, staggered relative to the others? (A sketch of this idea follows these questions.)
- Does anyone object to the private proxy-checking methods? Does anyone object to the use of PGP?
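For the first question, a per-node schedule could be as simple as a priority queue with jittered recheck times, so that rechecks spread out instead of arriving in one burst. This is a sketch of the idea only; the interval is an assumed placeholder, not a proposed value.

 import heapq
 import random
 import time

 BASE_INTERVAL = 60 * 60  # assumed placeholder: recheck each node roughly hourly

 def make_schedule(nodes):
     """Build a queue of (next_check_time, ip) pairs with staggered start times."""
     now = time.time()
     queue = [(now + random.uniform(0, BASE_INTERVAL), ip) for ip in nodes]
     heapq.heapify(queue)
     return queue

 def next_due(queue):
     """Wait for the next node's expiration, reschedule it, and return its IP."""
     due, ip = heapq.heappop(queue)
     time.sleep(max(0, due - time.time()))
     heapq.heappush(queue, (time.time() + BASE_INTERVAL, ip))
     return ip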
Thank you. -- Omicronpersei8 (talk) 12:18, 6 September 2006 (UTC)[reply]
- Well, as someone active in Wikipedia:OP myself, this does seem quite useful. You mentioned "talk page halts"; what exactly do you mean by that? Also, how often will this be checking the block logs at Harvard and the list of active blocks here (to check for IPs already banned)? If this is once a minute, it's probably not the end of the world, considering how heavy the RC crawling bots are in server use, but I don't see a need for it to be that high, and the Harvard blocked-IP list site, though it's unlikely, may block you as a "GET" spammer. Perhaps every fifteen minutes is better. How fast does the list change normally? Even if you use the list of active blocks, won't you have to check every IP from the Harvard list to see if it was already indef blocked? Perhaps you could generate a sample list. Additionally, it might be nice if you tagged the IPs by adding a category to the talk page that mentions Tor (this could be picked up by User:VoABot II, which could cause it to check the edit diff). Thanks. Voice-of-All 15:06, 6 September 2006 (UTC)[reply]
- First, I applaud you for being brave enough to wade through all the lengthy, half-asleep rambling I just plastered here.
By "talk halts" I just mean that any new messages to User talk:Cronbot will bring the bot to a stop as policy recommends.
The Harvard site's block logs will be checked once an hour. At this rate, I think they might let me go without throttling me – I am making good use of the list, after all. I'm actually more concerned about the remote possibility of being banned from Tor, which would just mean no more "private" searching. (Sorry about the interval listings in seconds on the sample page I linked; that'll be changed soon to display minute representations instead.) I have no real idea of how quickly the Harvard list is updated. I thought it was on-demand, but I'm not sure. Either way, using this and possibly another site mentioned at Wikipedia talk:OP and backing this up with the "stealth" node discovery should be a good collective Tor countermeasure, I think.
I don't think I'll be using the list of active blocks (got a link to this page? I couldn't find one in my quick search), but rather just checking the block logs for every individual IP harvested. But yes, I will have to check every IP. Given that the properties of blocked/unblocked and open/closed are in flux, making guesses based on history doesn't seem to be a good idea. I hope I addressed your comment correctly.
- I like the category idea, and am willing to implement it. I could make the bot put this category on the talk page or user page, and could also make it place {{Tor}} on successfully blocked pages (it'll detect blocks itself, but rather slowly in its current setup). {{unblock}} checking is another definite possibility. -- Omicronpersei8 (talk) 16:15, 6 September 2006 (UTC); Updated 23:34, 6 September 2006 (UTC)[reply]
- OK, the timing looks fine if it's hourly. Perhaps you could explain the private checking more, though it seems OK. The talk page tagging is fine. As for checking whether an IP is blocked, the page I was referring to was this[1], which seems better than wading through the whole block log. I suppose, though, that if your bot was constantly checking the block log and clearing its list as Tor IPs get blocked, maybe you could use the log. I don't know; whatever works best. Voice-of-All 03:08, 7 September 2006 (UTC)[reply]
- Oh! Yeah, that's actually exactly what I'm using, now that I go back and look at it. I originally thought you were talking about just going page-by-page through Special:Ipblocklist, but yeah, it turns out that's the URL I've been using plus arguments (specifically http://en.wikipedia.org/enwiki/w/index.php?title=Special:Ipblocklist&action=search&limit=&ip=127.0.0.1).
Basically, here's how the private checking scheme works: the bot connects to an open Tor exit node. It downloads the edit form for a page (currently User talk:Cronbot) with wget to see if a block notice is present. If one is, the bot waits for the next cycle. If not, it starts the reporting process.
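In Python, that probe might look roughly like this. It is a sketch under stated assumptions, not the bot's actual code: the Privoxy proxy address and the block-notice marker string are guesses, and the real bot shells out to wget on its own schedule.

 import subprocess

 EDIT_URL = "http://en.wikipedia.org/enwiki/w/index.php?title=User_talk:Cronbot&action=edit"

 def exit_node_is_blocked():
     """Fetch the edit form through the local Tor proxy and look for a block notice."""
     # Assumption: wget reaches Tor via a local HTTP proxy such as Privoxy on 8118.
     result = subprocess.run(
         ["wget", "-qO-", "-e", "use_proxy=yes",
          "-e", "http_proxy=http://127.0.0.1:8118/", EDIT_URL],
         capture_output=True, text=True)
     # Assumption: the block notice is recognizable by a marker string; the
     # real check may need to be sturdier than a substring test.
     return "You are currently blocked" in result.stdout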
Cronbot has a private PGP key that only it can use, plus a published public key. Before posting, it runs this console command (replacing '127.0.0.1' with the appropriate IP address):

 gpg --clearsign <<< 'Cronbot: 127.0.0.1'

This produces a clearly viewable message with a signature attached: a hash of the message ("Cronbot: 127.0.0.1") encoded with the private key. The signature adds approximately 768 bytes to each unlisted-proxy report. To check that a report is a valid message from Cronbot, a project member should have a PGP tool installed with Cronbot's public key on his or her keyring. The message the bot posted can then simply be copied and pasted into the project member's PGP program; if the public key successfully verifies the signature made with the private key, we can assume it's a valid posting through the bot. Otherwise, something is afoot. Project members are of course free to skip verification and assume it's really the bot posting, but in my eyes that's a risky move that could result in faulty blocks, and not a practice I as the creator can recommend.
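For completeness, here is one way a project member could script that verification step using standard GnuPG behavior (gpg --verify exits 0 on a good signature). The function name is hypothetical, and it assumes Cronbot's public key has already been imported to the local keyring.

 import os
 import subprocess
 import tempfile

 def signature_is_valid(clearsigned_text):
     """Return True if gpg accepts the clearsigned report."""
     with tempfile.NamedTemporaryFile("w", suffix=".asc", delete=False) as f:
         f.write(clearsigned_text)
         path = f.name
     try:
         # gpg --verify exits 0 on a good signature, nonzero otherwise
         result = subprocess.run(["gpg", "--verify", path], capture_output=True)
         return result.returncode == 0
     finally:
         os.unlink(path)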
Please let me know if I haven't answered your question correctly. You can also read some other points about this bot's use of PGP and how to set up the verification process at User:Cronbot/PGP help. -- Omicronpersei8 (talk) 10:31, 7 September 2006 (UTC)[reply]
- PGP is fine, though GPGP would be better :) Signing the edits with PGP may be overkill (wikisigs should be enough), but that's up to you, as the processing is on your side. I'd like to see this bot's usefulness discussed on the project talk page first, to make sure everyone dealing with the project has a chance for input (it may need to be on meta as well). — xaosflux Talk 03:20, 7 September 2006 (UTC)[reply]
- Guinea Pigs Get Paid? Seriously though – it's cool, I understand how PGP might be overkill, but I'm doing it as a precaution against one specific scenario. That is, it seems feasible and easy to post from behind, for example, Starhub, fake the signature of the bot, and get Singapore blocked because an admin not aware of what's going on trusts a fake report. Am I wrong about this? I don't really see a way that one can assume definitively that it's Cronbot posting without going by an authentication scheme (ignoring the fact that PGP isn't necessarily a cure-all either). -- Omicronpersei8 (talk) 10:00, 7 September 2006 (UTC)[reply]
- P.S. If GPGP is analogous to OpenPGP, then yeah, that's what I'm using. -- Omicronpersei8 (talk) 10:25, 7 September 2006 (UTC)[reply]
- LOL, I got a little overkill with my acronym letters there; I was referring to GnuPrivacyGuard, but OPGP is good too :) — xaosflux Talk 12:14, 7 September 2006 (UTC)[reply]
- Oh, well then you're in luck, as that's exactly what I'm using. -- Omicronpersei8 (talk) 12:17, 7 September 2006 (UTC)[reply]
Oh heck, Xaosflux, this thing is godly. We've needed something like this for a while; all we need to do is softblock the Tor exit nodes and we're good :) - The rate looks fine; I assume you don't need a bot flag -- Tawker 18:25, 7 September 2006 (UTC)[reply]
- Well, on that note, this is approved for trial: 2–4 weeks. Keep track of any bugs and post a report here. — xaosflux Talk 01:58, 8 September 2006 (UTC)[reply]
Thanks everybody! -- Omicronpersei8 (talk) 02:07, 8 September 2006 (UTC)[reply]
- This bot seems not to have made any edits in a while; how is the testing going? Voice-of-All 23:26, 22 September 2006 (UTC)[reply]
- I'm really sorry, but I suddenly got busy. I made a note on the bot's user page, but I guess I was afraid to waste space here with it. Anyway, I'm not going to make the deadline. I would ask if I should reapply, but pgk has brought up some good points below that I may need to evaluate (or reevaluate) before proceeding. -- Omicronpersei8 (talk) 05:30, 30 September 2006 (UTC)[reply]
- I've worked on something similar in the past (with a different mechanism for retrieving Tor details), which I've offered to share if it will help speed up development. I do have a couple of problems with this. (a) This is effectively publishing a list of Tor nodes which aren't blocked; it is possible to get Tor to use a specific exit proxy, so this could be misused. (b) What should we actually be doing with Tor? There are more Tor nodes unblocked than are currently blocked (though the proxyblocker covers some, perhaps many, of the unblocked ones). As we know, some users in China and the like rely on Tor for editing; should we be blocking these irrespective of abuse? If we are, should we be blocking anon-only? If so, is an admin bot a better proposal for actually doing the work? (c) Connected to (b), many Tor nodes are actually on dynamic addresses, so unblocking them when they move is also important. --pgk 15:57, 23 September 2006 (UTC)[reply]
- First, thanks so much for your offer and your experienced help. Regarding your concerns:
(a) I agree that this is a conceivable problem, but since Tor nodes can only be specifically targeted using node nicknames, the assumption here is that all Tor nodes are accessible this way, and I'm not sure that's the case. Are you? As I've said before at some point, my bot is currently designed to read IPs off a public list and process them into a list here, and also to do manual, random connects to Tor nodes from within the network. The ones found in the network are identified only by IP. Might it solve the problem you're describing if the bot published only IPs and not hostnames?
(b) This is something I tried to ignore since I'm not an admin and therefore "It's Not My Problem", especially considering the preexistence of policy against Tor nodes (or what I thought was policy). My original intent was really to leave this call up to the blocking admin, but if I'm doing more harm than good (e.g. for Chinese users), I'm all ears. I would be in support of not blocking a node until abuse happens on it, although that is somewhat inviting vandals to deliberately ruin it for legitimate users, but this is just my two cents.
(c) This is also something I was planning to leave out of my scope of concern, as I know that if this bot is ever finished, it will cause a lot of {{unblock}}s already. Sorry if that sounds callous, but again, my intent here was just to make a bot that helps in following policy, blindly or not.
Regardless of what I've said above, if it is felt that this bot would be damaging in the long run, please let me know. I don't want to be responsible for something like that, nor do I particularly like the idea of incurring the wrath of thousands. -- Omicronpersei8 (talk) 05:30, 30 September 2006 (UTC)[reply]
- (a) Yes, you can explicitly route through any router; each router also has a fingerprint which can be specified to select the exit node. (The Tor documentation describes the node registration that links a node's name to its fingerprint.)
(b) Well, policy is descriptive, not prescriptive; i.e. it reflects what we actually do, and currently we don't block all Tor nodes, so turning around and blocking them all may be problematic. Also, we have since had the change in what we can do regarding blocking, so it would seem sensible to get some broader input on this issue.
(c) I don't think we can ignore this issue. If the result of this is that a string of IPs get blocked as Tor which aren't, it is certain to cause problems; even if users request unblock, the chances of them getting unblocked promptly seem limited.
So my concerns still stand: we'd be publishing a list of usable proxies, we'd potentially block useful editors against our current practice, and we wouldn't have a mechanism in place to deal with IPs which no longer operate Tor nodes, either because a node stops running Tor or, in the worse case, because dynamic IPs change on a regular basis. --pgk 08:54, 30 September 2006 (UTC)[reply]
Just to make a poor but well-intentioned update for the record: it would not take me more than a week to finish the bot as it stands. All that's really left is to finish the code that parses the public list, make the multiple script files that comprise the bot run in parallel reliably, and get anonymous posting working (since that's not something pywikipedia natively allows, as far as I know). -- Omicronpersei8 (talk) 05:35, 30 September 2006 (UTC)[reply]
Since this bot has been mildly controversial and certainly slow progress-wise, I am going to go ahead and fold it in light of this: Wikipedia:Requests for adminship/TawkerbotTorA
That bot will be faster, more efficient, and handled by more able folks (who are, additionally, administrators), even if it doesn't succeed in its RFA. I'm a little averse to giving up on this project, as I'm certainly willing to finish it, but in light of a better alternative (one which, from a selfish perspective, won't put the bulk of the responsibility for blocking Tor nodes on me, but rather in the hands of sysops), I'm going to go ahead and admit defeat on this issue. These guys have a cool-looking plan, and I think I will come to support it once I learn more about it.
I intend to have something else for Cronbot to do in the not-too-distant future which will probably not require as much work or careful consideration. In the meantime, it will remain my inactive sock account that I will certainly refrain from making edits through. If an administrator would like to make a posting on Wikipedia:BRFA indicating that this project is now cancelled and closed, he or she is more than welcome to do so.
Thanks for all your considerations, and sorry I didn't deliver.
-- Omicronpersei8 (talk) 06:39, 5 October 2006 (UTC) WITHDRAWN by operator — xaosflux Talk 12:33, 5 October 2006 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.