
User:WebCiteBOT

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by ThaddeusB (talk | contribs) at 17:36, 10 June 2009 (Frequently Asked Questions: update status - feature adding/testing is more or less complete, but WebCite downtime has inhibited progress greatly). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

WebCiteBOT's purpose is to combat link rot by automatically WebCiting newly added URLs. It is written in Perl and runs automatically with only occasional supervision.

A complete log of the bot's activity, organized by date, can be found under User:WebCiteBOT/Logs/. Some interesting statistics related to its operation can be found at User:WebCiteBOT/Stats.

Operation: WebCiteBOT monitors the URL addition feed at IRC channel #wikipedia-en-spam and notes the time of each addition, but takes no immediate action. After 48 hours (or more) have passed, it goes back and checks the article to see if the new link is still in place and if it is used as a reference (i.e. not as an external link). These precautions help prevent the archiving of spam/unneeded URLs.
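The delayed check described above can be sketched roughly as follows. This is an illustrative Python sketch, not the bot's actual Perl code: the function names, the `<ref>`-matching regular expression, and the exact 48-hour constant are assumptions made for the example.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the page says "48 hours (or more)", so 48 hours is used as the minimum wait.
WAIT = timedelta(hours=48)

# Simplified matcher for <ref>...</ref> bodies. It deliberately skips
# self-closing tags such as <ref name="x" /> and is not a full wikitext parser.
REF_RE = re.compile(r"<ref\b[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


def is_due(added_at, now):
    """True once the waiting period since the link was added has elapsed."""
    return now - added_at >= WAIT


def used_as_reference(wikitext, url):
    """True if the URL still appears inside at least one <ref>...</ref> body."""
    return any(url in body for body in REF_RE.findall(wikitext))


def should_archive(wikitext, url, added_at, now):
    """Combine both precautions: the wait has passed and the link is a reference."""
    return is_due(added_at, now) and used_as_reference(wikitext, url)
```

A link that was removed during the waiting period, or that appears only in an external links section, fails `used_as_reference` and is never submitted.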

Articles that have been tagged/nominated for deletion are skipped until the issue is resolved.

For each valid reference it finds, WebCiteBOT first checks its database to see if a recent archive was made. If not, it checks the functionality of the link. Valid links are submitted for archival at WebCitation.org, while dead links are tagged with {{dead link}}. After the archival attempt has had time to complete, the bot checks the archive's status and updates the corresponding Wikipedia page if the archive was completed successfully. It will also attempt to add title, author, and other metadata that wasn't supplied by the human who added the link.
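The per-link decision described above might look like the following. This is a hedged Python sketch rather than the bot's real logic: `archive_db` (a mapping from URL to the time of the last archive), the `link_is_alive` callback, the action names, and the 30-day meaning of "recent" are all assumptions, since the page does not specify them.

```python
from datetime import datetime, timedelta

# Assumption: the page only says "recent"; 30 days is a placeholder value.
RECENT = timedelta(days=30)


def next_action(url, archive_db, link_is_alive, now):
    """Decide what to do with one reference URL.

    archive_db    -- hypothetical dict mapping URL -> datetime of last archive
    link_is_alive -- hypothetical callable returning True if the URL responds
    """
    last_archived = archive_db.get(url)
    if last_archived is not None and now - last_archived <= RECENT:
        # A recent archive already exists, so no new submission is needed.
        return "reuse_archive"
    if not link_is_alive(url):
        # Dead links are tagged with {{dead link}} instead of being archived.
        return "tag_dead_link"
    # Working link without a recent archive: submit it for archival.
    return "submit_to_webcite"
```

Checking the database before testing the link avoids a network request for URLs that were archived recently.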

Features not yet implemented:

  • Ability to archive all links on a specific page on demand
  • Build database of "problem" sites to save time
  • Tag invalid links with {{dead link}} (Implemented June 6, 2009)
  • More robust capture of metadata; build db of human-supplied metadata to assist the bot in determining certain items (update: the bot now captures human-entered data for each page it loads in order to build this db)
  • Attempt to locate archive for older links when updating a page (maybe)

Known Issues/Limitations:

  • Some link additions are not reported to #wikipedia-en-spam (likely because there are too many edits for the reporting bot to examine every one) and thus are not caught by WebCiteBOT.
  • Some link addition reports get cut off due to IRC-imposed string limits. Thus, not all links are caught when approximately 4 or more links are added at once.
  • WebCiteBOT is not able to distinguish between true new additions and additions caused by reverts and such. Thus, sometimes a "new" link is actually fairly old and the archived version may not match the version the original editor saw.
  • WebCitation.org does not archive some pages due to robot restrictions. A small number of additional pages are archived incorrectly. (WebCiteBOT normally catches these and doesn't link to them.)

Feel free to make a suggestion to improve the bot.

Frequently Asked Questions

Q. I just added a new URL to [[some page]]; what should I do now?

A. You don't have to do anything. The bot constantly monitors an IRC feed which reports most link additions. It stores every link reported there and archives each one after 2-3 days. A feature is currently in the works to allow on-demand archival of very time-sensitive links, but for now the bot relies entirely on the IRC feed.

Q. Why wasn't [http://somepage.com/somefile.htm] archived?

A. The most common reason is the website in question has a robots restriction in place that asks robots not to cache their content. However, there are a number of other possibilities as well. (See known limitations section above.)

Q. Why is the BOT so far behind?

A. While waiting for final approval, it built up a substantial backlog. Additionally, even after approval ThaddeusB continued to tweak the code to make sure unusual cases are handled correctly and to add new features. This process is pretty much complete as of June 7, 2009. However, Webcitation.org has been having a lot of problems since at least June 3, and the site's frequent downtime has prevented the BOT from making much progress. All links have been saved and will be archived as soon as possible. The current backlog is about 500,000 links, so please be patient. :)

Q. Why are there screwed up UTF8 characters in the log?

A. Unfortunately, the IRC feed the bot relies on sometimes messes up two-byte characters. WebCiteBOT has been programmed to try an alternative title, in which "messed up" characters are corrected based on common patterns, if the first title it tries doesn't exist. It can only do this after first checking the title as provided, as sometimes titles that look messed up actually aren't. The log always reflects the first title tried, but the actual operation of the bot uses the corrected title when it can figure one out.
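The "title as provided first, corrected title second" order can be illustrated with a short Python sketch. The repair step here is an assumption: it undoes one common kind of damage (UTF-8 bytes misread as Latin-1), which may not match the patterns the bot actually uses, and `page_exists` is a hypothetical lookup supplied by the caller.

```python
def repair_mojibake(title):
    """Try to undo UTF-8 text that was mis-decoded as Latin-1.

    Returns the repaired title, or None if no repair applies.
    """
    try:
        fixed = title.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The title does not fit this damage pattern.
        return None
    return fixed if fixed != title else None


def resolve_title(title, page_exists):
    """Return the title to use, preferring the one given by the feed.

    page_exists -- hypothetical callable: True if a page with that title exists
    """
    if page_exists(title):
        # Titles that merely look damaged are used unchanged.
        return title
    fixed = repair_mojibake(title)
    if fixed is not None and page_exists(fixed):
        return fixed
    return None
```

Because the original title is always checked first, a page whose real title happens to contain characters such as "Ã" is never rewritten by mistake.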