{{bot|ThaddeusB|status=inactive}}


'''Update (12/23/14)''' - I am working on getting the BOT up and running again. There seem to be some technical problems with WebCite (requests giving timeout messages, but actually completing) at the moment. I am working on a workaround for my code. --[[User:ThaddeusB|ThaddeusB]] ([[User talk:ThaddeusB|talk]]) 18:58, 23 December 2014 (UTC)

WebCiteBOT's purpose is to combat [[link rot]] by automatically [[WebCite|WebCiting]] newly added URLs. It is written in Perl and runs automatically with only occasional supervision.


A complete log of the bot's activity, organized by date, can be found under [[User:WebCiteBOT/Logs/]]. Some interesting statistics related to its operation can be found at [[User:WebCiteBOT/Stats]].


'''Operation:''' WebCiteBOT monitors the URL addition feed at [[IRC]] channel #wikipedia-en-spam and notes the time of each addition, but takes no immediate action. After 48 hours (or more) have passed it goes back and checks the article to see if the new link is still in place and if it is used as a reference (i.e. not as an external link). These precautions help prevent the archiving of spam/unneeded URLs.
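
The bot's source isn't reproduced here, but the deferred check described above can be pictured with a short sketch. The following Perl is illustrative only: the entry fields and function name are assumptions, not taken from WebCiteBOT's actual code.

<syntaxhighlight lang="perl">
use strict;
use warnings;

# Decide whether a queued link is ready to be archived: it must be at
# least 48 hours old, still present in the current wikitext, and used
# inside a <ref>...</ref> tag rather than as a bare external link.
# $entry is a hashref of the form { url => $url, added => $epoch_seconds }.
sub link_ready_for_archive {
    my ( $entry, $wikitext ) = @_;

    return 0 if time() - $entry->{added} < 48 * 3600;    # wait at least 48 hours
    return 0 if index( $wikitext, $entry->{url} ) < 0;   # link no longer on the page

    # Accept the link only if some occurrence of it sits inside a reference tag.
    my $url = quotemeta $entry->{url};
    return $wikitext =~ /<ref[^>]*>[^<]*$url/i ? 1 : 0;
}
</syntaxhighlight>

A real implementation has to be more forgiving than this regex, since references are often wrapped in templates such as {{tl|cite web}} and the URL may be encoded differently in the wikitext (see the known issues below).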


Articles that have been tagged/nominated for deletion are skipped until the issue is resolved.
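
One way to implement that check (a sketch under assumptions, not necessarily how WebCiteBOT does it) is a simple pattern match against the article's wikitext for the common deletion templates:

<syntaxhighlight lang="perl">
use strict;
use warnings;

# Return true if the wikitext carries a deletion-related tag, in which
# case the article should be skipped until the discussion is resolved.
# The template names matched here are illustrative, not exhaustive.
sub tagged_for_deletion {
    my ($wikitext) = @_;
    return $wikitext =~ /\{\{\s*(?:article for deletion|afd|proposed deletion|prod|db-)/i ? 1 : 0;
}
</syntaxhighlight>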


For each valid reference it finds, WebCiteBOT first checks its database to see if a recent archive was made. If not, it checks the functionality of the link. Valid links are submitted for archiving at WebCitation.org, while dead links are tagged with {{tl|dead link}}. After the archival attempt has had time to complete, the bot checks the archive's status and updates the corresponding Wikipedia page if the archive was completed successfully. It will also attempt to add title, author, and other [[metadata]] that wasn't supplied by the human who added the link.
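
The liveness check in that pipeline is the part most easily shown in code. The sketch below uses LWP::UserAgent; it is a minimal illustration under assumptions (the user-agent string is made up, and the database lookup, the WebCitation.org submission, and the {{tl|dead link}} edit are all left out):

<syntaxhighlight lang="perl">
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    agent   => 'WebCiteBOT-sketch/0.1',   # illustrative agent string
    timeout => 30,
);

# Check whether a reference URL is still reachable.  A HEAD request is
# tried first; some servers reject HEAD, so fall back to a full GET.
sub link_is_alive {
    my ($url) = @_;
    my $resp = $ua->head($url);
    $resp = $ua->get($url) unless $resp->is_success;
    return $resp->is_success;
}

# Decide what to do with one reference: live links go to WebCitation.org
# for archiving, dead ones get tagged with {{dead link}} on the page.
sub reference_action {
    my ($url) = @_;
    return link_is_alive($url) ? 'submit_for_archiving' : 'tag_dead_link';
}
</syntaxhighlight>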


'''Features not yet implemented:'''
* Ability to archive all links on a specific page on demand
* Build database of "problem" sites to save time
* <s>Tag invalid links with {{tl|dead link}}</s> (Implemented June 6, 2009)
* More robust capture of metadata; build db of human supplied metadata to assist bot in determining certain items ('''update:''' Bot is now capturing human entered data for each page it loads in order to build this db)
* Attempt to locate archive for older links when updating a page (''maybe'')


'''Known Issues/Limitations:'''
* Some link additions are not reported to #wikipedia-en-spam (likely because there are too many edits for the reporting bot to examine every one) and thus are not caught by WebCiteBOT.
* The link reporting bot will "un-encode" characters that are [[URL encode]]d (e.g. "%80%99"), which makes the bot unable to find the link in the wikitext, so the link is reported as "removed". (A workaround was added to the code February 26, 2012 to "save" a few of these; see the sketch after this list.)
* WebCiteBOT is not able to distinguish between true new additions and additions caused by reverts and such. Thus, sometimes a "new" link is actually fairly old and the archived version may not match the version the original editor saw.
* WebCitation.org does not archive some pages due to [[Robots exclusion standard|robot restrictions]]. A small number of additional pages are archived incorrectly. (WebCiteBOT normally catches these and doesn't link to them.)
* WebCiteBOT does not follow redirects. This means if a page is moved after a link is added, but before the bot looks at it, it will be reported as "(link) has been removed". It is not clear to me whether following redirects would be a desirable behavior or not.
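
The "un-encode" problem mentioned above is essentially a matching problem: the IRC feed may report a percent-decoded form of a URL that is stored encoded in the wikitext. A workaround along the following lines (a sketch only; the actual February 2012 fix may differ) compares both forms before declaring the link removed:

<syntaxhighlight lang="perl">
use strict;
use warnings;
use URI::Escape qw(uri_unescape);

# The IRC feed sometimes reports a percent-decoded form of the URL that
# was actually saved in the wikitext.  Before declaring the link
# "removed", compare the reported URL against each URL in the wikitext,
# both as-is and after percent-decoding the wikitext's version.
sub find_link_in_wikitext {
    my ( $reported_url, $wikitext ) = @_;
    for my $candidate ( $wikitext =~ /(https?:\/\/[^\s\]<>|}]+)/g ) {
        return $candidate
            if $candidate eq $reported_url
            or uri_unescape($candidate) eq $reported_url;
    }
    return undef;    # the link really does appear to be gone
}
</syntaxhighlight>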


Feel free to [http://en.wikipedia.org/w/index.php?title=User_talk:WebCiteBOT&action=edit&section=new make a suggestion] to improve the bot.


{{User:UBX/WebCite}}
{{clear}}
===Frequently Asked Questions===
'''''Q. I just added a new URL to <nowiki>[[some page]]</nowiki>; what should I do now?'''''
:A. You don't have to do anything. The bot constantly monitors an IRC feed which reports most link additions. It stores every link reported there and archives them after 2–3 days. A feature is currently in the works to allow on-demand archiving of very time-sensitive links, but for now the bot relies entirely on the IRC feed.


'''''Q. Why wasn't <nowiki>[http://somepage.com/somefile.htm]</nowiki> archived?'''''
:A. The most common reason is that the website in question has a [[Robots exclusion standard|robots restriction]] in place that asks robots not to [[cache]] its content. However, there are a number of other possibilities as well. (See the known limitations section above.)
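
For readers curious what such a restriction looks like from the crawler's side, here is a small sketch of testing a URL against a site's robots.txt with the standard WWW::RobotRules module. This is illustrative only: the actual decision is made by WebCitation.org, not by WebCiteBOT, and the agent name below is made up.

<syntaxhighlight lang="perl">
use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

# Test whether robots.txt permits an archiving robot to fetch a URL.
# This mirrors the kind of restriction that prevents WebCitation.org
# from archiving some pages; the agent name is purely illustrative.
sub allowed_by_robots {
    my ($url)  = @_;
    my ($site) = $url =~ m{^(https?://[^/]+)};
    return 0 unless defined $site;    # not an http(s) URL
    my $rules  = WWW::RobotRules->new('ExampleArchiver/1.0');
    my $robots = get("$site/robots.txt");
    $rules->parse( "$site/robots.txt", $robots ) if defined $robots;
    return $rules->allowed($url);
}
</syntaxhighlight>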



'''''Q. Why are there screwed up [[UTF8]] characters in the log?'''''
:A. Unfortunately the IRC feed the bot relies on sometimes messes up two-byte characters. WebCiteBOT has been programmed so that, if the title as provided doesn't exist, it tries an alternative title in which the "messed up" characters are corrected based on common patterns. It can only do this after first checking the title as provided, as sometimes titles that ''look'' messed up actually aren't. The log always reflects the first title tried, but the actual operation of the bot uses the corrected title when it can figure one out.
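
The "common patterns" the bot corrects typically come from UTF-8 text that was mis-read as a single-byte encoding somewhere in the IRC pipeline. A standard repair for that class of garbling looks roughly like the following; this is a sketch of the general technique, not WebCiteBOT's actual code.

<syntaxhighlight lang="perl">
use strict;
use warnings;
use Encode qw(encode decode);

# Attempt to repair a title whose multi-byte UTF-8 characters were
# garbled by being interpreted as a single-byte encoding.  Re-encoding
# the garbled string as CP1252 recovers the original raw bytes, which
# are then decoded as UTF-8.  Returns undef when the repair doesn't apply.
sub repair_title {
    my ($garbled) = @_;
    my $fixed = eval {
        decode( 'UTF-8', encode( 'cp1252', $garbled, Encode::FB_CROAK ), Encode::FB_CROAK );
    };
    return ( defined $fixed && $fixed ne $garbled ) ? $fixed : undef;
}
</syntaxhighlight>

As described above, the repaired title would only be used after the title as provided has been checked and found not to exist.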

==Recognition==
{| style="border: 1px solid gray; background-color: #fdffe7;"
|rowspan="2" valign="middle" | [[Image:Vitruvian Barnstar.png|75px]]
|rowspan="2" |
|style="font-size: x-large; padding: 0; vertical-align: middle; height: 1.1em;" | '''The da Vinci Barnstar'''
|-
|style="vertical-align: middle; border-top: 1px solid gray;" | Excellent bot, thank you. [[User:Jezhotwells|Jezhotwells]] ([[User talk:Jezhotwells|talk]]) 23:05, 10 June 2009 (UTC)
|}
