Wikipedia:Link rot/URL change requests/Archives/2021/April

odiseos.net is now a gambling website

There were two references to this website. I have removed one. The archived URL still has the content. Should this citation be preserved or removed?

My edit and existing citation -- DaxServer (talk) 07:50, 9 April 2021 (UTC)

This is a usurped domain. Normally these would be changed to |url-status=usurped. The talk page instance was removed because the "External links modified" section can be removed; it is an old system no longer in use. I'll need to update the InternetArchiveBot database to indicate this domain should be blacklisted, but the service is currently down for maintenance. https://iabot.toolforge.org/ -- GreenC 17:10, 9 April 2021 (UTC)
I have also reverted my edit and included |url-status=usurped (new edit). -- DaxServer (talk) 20:33, 9 April 2021 (UTC)
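For reference, with |url-status=usurped the citation keeps readers on the archive snapshot rather than the live (now gambling) site. With made-up values, it looks roughly like:

  {{cite web |url=http://odiseos.net/example |title=Example page |archive-url=https://web.archive.org/web/20080101000000/http://odiseos.net/example |archive-date=2008-01-01 |url-status=usurped}}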

nytimes.com/movies/person

Links to https://www.nytimes.com/movies/person/* are dead and return a soft 404, thus are not picked up by archive bots. There are about 1,300 articles with links in https and about 150 in http. The URLs are to The New York Times, but the content is licensed from All Movie Guide, thus if in a CS1|2 citation it would convert to |work=All Movie Guide and |via=The New York Times. In addition, an archive URL would be added if available, otherwise the link marked dead. For extra credit, it could try to determine the date and author by scraping the archive page. Example. -- GreenC 18:00, 6 April 2021 (UTC)
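For illustration (person ID, title, and archive snapshot are made up), a converted cite would change roughly like this:

  Before: {{cite web |url=https://www.nytimes.com/movies/person/123456 |title=Example Person |work=The New York Times}}
  After:  {{cite web |url=https://www.nytimes.com/movies/person/123456 |title=Example Person |work=All Movie Guide |via=The New York Times |archive-url=https://web.archive.org/web/20150101000000/https://www.nytimes.com/movies/person/123456 |archive-date=2015-01-01 |url-status=dead}}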

Results

  • Edited 11,160 articles
  • Added 14,871 new archive URLs
  • Changed metadata in 12,855 citations (eg. |work=)
  • Toggled 704 existing archives from |url-status=live to |url-status=dead
  • Added 208 {{dead link}}
  • Various other general fixes

-- GreenC 00:25, 15 April 2021 (UTC)

penelope.uchicago.edu

There are several thousand "http" links on WP to many different pages of my site (whose homepage is http://penelope.uchicago.edu/Thayer/E/home.html) which really should be httpS. The site is secure with valid certificates, etc. Is this something a bot can take care of quickly?

24.136.4.218 (talk) 19:20, 11 February 2021 (UTC)

In general, I think a bot could replace http with https for all webpages, after some checks. (The guidelines prefer https over http: WP:External links#Specifying_protocols.) My naive idea is to create a bot that goes through http links and checks whether they are also valid with https. If they are, the bot can replace the http link with the https link. Apart from the question of whether there is a general problem with the idea, a few questions remain:
  1. Should the bot replace all links or only major ones (official webpage, info boxes, ...)?
  2. Should the bot only check if https works or if http and https provide the same page?
I would be happy to hear what others think about the idea. Nuretok (talk) 11:43, 5 April 2021 (UTC)
One argument against this is that many websites implement an http -> https redirect. Thus if one accesses the link with http, it will be redirected to https. In this case, it would not matter which protocol the link uses on WP; the user would always end up on https. Even the example cited above is redirected. -- Srihari Thalla (talk) 19:09, 8 April 2021 (UTC)
You are right, many websites forward http to https, but this still allows a man-in-the-middle attack if someone prevents the redirect. This is one of the reasons the Wikipedia guidelines recommend using https, and why browser plugins such as HTTPS Everywhere exist. Of course everyone is free to use https everywhere, but providing good defaults (https in this case) is usually considered good practice. By the way, instead of checking each site individually, there is a list of servers that support https which the bot could check before moving from http to https. Nuretok (talk) 08:20, 17 April 2021 (UTC)
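A minimal sketch of the check described in this thread, assuming Python and the requests library (the function name and timeout are made up; a production bot would also need rate limiting, soft-404 detection, and a fuzzier content comparison than strict equality):

  import requests

  def https_equivalent(url: str, compare_content: bool = False) -> bool:
      # Return True if an http:// URL appears safe to upgrade to https://.
      if not url.startswith("http://"):
          return False
      secure = "https://" + url[len("http://"):]
      try:
          r = requests.get(secure, timeout=10)
      except requests.RequestException:
          return False                  # https not reachable; leave the link alone
      if r.status_code != 200:
          return False
      if compare_content:               # question 2 above: same page on both protocols?
          try:
              original = requests.get(url, timeout=10)
          except requests.RequestException:
              return True               # http side is dead but https works
          return r.text == original.text
      return True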

articles.timesofindia.indiatimes.com

Several years ago all the content on this subdomain was moved to timesofindia.indiatimes.com. However, the links are not the same, there are no redirects, and the new URLs cannot be reconstructed or guessed by any algorithm. One has to search Google for the title of the link on the former domain and update the link with the new domain.

LinkSearch

Old URL - http://articles.timesofindia.indiatimes.com/2001-06-28/pune/27238747_1_lagaan-gadar-ticket-sales (archived)

New URL - https://timesofindia.indiatimes.com/city/pune/film-hungry-fans-lap-up-gadar-lagaan-fare/articleshow/1796672357.cms

Is there a possibility for a WP:SEMIAUTOMATED bot that takes the new URL as input from the user and updates WP? Is there an existing bot? If not: I created a small semi-automated script (here) to assist me with the same functionality. Do I need to get approval for this script, if it is even considered a bot? -- Srihari Thalla (talk) 19:20, 8 April 2021 (UTC)

Are you seeing problems with content drift (content at the new page differing from the old)? You'll need to handle existing |archive-url=, |archive-date= and |url-status=, since you can't change |url= and not |archive-url=, and a changed |archive-url= has to be verified as working. There is {{webarchive}}, which sometimes follows bare and square links and might need to be removed or changed. The |url-status= should be updated from dead to live. There are {{dead link}} templates that might need to be added or removed. You should verify the new URL is working, not assume it is; and if there are redirects in the headers, capture those and change the URL to reflect them. Those are the basics for this kind of work; it is not easy. Keep in mind there are 3 basic types of cites: those within a cite template, those in a square link, and those bare. Of those three types, the square and bare may have a trailing {{webarchive}}. All types may have a trailing {{dead link}} (see the illustrative examples below).
OR, my bot is done and can do all this. All that would be needed is a map of old and new URLs. There are as many as 20,000 URLs; do you propose manually searching for each one? Perhaps better to leave them unchanged and add archive URLs. For those that have no archive URL (ie. {{dead link}}), manually search for those to start. I could generate a list of those URLs with {{dead link}} while making sure everything else is archived. -- GreenC 20:24, 8 April 2021 (UTC)
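For reference, the three cite shapes described above look roughly like this (URLs, titles and dates are made up):

  {{cite web |url=http://articles.timesofindia.indiatimes.com/example.cms |title=Example |archive-url=https://web.archive.org/web/20130101000000/http://articles.timesofindia.indiatimes.com/example.cms |archive-date=2013-01-01 |url-status=dead}}
  [http://articles.timesofindia.indiatimes.com/example.cms Example] {{webarchive |url=https://web.archive.org/web/20130101000000/http://articles.timesofindia.indiatimes.com/example.cms |date=1 January 2013}}
  http://articles.timesofindia.indiatimes.com/example.cms {{dead link |date=April 2021}}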
If you already have the bot ready, then we can start with those that have no archive URL. If you could generate the list, I could also post on WP:INDIA asking for volunteers.
I would suggest doing this work using a semi-automated script, ie. the script would read the page with the list, parse each row, and print it on the terminal (all available details of the link: full cite, link title, etc.) so that it is easy for the user to search; once the new URL is found, the script takes it as input and saves it to the page. Do you think this would be faster and more convenient?
I would also suggest forming the list with the columns: serial number, link, cite/bare/square link, title (if possible), new url, new url status, new archive url, new archive url date. The last "new" columns would be left blank, to be filled in once researched. Do these columns look good?
Do you have a link to your bot? -- DaxServer (talk) 07:45, 9 April 2021 (UTC)
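A minimal sketch of that loop, assuming Python and a local tab-separated worklist with the columns suggested above (the file name and column order are assumptions, and saving the result back to the wiki page is left out):

  import csv

  # Read the worklist: serial, link, type (cite/bare/square), title, [new url, ...]
  with open("dead_links.tsv", newline="", encoding="utf-8") as f:
      rows = list(csv.reader(f, delimiter="\t"))

  for row in rows:
      serial, link, kind, title = row[:4]
      print(f"[{serial}] ({kind}) {title}\n  old: {link}")
      new_url = input("new URL (blank to skip): ").strip()
      if new_url:
          row[4:5] = [new_url]  # fill (or append) the "new url" column

  # Write the updated worklist back out.
  with open("dead_links.tsv", "w", newline="", encoding="utf-8") as f:
      csv.writer(f, delimiter="\t").writerows(rows)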
How about I provide you with as much data as possible in a regular parsable format? I'd prefer not to create the final table, as that should be done by the author of the semi-automated script based on its requirements and location. Does that sound OK? The bot page is User:GreenC/WaybackMedic_2.5, however it is 3 years out of date, as is the GitHub repo; there have been many changes since 2018. The main bot is nearly 20k lines, but each URLREQ move request has its own custom module that is smaller. I can post an example skeleton module if you are interested; it is in Nim (programming language), which is similar to Python in syntax. -- GreenC 18:24, 9 April 2021 (UTC)
Data in a parsable format is a good way to start. Based on this, a suitable workflow can be established over time. The final table can be done later, as you said.
I unfortunately had never heard of Nim. I know a little bit of Python and could have a look at Nim, but I do not have any time until mid-May. Would this be an example of a module: citeaddl? But this is Medic 2.1 and not 2.5. Perhaps you could share the example. If it looks like something I can handle without much of a learning curve, I would be able to work out something. If not, I would have to wait until the end of May and then evaluate again! -- DaxServer (talk) 20:24, 9 April 2021 (UTC)

User:GreenC/software/urlchanger-skeleton-easy.nim is a generic skeleton source file, to give a sense of what is involved. It only needs some variables modified at the top defining the old and new domains. There is a "hard" skeleton for more custom needs, where mods are done throughout the file, for when the easy version is not enough. The file is part of the main bot, isolating domain-specific changes to this one file. I'll start on the above; it will probably take a few days, depending on how many URLs are found. -- GreenC 01:42, 11 April 2021 (UTC)

@DaxServer: The bot is finished. Cites with {{dead link}} (about 150) are recorded at Wikipedia:Link rot/cases/Times of India (raw). -- GreenC 20:57, 16 April 2021 (UTC)

Good to hear! Thanks @GreenC -- DaxServer (talk) 11:16, 17 April 2021 (UTC)

Results

  • Edited 9,509 articles
  • Added 15,269 new archive URLs
  • Toggled 1,167 |url-status=live to |url-status=dead
  • Added about 100 {{dead link}}
  • Changed metadata in 11,941 cites (eg. normalized |work=, removed "Times of India" from |title=)

Migrate old URLs of "thehindu.com"

Old URLs from sometime before 2010 have a different URL structure. The content was moved to new URLs, but a direct redirect is not available: the old URL redirects to a list page, categorized by the date the article was published. One has to search for the title of the article there and follow the link. Surprisingly, some archived URLs I tested redirected to the new archived URL. My guess is that the redirection worked in the past but was broken at some point.

Old URL - http://hindu.com/2001/09/06/stories/0406201n.htm (archived in 2020 - automatically redirected to the new archived url; old archive from 2013)

Redirected to list page - https://www.thehindu.com/archive/print/2001/09/06/

Title - IT giant bowled over by Naidu

New URL from the list page - https://www.thehindu.com/todays-paper/tp-miscellaneous/tp-others/it-giant-bowled-over-by-naidu/article27975551.ece

There is no content drift between the old URL (2013 archive) and the new URL.

Example from N. Chandrababu Naidu - PS: this citation is used twice (as found by searching the title), once with the old URL and once with the new URL. -- DaxServer (talk) 14:18, 9 April 2021 (UTC)

The new URL [1] is behind a paywall and unreadable, while the archive of the old URL [2] is fully readable. I think it would be preferable to maintain archives of the old URLs, since they are not paywalled and there would be no content drift concern. Perhaps, similar to the above, attempt to migrate when a soft 404 redirects to a list page and no archive is available. -- GreenC 17:37, 9 April 2021 (UTC)
In that case, perhaps WaybackMedic or the IA bot can add archive URLs to all these links? If you want to be more specific, here is the regex of the URLs that I have found so far. There can be others that I have not encountered yet.
https?\:\/\/(www\.)?(the)?hindu\.com\/(thehindu\/(fr|yw|mp|pp|mag)\/)?\d{4}\/[01]\d\/[0-3][0-9]\/stories\/[0-9a-z]+\.htm
-- DaxServer (talk) 20:39, 9 April 2021 (UTC)
Can you verify the regex? I don't think it would match the above "Old URL" in the segment \d{4}\/[01]\d\/[0-3][0-9]\/ .. maybe it is a different URL variation? -- GreenC 21:52, 9 April 2021 (UTC)
It matches. I checked it on regex101 and also on the Python CLI. In any case, here is a simpler regex:
https?\:\/\/(www\.)?(the)?hindu\.com\/(thehindu\/(fr|yw|mp|pp|mag)\/)?\d{4}\/\d{2}\/\d{2}\/stories\/[0-9a-z]+\.htm -- DaxServer (talk) 12:02, 10 April 2021 (UTC)
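For reference, a minimal check of the simpler regex against the "Old URL" at the top of this thread, the kind of test one would run on the Python CLI (the script is illustrative, not part of any bot):

  import re

  # Simpler regex from the comment above, split across lines for readability.
  pattern = (r"https?\:\/\/(www\.)?(the)?hindu\.com\/"
             r"(thehindu\/(fr|yw|mp|pp|mag)\/)?"
             r"\d{4}\/\d{2}\/\d{2}\/stories\/[0-9a-z]+\.htm")

  # "Old URL" from the example above.
  url = "http://hindu.com/2001/09/06/stories/0406201n.htm"

  print(bool(re.match(pattern, url)))  # prints: True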
Ahh got it sorry misread thanks. -- GreenC 13:33, 10 April 2021 (UTC)
Regex modified to work with Elasticsearch insource: and some additional matches. 12,229
insource:/\/{2}(www[.])?(the)?hindu[.]com\/(thehindu\/)?((cp|edu|fr|lf|lr|mag|mp|ms|op|pp|seta|yw)\/)?[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/stories\/[^.]+[.]html?/
-- GreenC 04:27, 17 April 2021 (UTC)

DaxServer, the Hindu is done. Dead link list: Wikipedia:Link rot/cases/The Hindu (raw). -- GreenC 13:24, 23 April 2021 (UTC)

Great work @GreenC !! -- DaxServer (talk) 16:58, 23 April 2021 (UTC)

Results

  • Edited 11,985 articles
  • Added 15,954 new archive URLs
  • Toggled 2,412 |url-status=live to |url-status=dead
  • Added 1,244 {{dead link}}
  • Changed metadata in 12,234 cites (eg. normalized |work=, removed "The Hindu" from |title=, etc.)
  • Updated the IABot database; each link individually set blacklisted

sify.com

Any link that redirects to the home page. Example. Example. -- GreenC 14:27, 17 April 2021 (UTC)

Results

  • Added 4,132 new archive links (Example)
  • Added or modified 1,149 |url-status=dead (Example)
  • Set links "Blacklisted" in the IABot database

*.in.com

Everything is dead. Some links redirect to a new domain homepage unrelated to the previous site. Some have 2-level-deep sub-domains. All are now set to "Blacklisted" in IABot for global wiki use; a Medic pass on enwiki will also help. -- GreenC 04:13, 25 April 2021 (UTC)

Results

  • Edited 3,803 articles
  • Added 3,863 new archive URLs
  • Changed/added 732 |url-status=dead to existing archive URLs
  • Added 104 {{dead link}}
  • Set individual links "Blacklisted" in IABot database