User:GreenC/software/linkrot

From Wikipedia, the free encyclopedia

How to fix link rot - on Wikipedia and elsewhere.

The 3 results

Given a URL, there are only 3 basic results:

  1. Convert to an archive URL: http://example.com --> https://web.archive.org/web/20240601/http://example.com
  2. Move to a new URL: http://example.com --> http://example-new.com
  3. Do nothing; leave it alone: http://example.com
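
The three results can be modelled as one small function. This is an illustrative sketch, not code from any existing tool: the name `final_url`, the use of `None` to mean "the page is dead", and the fixed `20240601` timestamp are assumptions. A real archive URL would carry the timestamp of an actual snapshot.

```python
from typing import Optional

ARCHIVE_PREFIX = "https://web.archive.org/web"


def final_url(origurl: str, live_url: Optional[str], timestamp: str = "20240601") -> str:
    """Return the URL to use in place of origurl.

    live_url is the working URL found for the page, or None if the page is dead.
    """
    if live_url is None:
        # Result 1: convert to an archive URL.
        return f"{ARCHIVE_PREFIX}/{timestamp}/{origurl}"
    if live_url != origurl:
        # Result 2: move to the new URL.
        return live_url
    # Result 3: do nothing, leave it alone.
    return origurl
```

Everything else on this page is about working out what to pass as `live_url`.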

The 3 factors

When deciding among the 3 results, there are 3 factors to consider:

  • Redirects - A redirect is when the server answers a request for one URL by sending the client to a different URL.
  • Soft-404s - A soft-404 is any page that contains content different from the desired content. Typically the URL redirects to a home page instead of returning a 404.
  • Soft-redirects - A soft-redirect is when the page is live at a different URL, but there is no active redirect to the new URL.

To properly determine which of the 3 results to choose, the 3 factors need to be known ahead of time. This foreknowledge may come from other editors who inform you that a URL has moved. Or it may come through discovery, by looking at logs to see where URLs redirect to, and interpreting that information. It's a process to learn the information, codify it, and upload the results.
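
Discovery starts by asking the server what it does with a URL, without following the redirect. Below is a minimal sketch using Python's standard `http.client`. The name `networkcheck` is borrowed from the pseudo-code further down this page; `redirect_target`, the HEAD request, and the 10-second timeout are assumptions. Some servers answer HEAD differently from GET, so treat the result as a hint rather than proof.

```python
import http.client
from typing import Optional, Tuple
from urllib.parse import urljoin, urlsplit


def redirect_target(url: str, status: int, location: Optional[str]) -> Optional[str]:
    """Return the absolute redirect URL, or None if the response is not a redirect."""
    if 300 <= status < 400 and location:
        # The Location header may be relative, so resolve it against the requested URL.
        return urljoin(url, location)
    return None


def networkcheck(url: str, timeout: float = 10.0) -> Tuple[int, Optional[str]]:
    """Request url once, without following redirects; return (status, newloc)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        newloc = redirect_target(url, response.status, response.getheader("Location"))
        return response.status, newloc
    finally:
        conn.close()
```

Logging the returned pairs for every URL in a domain gives the raw material for spotting soft-404s. Soft-redirects cannot be found this way, because by definition the server gives no hint of the new location.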

Process

Process to decide the 3 results

  1. Codify any pre-known soft-redirects. These are hard-coded rules based on foreknowledge. For example, transform http://example.com to http://example-new.com. We'll call the transformed URL the "newurl".
  2. Check newurl for redirects. We'll call the redirect target the "newloc" URL, i.e. the "new location" URL.
  3. Make a two-column table: newurl <tab> newloc
  4. Analyze the table, looking for repeating instances of the same newloc in the second column. These indicate probable soft-404s.
  5. Add new rules (code) to account for the soft-404s.
  6. Re-process the links with the soft-404 rules in place.
  7. Check every URL and redirect URL for status 200 or 404.
  • If 404, return an archive URL, i.e. result #1.
  • If 200, return the newurl, i.e. result #2 if newurl differs from the original URL, or result #3 if it is unchanged.

Example code

The following pseudo-code demonstrates the steps:


origurl = "http://example.com"
newurl = sub("example.com", "example-new.com", origurl)   # Step 1 - codify known soft-redirects

(status, newloc) = networkcheck(newurl)  # Step 2 - check newurl for redirects
                                         # status = 200, 404, etc.
                                         # newloc = redirect URL, empty if none

if newloc then
  print newurl "\t" newloc > table.txt     # Step 3 - make a two-column table
endif
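
The same three steps can be written as runnable Python. This is a sketch under assumptions: `apply_rules`, `build_table`, and the `check` parameter are hypothetical names, and `check` stands in for `networkcheck` so that the logic can be exercised without network access.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Check = Callable[[str], Tuple[int, Optional[str]]]  # url -> (status, newloc)
Rule = Tuple[str, str]                              # (old text, new text)


def apply_rules(url: str, rules: Iterable[Rule]) -> str:
    """Step 1: apply each known soft-redirect rule as a one-time text substitution."""
    for old, new in rules:
        url = url.replace(old, new, 1)
    return url


def build_table(urls: Iterable[str], rules: Iterable[Rule], check: Check) -> List[str]:
    """Steps 2 and 3: return 'newurl<TAB>newloc' rows for every URL that redirects."""
    rules = list(rules)  # allow the rules to be reused for every URL
    rows = []
    for origurl in urls:
        newurl = apply_rules(origurl, rules)
        _status, newloc = check(newurl)
        if newloc:
            rows.append(f"{newurl}\t{newloc}")
    return rows
```

Writing the rows out with `"\n".join(rows)` gives the same kind of file as table.txt in the pseudo-code.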

At this point we follow Step 4 and look at the table, which might look something like this:

http://example.com/page1.htm  https://example.com
http://example.com/page2.htm  https://example.com
http://example.com/page3.htm  https://example.com/page3.htm
http://example.com/page4.htm  https://example.com
http://example.com/page5.htm  https://example.com/page5.htm

Here we see page1, page2, and page4 redirect to the home page. The others redirect to "https". So we have learned two new rules:

  • All URLs in this domain redirect to https, so the change can be codified as a known rule.
  • Any URL that redirects to https://example.com (the home page) is a soft-404.
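
For a large table, the repeated targets can be counted rather than spotted by eye. A minimal sketch follows; the name `soft404_candidates` and the `min_count` threshold of 2 are assumptions. A repeated newloc is only a probable soft-404, so each candidate still needs a human look before it becomes a rule.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def soft404_candidates(rows: Iterable[str], min_count: int = 2) -> List[Tuple[str, int]]:
    """Return (newloc, count) for every newloc shared by at least min_count rows."""
    counts: Counter = Counter()
    for row in rows:
        fields = row.split()
        if len(fields) == 2:  # skip blank or malformed lines
            counts[fields[1]] += 1
    return [(loc, n) for loc, n in counts.most_common() if n >= min_count]


table = [
    "http://example.com/page1.htm\thttps://example.com",
    "http://example.com/page2.htm\thttps://example.com",
    "http://example.com/page3.htm\thttps://example.com/page3.htm",
    "http://example.com/page4.htm\thttps://example.com",
    "http://example.com/page5.htm\thttps://example.com/page5.htm",
]
print(soft404_candidates(table))  # [('https://example.com', 3)]
```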

So we modify the code as follows:


origurl = "http://example.com"
newurl = sub("example.com", "example-new.com", origurl)   # Step 1 - known soft-redirect
newurl = sub("http:", "https:", newurl)                   # Step 1 - known soft-redirect

(status, newloc) = networkcheck(newurl)    # Step 2 - check newurl for redirects
                                           # status = 200, 404, etc.
                                           # newloc = redirect URL, empty if none

if newloc then
  if newloc == "https://example.com" then  # Step 5 - soft-404
    return "404"
  endif
  newurl = newloc                          # follow the redirect
  (status, newloc) = networkcheck(newurl)  # Step 7 - check the redirect URL
endif

if status == 200 then
  return newurl
else
  return "404"
endif
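
A runnable Python version of the same logic might look like the sketch below. The names `resolve`, `rules`, `soft404s`, and `check` are hypothetical, and `None` plays the role of the pseudo-code's "404" return value. As assumptions for the usage example, `check` is a stand-in for `networkcheck` backed by a small dictionary, and the soft-404 target is taken to be the home page of the rewritten domain.

```python
from typing import Callable, Collection, Iterable, Optional, Tuple

Check = Callable[[str], Tuple[int, Optional[str]]]  # url -> (status, newloc)


def resolve(
    origurl: str,
    rules: Iterable[Tuple[str, str]],
    soft404s: Collection[str],
    check: Check,
) -> Optional[str]:
    """Return the live URL for origurl, or None if it should be treated as dead."""
    newurl = origurl
    for old, new in rules:               # Step 1 - known soft-redirects
        newurl = newurl.replace(old, new, 1)

    status, newloc = check(newurl)       # Step 2 - check newurl for redirects
    if newloc:
        if newloc in soft404s:           # Step 5 - soft-404
            return None
        newurl = newloc                  # follow the redirect once
        status, newloc = check(newurl)   # Step 7 - check the redirect URL

    return newurl if status == 200 else None


# Hypothetical server behaviour, for illustration only.
pages = {
    "https://example-new.com/page1.htm": (301, "https://example-new.com"),
    "https://example-new.com/page3.htm": (200, None),
}
rules = [("example.com", "example-new.com"), ("http:", "https:")]
soft404s = {"https://example-new.com"}


def fake_check(url: str) -> Tuple[int, Optional[str]]:
    return pages.get(url, (404, None))
```

Keeping the rules and soft-404 targets as data rather than hard-coded comparisons means a newly discovered rule is added to a list, not to the control flow.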

Thus, once every "404" result is converted to an archive URL (result #1), the above code produces:

http://example.com/page1.htm --> https://web.archive.org/web/20240601/http://example.com/page1.htm
http://example.com/page2.htm --> https://web.archive.org/web/20240601/http://example.com/page2.htm
http://example.com/page3.htm --> https://example-new.com/page3.htm
http://example.com/page4.htm --> https://web.archive.org/web/20240601/http://example.com/page4.htm
http://example.com/page5.htm --> https://example-new.com/page5.htm