How to link to (Archive) a web page that might change or vanish

Mick West

Sometimes you want to link to a page that might change or vanish after a while, particularly if it contains outrageous bunk or libelous material. Or it might just be something that is by its nature temporary, like a job posting or a service request. If you want to make sure you can reference it in the future, a great free way of doing this is archive.is, a service that will give you both a linked snapshot of the page at that moment in time and a .zip version you can use offline or attach to a post:

Just copy the URL of the page, then go to:

http://archive.today/ (formerly archive.is, as seen in images below. Currently both URLs work)

And enter the URL (or use their bookmarklet)


It will do some magic, and you will get a page you can link to that should last for years, with:
  • Short URL
  • The date and time of the capture (UTC)
  • A link to "download zip"


For permanence, click on "download zip", and then attach the file to the post. Like I have done below.

Then when you use the link, put the original link in first, with the archive link after it, like:
https://www.cia.gov/careers/opportunities/clandestine/ncs-language-officer.html (http://archive.is/sxL1g)
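If you'd rather script the submission than use the web form, something along these lines should work. This is only a rough sketch, not an official API: I'm assuming archive.today's /submit/ endpoint still accepts a plain form POST with a "url" field, and that the snapshot address comes back in a Refresh header or as the final URL. That could change, and the site sometimes throws up a CAPTCHA.

```python
import requests

def archive_page(url):
    """Submit a URL to archive.today and return the snapshot URL (best effort)."""
    # Assumption: a plain form POST to /submit/ with a "url" field starts a capture.
    resp = requests.post("https://archive.today/submit/",
                         data={"url": url}, timeout=120)
    resp.raise_for_status()
    # The snapshot address usually arrives in a "Refresh: 0;url=..." header;
    # otherwise fall back to wherever we ended up after redirects.
    refresh = resp.headers.get("Refresh", "")
    if "url=" in refresh:
        return refresh.split("url=", 1)[1]
    return resp.url

print(archive_page("https://www.cia.gov/careers/opportunities/clandestine/ncs-language-officer.html"))
```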
 

Attachments

  • sxL1g.zip
    409.6 KB
Using the bookmark bar button is by far the easiest way of using archive.is; just drag it into the bar:

How to install the bookmarklet.


Then just click on it to archive the current page.
 
I have added a new function, "archive.is", which shows up as a button under any post you can edit.



Pressing it will archive the links externally via archive.is and add links to those archives: either as a plain URL next to a plain URL, or with a caret symbol (^) for inline links.



I encourage you to use this whenever you post a link to something that might vanish.

If it detects you already have an archive.is link in the post, then it will skip all links. So if you add links after the first archiving, you should either remove the old archive links and re-run, or create the archive links manually, as above.

Example:
http://xenforo.com (http://archive.is/Sbihm)

Direct link^
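For the curious, what the button does is conceptually simple: find the bare links in the post, archive each one, and append the archive link after it, skipping YouTube, and skipping the whole post if it already contains archive links. The real add-on is part of the forum software, so this is just an illustrative Python sketch of the idea, not the actual code; it only handles the plain-URL case (not the inline ^ links), and archive_page() here is a stand-in for the submission step sketched earlier in the thread.

```python
import re
import requests

URL_RE = re.compile(r'https?://\S+')

def archive_page(url):
    # Stand-in for the archive.today submission step shown earlier in the thread.
    resp = requests.post("https://archive.today/submit/", data={"url": url}, timeout=120)
    return resp.headers.get("Refresh", "").split("url=", 1)[-1] or resp.url

def add_archive_links(post_text):
    """Append an archive link in brackets after each plain URL in a post."""
    # Mirror the behaviour described above: if the post already contains
    # archive links, skip everything rather than double-archiving.
    if "archive.is" in post_text or "archive.today" in post_text:
        return post_text

    def replace(match):
        original = match.group(0)
        # YouTube videos are skipped (see the note further down the thread).
        if "youtube.com" in original or "youtu.be" in original:
            return original
        return f"{original} ({archive_page(original)})"

    return URL_RE.sub(replace, post_text)
```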
 
I've been reading through older material on the site over the last couple of weeks and noticed a fair number of dead YouTube links. Does this archive the actual YouTube video as well? (If that's what's linked.)
 

No. In fact it will just skip over YouTube videos.

Videos are very large, and it's not economical to cache them. The only sensible way would be to download the video and then re-upload it to YouTube (or Vimeo, etc.).
 
How To Save URLs To The Wayback Machine On Demand


The Wayback Machine: Your Own Web Archiver
Basically, simply cut and paste the URL of a web page or PDF and the Wayback crawler will archive and index the material and provide you with a direct url to it in real-time.

You’ll find a box to paste the URL into on the Wayback homepage. It’s labeled “Save Page Now.”



Once the crawling and indexing is complete, a URL to the archived copy will either be provided in a pop-up box or — if archiving a PDF file — it will be found in the location bar.



There is no cost to use this feature, and with it you can be assured the page/PDF you saw is available at a later date. At the same time, you've also helped make the Wayback Machine more comprehensive for all users.
Content from External Source
 
I've just added code to also submit all new links to archive.org, although increasingly it seems like they are crawling the web so fast that they get most new stuff on existing sites within a day or two. So it is most useful for new sites that crop up and are not yet being indexed.
 
Just wondering: do archived pages show up in Google searches?

Mostly I'm wondering whether, if a 'hoax bunk' page is taken down, the information disappears from the internet unless it happened to be crawled. But if we archive it... you can't unarchive something, right?
 

You can't unarchive an archive.is/archive.today page, AFAIK; however, you CAN unarchive an archive.org page.

For something that is very important, I'd recommend downloading a local copy of the page.
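If you want to do that "local copy" step from a script rather than with the browser's Save As, the bare minimum looks like this. Note it only saves the raw HTML; images, CSS, and scripts would need extra work, which is why a full "save complete page" from the browser, or a screenshot, is often the more practical option. The filename here is just an example.

```python
import requests

def save_local_copy(url, filename):
    """Save the raw HTML of a page to a local file."""
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    with open(filename, "w", encoding=resp.encoding or "utf-8") as f:
        f.write(resp.text)

save_local_copy("http://xenforo.com", "xenforo_home.html")
```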
 
I used to work for a very large website, and we found the Wayback Machine had big gaps in our material. There's only so much they can crawl, since they don't have Google's resources. We also had some private pages, known to many users but deliberately blocked to all crawlers. We'd periodically place the more long-term important documents from there into WBM by Critical Thinker's method, to save them from our own company's overzealous pruning of "obsolete" content.
 
We'd periodically place the more long-term important documents from there into WBM by Critical Thinker's method

The code I added essentially uses this same method; it just requests the URL programmatically:

http://web.archive.org/save/{the url you want to save}

And archive.org should grab the latest version of the page.
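So, for example, a script can trigger a capture and pick up the archived copy's address in one go. A minimal sketch; I'm assuming the save request ends up redirecting to the /web/<timestamp>/<url> snapshot once the crawl finishes, which is how it behaves at the moment:

```python
import requests

def save_to_wayback(url):
    """Ask the Wayback Machine to capture a URL and return the snapshot's address."""
    resp = requests.get("http://web.archive.org/save/" + url, timeout=120)
    resp.raise_for_status()
    # After the capture, the final URL is normally the archived snapshot itself.
    return resp.url

print(save_to_wayback("http://xenforo.com"))
```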
 
Just wondering: do archived pages show up in Google searches?

Google re-crawls the same things on a regular basis, on a schedule based on their judgement of the importance, interest, and freshness of the content. When they find something they've seen before, they make their own cached copy of it. When they have done this, there is a tiny down arrow at the end of the green URL under the search result link, which lets you access the cached copy.

If a page is not available when they go to crawl it again, and it is a "soft" error (a 300 or 500 series error code), they will keep trying. If it is a "hard" error code (the 400 series, 404 being the best known), they assume the content is gone forever and remove the search result link for it, since it is bad for their business to serve dead links. That of course means there is nothing left to hang their cached copy onto.

Where do they get dead links to try and crawl? From other sites that have linked to them, like this one. They will follow and try to index every link they see here, then treat any they find dead as above. If you post two links here, the original and your WBM version, they will try to index both.

Google breathes data, so I suspect they don't actually discard that cache, but there's no way for us to get hold of it. They will continue to see and display URLs from the Wayback Machine, but those are WBM URLs, not the same as the originals. So anything you put in there is available forever; that is what WBM is all about.
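As an aside, if you ever want to check which links in an old thread have gone "hard" dead rather than just being temporarily unreachable, the same soft/hard distinction is easy to test yourself. A rough sketch (the second URL is just a made-up example of a dead link):

```python
import requests

def classify_link(url):
    """Roughly classify a link the way a crawler would: ok, soft error, or hard (gone)."""
    try:
        status = requests.head(url, allow_redirects=True, timeout=30).status_code
    except requests.RequestException:
        return "unreachable"
    if status < 400:
        return "ok"
    if status < 500:
        return "hard error (400 series: probably gone for good)"
    return "soft error (500 series: worth retrying later)"

for link in ["http://xenforo.com", "http://example.com/no-such-page"]:
    print(link, "->", classify_link(link))
```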

...

Re unarchiving, it's probably OK if you, the person who submitted something, can later remove it. It defeats the purpose if they let the original owner get it removed (although they'd have to if they get a legal take-down notice, I guess). But as you say, something terribly important is best "saved as HTML" to your own computer, from where you can get it back onto the internet (or to other interested parties) again if necessary. I would not consider Google Docs, GitHub, etc. safe places for such file sharing.
 
I think it's a good idea, at least with Facebook, to check the archive. When I first archived this one from the MET Office page, https://archive.today/P1vM4, it only showed Ian's opening comment, even though I had all of the comments expanded. Several days later, when Ian said he saw no encyclopedia entry (I assume he has that person blocked), I archived it a second time, https://archive.today/fpFvX; this time it showed all of the replies, and apparently Ian was able to see the photos from the encyclopedia.
 

Yes, Facebook can be a bit fiddly for the archiver, as it has dynamic content, and it varies based on who is viewing it, and probably other factors. The archive.is robot uses a dummy user, who is logged in but would not be a member of any of the groups, so it might not be able to get everything.

If it's important, take a screenshot.
 
Another quirk with Facebook is that it only grabs the last 50 posts made, so archiving frequently in long threads is key.
 