Archiving Web links: Building global layers of caches and mirrors

Élan Vital 2016-06-13

The Web is highly distributed and in flux; the people using it, even more so.

There are many projects to optimize its use:

  • Reducing storage and bandwidth:  compressing parts of the web; deduplicating files that exist in many places, replacing the duplicates with pointers to a single copy of the file [Many browsers & servers, *Box]
  • Reducing latency and long-distance bandwidth:  caching popular parts of the web locally around the world [CDNs, clouds, &c]
  • Increasing robustness & permanence of links: caching linked pages (with timestamps or snapshots, for dynamic pages) [Memento, Wayback Machine, perma, amber]
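
As a rough illustration of the deduplication idea in the first bullet, here is a minimal Python sketch. It assumes a toy in-memory store keyed by a SHA-256 digest of the content; the `DedupStore` class and the example URLs are hypothetical and not taken from any of the projects named above.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self):
        self.blobs = {}     # digest -> content bytes (one copy per distinct file)
        self.pointers = {}  # url -> digest (many URLs may share one digest)

    def put(self, url, content):
        digest = hashlib.sha256(content).hexdigest()
        # Keep the first copy; later duplicates only add a pointer.
        self.blobs.setdefault(digest, content)
        self.pointers[url] = digest
        return digest

    def get(self, url):
        return self.blobs[self.pointers[url]]


store = DedupStore()
store.put("http://a.example/logo.png", b"same bytes")
store.put("http://b.example/logo.png", b"same bytes")
```

Both URLs now resolve to the same stored blob, so the second copy costs only a pointer.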


What would a really robustly backed-up Web of links look like?  What structures and patterns would it use to keep links accessible for as long as possible?  Some ideas from a discussion today:

  1. Every site should have a coat of amber: a locally cached snapshot of every page linked from that site, stored on its own host or on a local amber supernode.  So as long as that site is available, snapshots of what it links to are, too.
    • We can comprehensively track whether sites have signalled they have an amber layer.  If a site isn’t yet caching what it links to, readers can encourage it to do so or connect it to a supernode.
    • Libraries should host amber supernodes: caches for sites that can’t host those snapshots on their host machine.
  2. Significant citations and references should include timestamped permalinks.
    • When creating a reference, sites should notify each of the major cache-networks, asking them to store a copy.
    • Permalinks should use an identifier system that allows (round-robin?) searching for the page across any of the nodes of that network, and across the different cache-networks.
  3. Snapshots of entire websites should be archived regularly.
    • Both public snapshots for search engines and private ones for long-term archives.
  4. A global network of mirrors (à la LOCKSS) should maintain copies of permalink and snapshot databases.
    • Consortia of libraries, archives, and publishers should commit to a broad geographic distribution of mirrors.
      • Mirrors should be available within any country that has expensive interconnects with the rest of the world.
      • Prioritization should lead to a kernel of the cached web that is stored in ‘seed bank’-style archives, in the most secure vaults and other venues.
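
To make idea 2 a little more concrete, here is a minimal Python sketch of a timestamped permalink that can be looked up round-robin across the nodes of several cache-networks. Everything in it is an assumption for illustration: the identifier scheme (a truncated SHA-256 of timestamp plus URL), the `CacheNode` class, and the network and node names are hypothetical, and real systems such as Memento or perma define their own identifiers and protocols.

```python
import hashlib


def permalink_id(url, timestamp):
    """Hypothetical identifier: a short hash of the capture time and the URL."""
    key = f"{timestamp} {url}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


class CacheNode:
    """Toy stand-in for one node of a cache-network."""

    def __init__(self, name):
        self.name = name
        self.snapshots = {}  # permalink id -> snapshot bytes

    def store(self, url, timestamp, content):
        pid = permalink_id(url, timestamp)
        self.snapshots[pid] = content
        return pid

    def lookup(self, pid):
        return self.snapshots.get(pid)


def round_robin(networks):
    """Yield nodes one network at a time: first node of each, then second, ..."""
    iterators = [iter(nodes) for nodes in networks.values()]
    while iterators:
        still_active = []
        for it in iterators:
            node = next(it, None)
            if node is not None:
                yield node
                still_active.append(it)
        iterators = still_active


def resolve(pid, networks):
    """Return (node name, snapshot) from the first node holding pid, else None."""
    for node in round_robin(networks):
        content = node.lookup(pid)
        if content is not None:
            return node.name, content
    return None


libraries = [CacheNode("library-1"), CacheNode("library-2")]
archives = [CacheNode("archive-1")]
networks = {"libraries": libraries, "archives": archives}

pid = libraries[1].store(
    "http://example.org/page", "2016-06-13T00:00:00Z", b"<html>snapshot</html>"
)
found = resolve(pid, networks)
```

Because the identifier depends only on the URL and timestamp, any node in any network can be asked for the same permalink, and the reference keeps working as long as at least one of them still holds the snapshot.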

What else should be included?  Those of you who have thought more carefully about this, please let me know how you see the current Web, and what your ideal networks would look like.