Are Web Archives Failing The Modern Web? Video, Social Media, Dynamic Pages And The Mobile Web

The Internet Archive is perhaps the most famous archive of the World Wide Web's evolution and history. Brewster Kahle founded it 20 years ago when he realized that the web, despite its pivotal role in reshaping how human society accessed and engaged with the world around us, was by its nature ephemeral and being lost with each passing day. The Archive set out to preserve the Internet through its formative years, acting as the internet equivalent of a library archive: accepting donations of crawl data, performing its own crawls and amassing all of this into a single digital catalog. Its success helped spur the myriad web archiving initiatives across the globe today, focused on everything from national culture to scientific data.

However, in its brief 24 years, the modern web has evolved with breathtaking speed from a simple, largely textual platform for sharing scientific research into a rich multimedia and increasingly intelligent network that seeks to connect every person on earth. This constant change has left much of the web archiving community behind, meaning that our archives are preserving less and less of the Internet even as that Internet powers more and more of the world around us.

For the most part, a web archiving crawler built 20 years ago could still function today: download a web page, extract its links, crawl each of those links, extract their links in turn, and so on, recording each page's HTML and images into the archive along the way. Styling like CSS might be missed by those early crawlers, though well-designed ones built for arbitrary resource identification would still function today, albeit not as efficiently.
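
To make that concrete, the entire crawling loop of that era can be sketched in a few dozen lines of Python. This is only an illustrative sketch, assuming the requests and BeautifulSoup libraries; the seed URL, page limit and output directory are placeholders rather than part of any real archive's pipeline.

```python
# Sketch of a circa-1997 archival crawler: fetch a page, save its raw HTML,
# extract its links and repeat. Seed URL, page limit and output directory
# are illustrative placeholders.
import hashlib
import os
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=100, out_dir="archive"):
    os.makedirs(out_dir, exist_ok=True)
    queue, seen = deque([seed]), {seed}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        fetched += 1
        # Record the raw response body, keyed by a hash of the URL.
        name = hashlib.sha1(url.encode()).hexdigest()
        with open(os.path.join(out_dir, name + ".html"), "wb") as f:
            f.write(resp.content)
        # Extract links from the static HTML and enqueue any not yet seen.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)

crawl("https://example.com")
```

Nothing in this loop executes JavaScript, fetches streaming media or emulates a mobile device, which is precisely why crawlers of this vintage now miss so much of the web described below.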

The problem is that the web is no longer built upon the simple premise of a collection of small static HTML and image files served up with a basic tag structure and readily parsed with a few lines of code. Today's web is richly dynamic, heavily multimedia and increasingly broken into walled gardens and device-specific parallel webs.

In particular, the web has been developing along four key evolutionary paths that have proved particularly problematic for the archiving community to preserve: multimedia, social media, dynamic content and the mobile web.

The web of today is a far cry from the web of 1995, when I launched my first web startup and web pages were essentially like book pages: for the most part piles of text with the odd image thrown in here and there for illustrative effect. Today the web is all about streaming video and audio, and even 4K videos are appearing in growing numbers on YouTube and other streaming sites. Multimedia is difficult to archive not only because of its size (it's quite easy to accumulate a few petabytes of HD video without much difficulty), but also because most streaming video sites don't make it easy to download the original source files. While numerous utilities exist that can reverse the streaming protocols used by major video hosting sites, the sites themselves rarely offer officially sanctioned APIs for bulk downloading large volumes of their content as raw video source files. In our device-centric world, in which we watch videos on large-format televisions, ultra-high-resolution desktops, low-resolution phones and everything in between, it is also important to recognize that streaming sites typically offer multiple versions of a video at different resolutions and compression levels that can result in dramatically different viewing experiences. The majority of video archiving solutions today focus on just the default version of a video or the highest resolution version, rather than attempting to archive all editions of a stream. Some platforms also go to great lengths to try to prevent unauthorized downloading of their content via special encodings, encryption and other protections.
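
To illustrate the gap between grabbing the default rendition and preserving every edition of a stream, the sketch below uses the Python API of youtube-dl, one of the community utilities alluded to above, to enumerate every published version of a single video and download each one. The video URL and output paths are placeholders, and this is a sketch rather than a production archiving pipeline.

```python
# Sketch: enumerate and archive every published rendition of a streaming video
# via youtube-dl's Python API (pip install youtube_dl). The video URL and
# output template are placeholders.
import os
import youtube_dl

VIDEO_URL = "https://www.youtube.com/watch?v=EXAMPLE_ID"  # hypothetical video
os.makedirs("archive", exist_ok=True)

# First pass: list every rendition (resolution, bitrate, container) the site publishes.
with youtube_dl.YoutubeDL({"quiet": True}) as ydl:
    info = ydl.extract_info(VIDEO_URL, download=False)

for fmt in info.get("formats", []):
    print(fmt.get("format_id"), fmt.get("ext"), fmt.get("height"))

# Second pass: download each rendition, not just the default or "best" version.
for fmt in info.get("formats", []):
    opts = {
        "quiet": True,
        "format": fmt["format_id"],
        "outtmpl": "archive/%(id)s." + fmt["format_id"] + ".%(ext)s",
    }
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download([VIDEO_URL])
```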

Archiving streaming video is not an unsolved problem, and widely used tools like the Internet Archive's Archive-It system actually include support for platforms like YouTube right out of the box. Yet support for streaming video remains fairly rare among the myriad web archiving projects underway today – the vast majority of them are unable to properly download a YouTube video and preserve it for posterity as part of their core crawling activity.

Social media offers perhaps the most intractable challenge to web archiving by virtue of the walled gardens being erected by the major social platforms. While Twitter has long offered a firehose of all of its public tweets, which is in fact archived by the Library of Congress, Facebook and many other platforms do not offer commercial data firehoses that archivists can simply plug into. Moreover, outside of Twitter nearly all of the major social platforms are moving towards extensive privacy controls and default settings that encourage posts to be shared only with friends. The trend today is no longer to broadcast one's every waking moment to the world, but rather to share intimate thoughts with friends and family. This means that even if a company like Facebook decided to make available a commercial data stream of all public content across its entire platform, the stream would capture only a minuscule fraction of the daily life of the platform's 2 billion users.

From a web archival standpoint, the major social media platforms are largely inaccessible for archiving. While tools exist to assist in bulk exporting posts from Facebook, the platform continually adapts its technical countermeasures and has utilized legal threats in the past to discourage bulk downloading and distribution of user data. Shifting social norms around privacy mean that regardless of technological or legal countermeasures, users are increasingly walling off their data and making it unavailable for the public access needed to archive it. In short, as social media platforms wall off the Internet, their new private parallel Internets cannot be preserved, even as society is increasingly relying on those new walled gardens to carry out daily life.

The dynamic web poses unique challenges to the simplistic crawlers used by many web archiving projects. For example, the CNN homepage uses JavaScript to render the majority of the page: a crawler that simply fetches the page and parses its static HTML as-is will fail to download or preserve most of it. Indeed, after CNN introduced the first iteration of its dynamic homepage in April 2015, a number of web archives ceased preserving anything other than the above-the-fold headlines – the rest of the homepage simply ceased to exist in their snapshots. When CNN rolled out another update to its homepage in November 2016, some web archives simply began displaying a blank page for every snapshot taken in the four months since.

This is because many web archiving projects today use crawlers built for the web of a quarter century ago, rather than the web of today. Many of the archival crawlers I've seen are extraordinarily simplistic, lacking the efficiency and stability enhancements standard in today's commercial production crawlers. Many are simply cobbled-together Python scripts or Java applications that are less robust than crawlers I wrote 23 years ago. Many archival crawlers expect static HTML pages in which the entirety of the page is contained in a single HTML response that can be processed as-is in isolation. Few incorporate refinements like JavaScript execution engines (such as Google's V8 engine) or full page rendering and DOM crawling, and thus have no possibility of rendering modern dynamic pages.

In contrast, Google's own crawlers appear to have supported basic JavaScript rendering at least as early as 2011, and by 2015 appear to have been fully rendering dynamically generated content inside the crawler, compiling the indexed version of each page via DOM traversal. This means Google's crawlers "see" pages the same way a modern web browser does and therefore have no issues with dynamic content like the CNN homepage.
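
An archival crawler can approximate the same behavior by rendering each page in a headless browser before storing it. The sketch below assumes Selenium together with a matching headless Chrome and ChromeDriver installation; the CNN URL, the fixed wait and the output file are illustrative choices, not part of any existing archive's tooling.

```python
# Sketch: render a JavaScript-heavy page in headless Chrome before archiving it,
# so the stored HTML reflects the DOM a real browser would build.
# Assumes Selenium plus matching Chrome/ChromeDriver; URL and wait are illustrative.
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.cnn.com/")   # example of a page assembled client-side
    time.sleep(5)                        # crude pause to let scripts populate the DOM
    rendered_html = driver.page_source   # serialized post-JavaScript DOM
    with open("cnn_rendered.html", "w", encoding="utf-8") as f:
        f.write(rendered_html)
finally:
    driver.quit()
```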

Building Google-style dynamic crawlers with inbuilt JavaScript support is actually not that difficult, especially with the availability of Google's V8 engine and JavaScript-first environments like Node.js. Scaling such crawlers to crawl the open web and process billions of pages efficiently is a different matter, but hybrid approaches, such as using lightweight scout crawlers or filters to identify sites and pages that rely on dynamic rendering and then recrawling those pages with a V8-powered crawler, can act as a useful bridge.
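
One way such a scout filter might work, purely as an illustrative heuristic: compare the amount of visible text in the raw HTML response with the amount in a headless-browser rendering (such as the one sketched above) and route pages where rendering adds substantially more content to the heavier JavaScript-executing crawler. The threshold below is an arbitrary assumption.

```python
# Sketch of a "scout" filter: compare visible text in the raw HTML response to
# visible text in a headless-browser rendering and flag pages that need the
# heavier JavaScript-executing crawler. The threshold is an arbitrary guess.
import requests
from bs4 import BeautifulSoup

def visible_text_length(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return len(soup.get_text(separator=" ", strip=True))

def needs_dynamic_crawl(url, rendered_html, ratio_threshold=2.0):
    static_html = requests.get(url, timeout=10).text
    static_len = max(visible_text_length(static_html), 1)
    rendered_len = visible_text_length(rendered_html)
    # If rendering multiplies the visible text, the page is built client-side
    # and should be routed to the V8/headless-browser crawler.
    return rendered_len / static_len >= ratio_threshold
```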

No matter what approach they choose, the simple fact of the matter is that the era of using traditional static HTML web crawlers for archival work has ended. A web archive that uses crawlers that cannot render dynamic JavaScript-powered web pages simply cannot robustly access and preserve the increasingly dynamic web.

Finally, if the inability to index dynamic content prevents web archives from preserving the dynamic web, then the failure of many web archives to consider mobile content prevents them from preserving the mobile web. Over the last few years Internet users have increasingly turned to mobile devices, from cellphones to tablets, to access the Internet. From early mobile-optimized sites to today's mobile-first world, the Internet is gradually leaving its desktop roots behind. Google has been a powerful force behind this transition, penalizing sites in its search rankings that do not offer mobile-friendly versions.

Yet many of the web archives I've perused fail to robustly index this parallel web. Few actively scan pages for tags indicating the availability of AMP or mobile editions and automatically crawl and index those editions. Even those that do look for AMP pages do not always switch to a mobile user agent and mobile emulation to fetch them. An increasing number of servers inspect the user agent field and deny access to the mobile edition of a page unless the client is an actual mobile device, meaning an ordinary crawler requesting a mobile page, but presenting its standard desktop user agent, will simply be redirected to the desktop version of the page. Some sites go even further, returning versions tailored for tablets versus smartphones and even targeting specific devices for truly customized experiences, requiring multiple device emulations to fully preserve a page in all its forms.
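
A minimal sketch of what such mobile awareness could look like: fetch the page while presenting a mobile User-Agent so user-agent-sniffing servers return their mobile edition, then check the returned HTML for the link tag with rel="amphtml" that advertises an AMP edition. The user agent string below is an illustrative example rather than a canonical value, and real mobile emulation would also adjust viewport and device metrics.

```python
# Sketch: fetch a page as a mobile client and discover its AMP edition, if any.
# The user agent string is an illustrative mobile UA, not a canonical value.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

MOBILE_UA = ("Mozilla/5.0 (Linux; Android 8.0; Pixel 2) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/65.0.3325.181 Mobile Safari/537.36")

def fetch_mobile_and_amp(url):
    # Identify as a mobile browser so user-agent-sniffing servers return their
    # mobile edition rather than bouncing the request to the desktop page.
    resp = requests.get(url, headers={"User-Agent": MOBILE_UA}, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    # Pages offering an AMP edition advertise it via <link rel="amphtml" href="...">.
    amp_url = None
    for link in soup.find_all("link"):
        rels = link.get("rel") or []
        if "amphtml" in rels and link.get("href"):
            amp_url = urljoin(url, link["href"])
            break
    return resp.text, amp_url
```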

Adding mobile web support to web archives is fairly trivial, yet it is remarkable how few archives have implemented complete, robust mobile support. Even those that offer basic mobile crawling rarely fetch every version of a page to test how differences in device and screen capabilities affect the returned content and the level of dynamic customization in use.

Putting this all together, the incredible vision of preserving the web for future generations has led to myriad projects tasked with crawling and saving copies of the world's websites. However, with a few exceptions the web archiving community is still stuck in a quarter-century-old mindset of how the web works and has largely failed to adapt to the rapidly evolving world of video, social media walled gardens, dynamic page generation and the mobile web. Some of these challenges have no easy answers while others are trivial to address, but all of them suggest that greater collaboration is needed between the archiving community and the broader technology industry, especially the companies that build the state-of-the-art crawling infrastructures powering modern web services. In the end, truly preserving today's web requires a lot more than the simple crawlers that sufficed 24 years ago.