Web Archiving services at the National Archives: Notes from a Webinar - by Clare Brown

Clare believes a satisfying career should be intellectually stimulating and constantly mutating, so thankfully she opted for a life in information provision. For 20+ years she implemented, improved and managed information services within commercial, legal and governmental contexts. But the most interesting career challenge to date has been her recent move to the supplier/tech side. Clare can be contacted at: https://www.linkedin.com/in/clareangelabrown/ & www.vable.com


Webinar: "Web archiving services at the National Archives" (Wednesday 3 February 2021)

Speaker: Tom Storrar - Web Archiving Service Owner at The National Archives

Chair: Fiona Laing (currently Chair of SCOOP – Standing Committee on Official Publications)


Do you remember when we used to pay educational visits to physical archives? These events were always a privilege; to go behind the scenes and breathe in the organisational magic of rolling stacks and special storage units. When I read that Tom Storrar, Web Archiving Service Owner at The National Archives (TNA), was presenting a webinar, I immediately signed up and was excited to join CILIP GIG colleagues online.

 

We weren’t disappointed. Tom and his colleagues (seven full-time and one part-time) have the important role of officially preserving the UK government’s online material. Technology has often run ahead of government, leaving researchers stranded. The issue of missing or inaccessible online government information has caused problems in the past and was raised in parliament, for instance in 2006 and 2009:

 

Mr. Heald: To ask the Minister of State, Department for Constitutional Affairs what steps she is taking to ensure the long-term preservation of documents held in digital form.

 

Ms Harman: The National Archives is working with the Government’s Chief Technical Officers (CTO) Council to address the problem of the survival of electronic records with a mid and long-term value across Government. The National Archives has implemented a Digital Preservation Programme to ensure the long-term preservation of documents held in digital form. It has established:

  • a Digital Archive facility, in which it preserves a wide range of electronic records transferred by Government departments;

  • a Web Archiving Programme to preserve government websites of long-term value;

  • the National Digital Archive of Datasets to preserve historically significant datasets created over the past thirty years.

 

Mr. Gordon Prentice: What is the Government’s policy on archiving online information? I have found that some of the most interesting material that I want to get hold of has been archived and is inaccessible.

 

Mr. Watson: My hon. Friend raises an important point. In fact, we do not have access to the first ever Government website, which came out in 1994; the technology to track it down is not available. It was, of course, the website of a previous Government. My hon. Friend will be pleased to know that the website rationalisation programme will have closed some 1,500 websites by 2011. The National Archives are taking the lead on that, so that important information that my hon. Friend and future generations need to find will be accessible for generations to come.


What is the story of the UK Government Web Archive?

Even before that 2006 exchange, the National Archives were on the case: from 2003 they had been preserving a small selection of UK central government websites. By 2008 the scope had been expanded, with 2012 seeing the addition of social media. In 2017 they switched their service provider to MirrorWeb Ltd, which has given the archiving team a technical edge - for instance, the ability to improve the site’s search functionality.

 

The archive comprises more than 40,000 crawls/snapshots of over 5,000 websites and over 500 social media accounts. At approximately 160 TB and some 6 billion resources, it is an important tool for contextualising records for past, present and future research. After all, in the future someone might write a PhD exploring the influence of government advertising on the sales of beef and lamb!


What is web archiving, and why is it important?

Tom agrees with the Wikipedia definition of web archiving, which says, 

 

Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Web archivists typically employ web crawlers for automated capture due to the massive size and amount of information on the Web.
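
To make the crawler idea concrete, here is a minimal sketch of the basic capture step in Python, using only the standard library: fetch a page and store it under a timestamped path. This is purely illustrative and not the tooling TNA uses; production archiving crawlers (e.g. Heritrix) write standardised WARC files and handle politeness, deduplication and media types.

    # Minimal illustration of a single "capture": fetch a URL and save
    # the response body under a timestamped path. Real archiving crawlers
    # write WARC files rather than loose HTML snapshots.
    import urllib.request
    from datetime import datetime, timezone
    from pathlib import Path

    def capture(url: str, archive_dir: str = "archive") -> Path:
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
        with urllib.request.urlopen(url) as response:
            body = response.read()
        out_dir = Path(archive_dir) / timestamp
        out_dir.mkdir(parents=True, exist_ok=True)
        snapshot = out_dir / "snapshot.html"
        snapshot.write_bytes(body)
        return snapshot

    capture("https://www.gov.uk/")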

 

One of the first experiences that made me question my librarian abilities occurred in May 1997, just after the general election. The change in government meant that the embryonic government websites on which we were starting to rely vanished overnight. I instantly regretted my lack of foresight and wished we’d printed out all that online material.

 

As a consequence, our law library created paper files of government press releases, guidance notes, manuals, reports, white papers, etc., which had to be updated daily - the role of the junior team member. When I see those government website iterations from the late 1990s, it brings back many filing memories!


What do the National Archives capture?

We have come a long way since those classic static sites, and the advent of government broadcasting on social media - Twitter and Flickr - means that even more information needs to be captured and archived. Web archiving operates within certain technical constraints; for content to be archived, it must be:

 

       Publicly available

       Reachable by robots/crawlers

 

If the web archiving team is informed about web resources that do not meet the above criteria, they can intervene and capture them using state-of-the-art tools such as Conifer. The captured pages then undergo quality assurance checks before they are published - occasionally rogue code causes readability issues, but this is inevitable when you are dealing with this amount of data and such a wide variety of web sources.
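
As a rough illustration of the “reachable by robots/crawlers” constraint, the sketch below uses Python’s standard-library robots.txt parser to check whether a given URL would be fetchable by a crawler. The user agent string is an invented placeholder, not TNA’s actual crawler identity.

    # Check whether a crawler may fetch a URL according to the site's
    # robots.txt. The user agent below is hypothetical.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.gov.uk/robots.txt")
    rp.read()

    if rp.can_fetch("example-archive-bot", "https://www.gov.uk/government/news"):
        print("Crawlable: the page can be captured automatically.")
    else:
        print("Blocked: the archiving team would need to intervene manually.")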

 

Currently more than 800 distinct websites and social media accounts are regularly archived: those of central government departments and other public bodies, hub sites (e.g. GOV.UK and the NHS), public inquiries, and some inquests. They take as much as possible from the target website:

 

       Publications, datasets, documentation

       Images

       Video, animations

 

Their approach is to take a “deep” and complete capture of every website they archive, with an emphasis on quality, completeness, and fidelity. Obviously there are circumstances when pages need to be taken down - when there are errors or non-governmental material, or when something is subject to data protection - otherwise everything is conserved.


Special event-based archiving projects

It would be impossible - and unnecessary - to capture everything, every day, so they have a schedule for regular material. However, some captures are triggered when a website is about to be retired, refreshed, or redesigned. Events of national importance such as Brexit (EEWA) have become web archive projects in their own right and have required daily web archiving.

 

Tom outlined their three-pronged approach (a hypothetical capture plan is sketched after this list):

  • They increased the frequency of captures of key sites and resources
  • They supplemented these with broader weekly and fortnightly keyword-generated crawls across the government web estate
  • They captured daily snapshots of complex, interactive (forms etc.), or very fast-changing content using web crawlers and/or Conifer
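
Purely as an illustration of how such a schedule might be expressed (every site name, keyword, and frequency below is invented, not TNA’s actual configuration), a three-tier capture plan could look like this:

    # Hypothetical capture plan mirroring the three-pronged approach.
    # All targets, keywords, and frequencies are invented for illustration.
    capture_plan = {
        "key_sites": {
            "frequency": "daily",
            "targets": ["www.gov.uk/transition", "www.gov.uk/eubusiness"],
        },
        "keyword_crawls": {
            "frequencies": ["weekly", "fortnightly"],
            "keywords": ["brexit", "eu exit", "transition period"],
        },
        "complex_or_fast_changing": {
            "frequency": "daily",
            "tools": ["web crawler", "Conifer"],  # forms, interactive content
        },
    }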

 

The search function is vital because the archive is huge and these projects are important. You have several options: a specific social media search, the general search, a fascinating A-Z, a URL search, and the Discovery catalogue-style search. Discovery holds more than 32 million descriptions of records held by TNA and more than 2,500 archives across the country.


How people are coming together to help the National Archives

Web archiving is the responsibility of every government department, and it can only be done with the assistance and cooperation of other people. TNA ask departments to:

 

       Make sure that their content is “crawlable” - technical guidance is available

       Provide XML sitemap(s), especially for content hidden behind hard-to-crawl functionality (a minimal example follows this list)

       Ensure that the website’s copyright and reuse statement is clear

       Review the takedown policy, and check that existing archived content is within the rules

       Consider archiving timescales and remember that it is not an instantaneous process

       And check the capture before retiring, pruning, taking down or deleting a website!
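
As a minimal illustration of the sitemap request above, the snippet below uses the standard sitemaps.org XML format; the URLs and dates are invented placeholders, not real government pages.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- One <url> entry per page; <lastmod> helps crawlers spot changed content -->
      <url>
        <loc>https://www.example.gov.uk/guidance/some-policy</loc>
        <lastmod>2021-01-15</lastmod>
      </url>
      <url>
        <loc>https://www.example.gov.uk/publications/annual-report</loc>
        <lastmod>2020-12-01</lastmod>
      </url>
    </urlset>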

 

The official status of TNA makes inter-departmental relationship building easier, and to raise their profile they hold events and host webinars. Should people need advice on tech, copyright, or new (or old!) websites, they are invited to get in touch with the National Archives team.


The EU Exit Web Archive (Brexit archives)

Part 1 of Schedule 5 to the European Union (Withdrawal) Act 2018 places a statutory obligation on the National Archives to make arrangements for the publication of EU legislation relevant to the UK after exit. They have created an EU Exit Web Archive, which contains a wide selection of documents taken from EUR-Lex - all relevant content published up to the completion of the implementation period at 11pm GMT on 31st December 2020.

 

This archive is designed to be a comprehensive and official UK reference point for EU law as it stood at the UK’s exit. To get a complete picture, they archived/harvested more information than was strictly necessary, for instance case law. It is a tool for supporting legal certainty and “showing our working”, and it provides provenance and contextualisation for legislation.gov.uk. Legal researchers will already be using this archive!

 

Anyone who has used the EUR-Lex website will appreciate its complexity, which is partly why the data capture was such a challenge. They developed a novel data-driven archiving approach, which ensured complete coverage of all document types, and they maintained a high level of quality assurance so that capture was as precise as possible.

 

They made use of EUR-Lex’s existing metadata to separate out iterations of the website according to when content was published or last modified. They completed 13 such iterations, all the way up to 11pm GMT on 31st December 2020, resulting in an archive of over 30 million resources. They continue to work on the search functionality.
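
As a rough sketch of that “iteration” idea (the field names, cut-off dates, and records below are invented; the real EUR-Lex metadata model is far richer), documents can be bucketed by the first capture window whose cut-off covers their last-modified date:

    # Illustrative only: assign document records to capture "iterations"
    # by last-modified date. All values here are invented placeholders.
    from datetime import date

    cutoffs = [date(2019, 3, 29), date(2020, 1, 31), date(2020, 12, 31)]

    records = [
        {"celex": "32016R0679", "last_modified": date(2018, 5, 25)},
        {"celex": "32019R0452", "last_modified": date(2020, 10, 11)},
    ]

    def iteration_for(record: dict) -> int:
        """Index of the first iteration whose cut-off covers the record."""
        for i, cutoff in enumerate(cutoffs):
            if record["last_modified"] <= cutoff:
                return i
        return len(cutoffs) - 1  # anything later goes in the final iteration

    for r in records:
        print(r["celex"], "-> iteration", iteration_for(r))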

 

I have enjoyed browsing the government archives and it is nice to remember some of the best Whitehall feline civil servants and revisit happier times.