Web Archiving services at the National Archives: Notes from a Webinar - by Clare Brown
Clare believes a satisfying career should be intellectually stimulating and constantly mutating, so thankfully she opted for a life in information provision. For 20+ years she implemented, improved and managed information services within commercial, legal and governmental contexts. But the most interesting career challenge to date was her recent move to the supplier/tech side. Clare can be contacted at: https://www.linkedin.com/in/clareangelabrown/ & www.vable.com
Webinar: "Web archiving services at the National Archives" (Wednesday 3 February 2021)
Speaker: Tom Storrar - Web Archiving Service Owner at The National Archives
Chair: Fiona Laing (currently Chair of SCOOP – Standing Committee on Official Publications)
Do you remember when we used to pay educational visits to physical archives? These events were
always a privilege: a chance to go behind the scenes and breathe in the organisational
magic of rolling stacks and special storage units. When I read that Tom
Storrar, Web Archiving Service Owner at The National Archives (TNA), was
presenting a webinar, I immediately signed up and was excited to join CILIP GIG
colleagues online.
We weren’t disappointed. Tom and his colleagues (seven full-time and one part-time) have the
important role of officially preserving the UK government’s online material.
Technology has often run ahead of government, which has left researchers
stranded. The issue of missing or inaccessible online government information
has caused problems in the past, and was raised in parliament, for instance in
2006 and 2009:
Mr. Heald: To
ask the Minister of State, Department for Constitutional Affairs what steps she
is taking to ensure the long-term preservation of documents held in digital
form.
Ms Harman: The
National Archives is working with the Government’s Chief Technical Officers
(CTO) Council to address the problem of the survival of electronic records with
a mid and long-term value across Government. The National Archives has
implemented a Digital Preservation Programme to ensure the long-term
preservation of documents held in digital form. It has established a Digital
Archive facility, in which it preserves a wide range of electronic records
transferred by Government departments;
- a Web Archiving Programme to preserve government websites of long-term value;
- the National Digital Archive of Datasets to preserve historically significant datasets created over the past thirty years.
Mr. Gordon
Prentice: What is the Government’s policy on archiving online information? I
have found that some of the most interesting material that I want to get hold
of has been archived and is inaccessible.
Mr. Watson: My
hon. Friend raises an important point. In fact, we do not have access to the
first ever Government website, which came out in 1994; the technology to track
it down is not available. It was, of course, the website of a previous
Government. My hon. Friend will be pleased to know that the website
rationalisation programme will have closed some 1,500 websites by 2011. The
National Archives are taking the lead on that, so that important information
that my hon. Friend and future generations need to find will be accessible for
generations to come.
What is the story of the UK Government Web Archive?
Even in 2006 the National Archives were on the case: as early as 2003 they were preserving a small
selection of UK central government websites. By 2008 the scope had been
expanded, with 2012 seeing the addition of social media. In 2017 they switched
their service provider to MirrorWeb Ltd, which gives the archiving
team a technical edge, for instance in being able to improve the site’s search
functionality.
The UK Government Web Archive comprises more than 40,000 crawls/snapshots of over 5,000 websites and over 500 social
media accounts. It is approximately 160 TB in size, holds 6 billion resources, and is an
important tool for contextualising records for past, present and future
research. After all, in the future someone might write a PhD exploring the
influence of government advertising on the sales of beef and lamb!
What is web archiving, and why is it important?
Tom agrees with
the Wikipedia definition of web archiving,
which says,
Web archiving is the process of collecting portions of the World Wide Web
to ensure the information is preserved in an archive for future researchers,
historians, and the public. Web archivists typically employ web crawlers for
automated capture due to the massive size and amount of information on the Web.
One of the first
experiences that made me question my librarian abilities occurred in May 1997,
just after the general election. The change in government meant that the
embryonic government websites on which we were starting to rely vanished
overnight. I instantly regretted my lack of foresight and wished we’d printed
out all that online material.
As a consequence of this, our law library created paper files of government press releases,
guidance notes, manuals, reports, white papers etc., which had to be updated
daily - the role of the junior team member. When I see those government website
iterations from the late 1990s, it brings back many filing memories!
What do the National Archives capture?
We have come a
long way since those classic static sites, and the advent of government
broadcast on social media - Twitter and Flickr - means that even more
information needs to be captured and archived. Web archiving operates within
certain technical constraints and for it to be archived, content must be:
● Publicly available
● Reachable by robots/crawlers
If the web archiving team is informed about web resources that do not meet the above
criteria, they can intervene and capture them using state-of-the-art tools such
as Conifer.
The captured pages then undergo quality assurance checks before they are
published - occasionally rogue code causes readability issues, but this is
inevitable when you are dealing with this amount of data and a wide variety of
web sources.
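To make the "reachable by robots/crawlers" criterion a little more concrete, here is a minimal sketch of how a site owner might check whether a page is open to crawlers according to the site's robots.txt rules. It uses Python's standard `urllib.robotparser` module. The `is_crawlable` helper, the sample rules and the example.org URLs are all hypothetical illustrations; nothing here was shown in the webinar, and it is not a description of the tooling the web archiving team actually uses.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    # Parse the rules from a string so the example needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: everything is crawlable except the /private/ area.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_crawlable(EXAMPLE_ROBOTS, "https://example.org/publications/report.pdf"))
print(is_crawlable(EXAMPLE_ROBOTS, "https://example.org/private/draft.html"))
```

A check like this only covers the robots.txt side of things. Content hidden behind logins, forms or heavy scripting can still be out of a crawler's reach even when robots.txt allows it.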
Currently more than 800 distinct websites and social media accounts are regularly archived:
those of central government, departments, and other public bodies, hub sites
(e.g. GOV.UK, NHS), public inquiries, and some inquests. They take as much as
possible from the target website:
● Publications, datasets, documentation
● Images
● Video, animations
Their approach is to take “deep” and complete captures of every website they archive, with an
emphasis on quality, completeness, and fidelity. Obviously there are
circumstances when pages need to be taken down: when there are errors,
non-governmental material, or something is subject to data protection -
otherwise everything is conserved.
Special event-based archiving projects
It would be impossible - and unnecessary - to capture everything every day, so they have a
schedule for regular material. However, some captures are triggered when a
website is about to be retired, refreshed, or redesigned. Events of national
importance such as Brexit (the EU Exit Web Archive, EEWA) have become web archive projects in their own
right and have required daily web archiving.
Tom outlined
their three-pronged approach:
- They increased the frequency of captures of key sites and resources
- They supplemented the frequency with (weekly and fortnightly) keyword-generated broader crawls across the government web estate
- They captured daily snapshots of complex, interactive (forms etc.), or very fast-changing content using web crawlers and/or Conifer
The search function is vital because the archive is huge and these projects are important.
You have several options: a specific social media search, the general search, a fascinating A-Z, a URL search, and the Discovery
catalogue-style search. Discovery holds more than 32 million
descriptions of records held by TNA and more than 2,500 archives across the
country.
How people are coming together to help the National Archives
Web archiving is
the responsibility of every government department and it can only be done with
the assistance and cooperation of other people. TNA ask departments to:
● Make sure that their content is “crawlable” - there is technological guidance
● Provide XML sitemap(s), especially for content behind inaccessible functionality
● Ensure that the website’s copyright and reuse statement is clear
● Review the takedown policy, and check that existing archived content is within the rules
● Consider archiving timescales and remember that it is not an instantaneous process
● And check the capture before retiring, pruning, taking down or deleting a website!
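For anyone wondering what providing an XML sitemap involves, the sketch below builds a very small one with Python's standard library, following the public sitemaps.org protocol (the `urlset`, `url`, `loc` and `lastmod` elements). The `build_sitemap` function and the example.org pages are hypothetical; TNA's own guidance should be the reference for what they actually need from a department.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build a sitemap XML string from (url, last_modified) pairs.

    Hypothetical helper; last_modified is an ISO date string such as "2021-02-03".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages that a crawler could not discover by following links.
sitemap_xml = build_sitemap([
    ("https://example.org/reports/annual-2020.pdf", "2021-01-15"),
    ("https://example.org/search-only/guidance-note", "2020-11-30"),
])
print(sitemap_xml)
```

The point of listing such URLs is that a crawler can then find content that sits behind search boxes or other functionality it cannot operate itself.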
The official status of TNA makes inter-departmental relationship building easier, and to
raise their profile they hold events and host webinars. Should people need
advice on tech, copyright, or new (or old!) websites, then they are invited to
get in touch with the National Archives team.
The EU Exit Web Archive (Brexit archives)
Part 1 of Schedule 5 to the European Union
(Withdrawal) Act 2018 places a statutory obligation on the National
Archives to make arrangements for the publication of EU legislation relevant to
the UK after exit. They have created an EU Exit Web Archive, which contains a
wide selection of documents taken from EUR-Lex - all relevant content published
up to the completion of the implementation period at 11pm GMT on 31st December
2020.
This archive is
designed to be a comprehensive and official UK reference point for EU law as it
stood at the UK’s exit. To get a complete picture, they archived/harvested more
information than was necessary, for instance case law. It is a tool for
supporting legal certainty, “showing our working”, identifying provenance and
contextualisation for legislation.gov.uk. Legal researchers will already be
using this archive!
Anyone who has used the EUR-Lex website has an
appreciation for its complexity, which is partly why the data capture was
such a challenge. They developed a novel data-driven archiving approach, which
ensured complete coverage for all document types. They maintained a high level
of quality assurance so that capture was as precise as possible.
They made use of
EUR-Lex’s existing metadata so that they could separate all iterations of the
website, depending on when the content was published or last modified. They
completed 13 of these iterations, all the way up to 11pm GMT on 31st December
2020. This resulted in an archive of over 30 million resources. They continue
to work on search functionality.
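The webinar did not go into how the iterations were separated, so the following is only a guess at the general idea: if each document carries a "last modified" date in its metadata, it can be assigned to the first capture iteration whose cut-off falls on or after that date. The cut-off dates, the `assign_iteration` function and the sample dates below are all made up for illustration and are not TNA's actual method or data.

```python
from bisect import bisect_left
from datetime import date


def assign_iteration(last_modified, cutoffs):
    """Return the index of the first iteration whose cut-off is on or after
    last_modified, or None if the document is newer than the final cut-off.

    Hypothetical helper; cutoffs must be sorted in ascending order.
    """
    index = bisect_left(cutoffs, last_modified)
    return index if index < len(cutoffs) else None


# Hypothetical cut-off dates for three capture iterations.
CUTOFFS = [date(2020, 10, 31), date(2020, 11, 30), date(2020, 12, 31)]

print(assign_iteration(date(2020, 11, 15), CUTOFFS))
print(assign_iteration(date(2021, 1, 5), CUTOFFS))
```

Driving the capture from metadata like this, rather than from following links, is one way to be confident that every document type has been covered.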
I have enjoyed
browsing the government archives and it is nice to remember some of the best Whitehall
feline civil
servants and revisit happier times.