British Library Web Archiving - Webinar report by Luca Fois
Luca Fois is Librarian for the Sheriff Courts Library Service. He has prior experience in public and school libraries, and has worked in the Government/Legal sector for the past 2 years. His main interests are user engagement, data analysis and visualisation, and literacies.
Presenter: Jason Webber, Web Archiving Engagement Manager, British Library
Chair: Fiona Laing, Official Publications Curator, National Library of Scotland
Image source: www.bl.uk
The UK Web Archive (UKWA) is a preservation
project born from the collaboration between the six UK Legal Deposit Libraries:
Bodleian Libraries, British Library, Cambridge University Library, National
Library of Scotland, National Library of Wales and Trinity College Dublin. The
main aim of the project is to take a snapshot of the corpus of UK websites at
various points in time so that they can be retrieved and accessed in the
future as they were at the moment of preservation, even if the site is later
taken down or modified in any way. All the content in the preserved material,
such as attached files or videos, should also be preserved and work as it did
at the time of preservation.
As mentioned, the UKWA aims to collect the
entirety of UK web pages. This is in itself a very challenging task. Not all
UK websites’ domains end with .uk, .scot, .wales, .cymru or .london, and some
pages are hosted under domains that cannot easily be traced back to the UK
(like a .com domain). For these reasons UKWA also collects websites that are
hosted on a server located in the UK, websites that contain UK addresses, and
websites for which the owner confirms that they are resident in the UK or that
their business is with the UK.
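The scoping criteria described above can be sketched as a simple check. This is only my own illustration of the rules as presented in the webinar, not UKWA's actual crawler logic: the `SiteInfo` record and its three flags are hypothetical names, and in practice establishing where a site is hosted or whether it shows a UK address is the hard part.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Top-level domains named in the webinar as recognisably UK-based.
UK_TLDS = (".uk", ".scot", ".wales", ".cymru", ".london")


@dataclass
class SiteInfo:
    """Hypothetical record of what is known about a candidate website."""
    url: str
    hosted_in_uk: bool = False       # server is located in the UK
    has_uk_address: bool = False     # site content shows a UK address
    owner_confirms_uk: bool = False  # owner confirms UK residence or business


def in_scope(site: SiteInfo) -> bool:
    """Return True if the site meets any one of the collection criteria."""
    host = urlparse(site.url).hostname or ""
    if host.endswith(UK_TLDS):
        return True
    return site.hosted_in_uk or site.has_uk_address or site.owner_confirms_uk
```

For example, `in_scope(SiteInfo("https://example.com/", has_uk_address=True))` is true even though the domain itself gives no hint of a UK connection.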
The project started in 2005, and back then
a website could be included in the archive only when the owner agreed to its
preservation. The method of collection changed in 2013 following the legal
changes brought in by the Legal Deposit Libraries (Non-Print Works) Regulations
2013. The change in regulation meant that, from the roughly 20,000
websites collected using the permission method, UKWA moved to preserving
millions of records, and the number of records increases daily, as does the
amount of content created. The ephemeral content of web pages is preserved
on very tangible storage nodes in four different locations
that are constantly connected to each other, checking that the material is
functional and accessible and, if needed, repairing it and backing it up in all
locations. If you would like to know more about the technical
aspects of the project, there is also an all-encompassing FAQ
page.
There are different methods of collecting
websites. An annual domain crawl automatically collects all the websites
recognised as UK-based according to the criteria mentioned above, plus the
websites that are nominated by users: you can also nominate any website you
deem worth collecting to the UKWA using this form. News and social media
websites are collected on a more regular basis, and this also applies to
websites whose content changes frequently and whose changes have an impact on
the reader or society at large, like the mygov.scot website.
Access to the archive depends on
the method of collection: when the owner of the website has given permission
for preservation, the archived website is usually accessible through the link
shown on the UKWA search results page. All the remaining websites can be
accessed only from one of the 9 terminals available in the participating
legal deposit libraries. This is an unfortunate limitation,
especially considering the current restrictions due to the pandemic.
The most interesting feature for exploring
the huge corpus of the collection is the curated topics and themes section.
This requires a lot of hands-on work by the curators of the collections, but
the results give users an easier way to approach various topics. Having
had a look at the website after the webinar, I explored the collection on
LGBTQ+ websites, as well as the one on the Scottish Independence Referendum.
For the latter, all the material preserved in the archive would have been
completely lost had it not been saved on the UKWA servers. It gives a complete
picture of the discourse on Scottish independence as it was happening on the
internet during the referendum period. For the former, I want to note that
websites reporting controversial opinions on the LGBTQ+ community are also
saved, which will prove invaluable to researchers and historians looking back
at how these topics were discussed. I thoroughly appreciate that,
for the benefit of the user, the UKWA put a disclaimer mentioning that the
collection “may contain websites that may be considered transphobic or
homophobic. It is a principle of [their] collection development policy, underpinned
by Legal Deposit Regulations, to include, uncensored, everything that is
published online in the UK for the benefit of future researchers.”
Image source: www.bl.uk
In addition to the presentation of the
website ‘as preserved’, UKWA has also made available a wealth of secondary
datasets. To access these you will need to use a dedicated website, webarchive.org.uk/shine.
There are two main tools which allow a researcher to carry out a textual
analysis on the 3.5 billion indexed items collected between 1996 and April
2013: a search function, which looks for the words in the body of the website,
and a trend function. Unfortunately, the search results are not sorted by
relevance, so further filtering and more complex queries are needed to find
what you are looking for. The trend function takes a more visual approach: it
plots the frequency with which a word or a set of words appears in the
collection of websites over a particular time frame. I found this interesting
to use. Considering the vastness of the collection, some searches take a while
to complete.
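To make the idea behind the trend function concrete, here is a minimal sketch of how such a frequency could be computed. It is my own toy illustration, not the way the UKWA tool is actually implemented: the `trend` function, the `(year, text)` input format and the sample pages are all invented, and I am assuming that the frequency is the percentage of pages from each year that contain the phrase.

```python
from collections import defaultdict


def trend(pages, phrase):
    """Return {year: percentage of that year's pages containing phrase}.

    `pages` is an iterable of (year, text) pairs; matching ignores case.
    """
    totals = defaultdict(int)  # pages seen per year
    hits = defaultdict(int)    # pages per year that contain the phrase
    needle = phrase.lower()
    for year, text in pages:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


# Invented sample data, for illustration only.
sample = [
    (2010, "Campaigners protest against library cuts."),
    (2010, "Opening hours for the local library."),
    (2011, "More library cuts announced."),
    (2011, "Library cuts spark petition."),
]
print(trend(sample, "library cuts"))  # {2010: 50.0, 2011: 100.0}
```

Working with a percentage per year rather than a raw count matters because the number of archived pages differs greatly from year to year.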
I tried a quick search for the occurrences of the expressions “library
cuts” and “library funding” in the body of text: it is interesting to note how
the usage of the former expression increased towards the end of the period
available, with a peak in 2011, while the use of “library funding” decreased
over time, with the minimum occurrence in 2013. Another search I did was
on the frequency of the terms lesbian, gay, bisexual and transgender in the
corpus. I was not surprised to see a peak of occurrences for
the words lesbian and gay in 2000, while transgender and bisexual are just
lurking at the bottom of the graph, with a very low percentage of occurrences;
this gives an opportunity for reflection in terms of the different
representation of the people in the LGBTQ+ community. However, most of the
occurrences come from amazon.co.uk, so there is certainly more exploration to be
done on this topic, which falls outside the scope of this short blog post.
These are just a few simple examples, but the possibilities for research are
really endless! On top of this, it is also possible to filter searches by
postcode, date or website, so just go and play with the tools if you are
interested.
This project is already invaluable to all the legal deposit libraries in
preserving the potentially infinite production of online material which would
otherwise be lost to the passage of time. Historians, researchers and people
with different interests will all be able to consult the recorded websites to
gain a better understanding and a clearer picture of many trends and societal
changes as they were reported on the internet at the time.