British Library Web Archiving - Webinar report by Luca Fois

Luca Fois is Librarian for the Sheriff Courts Library Service. He has prior experience in public and school libraries, and has worked in the Government/Legal sector for the past 2 years. His main interests are user engagement, data analysis and visualisation, and literacies.


British Library Web Archiving

Presenter: Jason Webber - Web Archiving Engagement Manager – British Library

Chair: Fiona Laing - Official Publications Curator - National Library of Scotland

 


British Library exterior at Saint Pancras, London
Image source: www.bl.uk



The UK Web Archive (UKWA) is a preservation project born from the collaboration of the six UK Legal Deposit Libraries: Bodleian Libraries, British Library, Cambridge University Library, National Library of Scotland, National Library of Wales and Trinity College Dublin. The main aim of the project is to take snapshots of the corpus of UK websites at various points in time so that they can be retrieved and accessed in the future as they were at the moment of preservation, even if the site is later taken down or modified. All content within the preserved material, such as attached files or videos, should also be preserved and work as it did at the time of preservation.


As mentioned, the UKWA aims to collect the entirety of the UK web. This is in itself a very challenging task. Not all UK websites have domains ending in .uk, .scot, .wales, .cymru or .london, and some are hosted under domains that cannot be traced back to the UK (such as a .com domain). For these reasons UKWA also collects websites hosted on a server located in the UK, websites that contain UK addresses, and websites whose owner confirms that they are resident in the UK or that their business is with the UK.
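The inclusion criteria described above can be read as a simple "any one of these is enough" rule. The following is only an illustrative sketch of that reading, not the UKWA's actual crawler logic: the function name `in_scope` and its three boolean flags are hypothetical, and in practice each flag would need its own lookup (server location, address detection, owner confirmation).

```python
from urllib.parse import urlparse

# Top-level domains the report lists as recognisably UK.
UK_TLDS = (".uk", ".scot", ".wales", ".cymru", ".london")


def in_scope(url, hosted_in_uk=False, has_uk_address=False, owner_confirms_uk=False):
    """Return True if a site meets at least one criterion named in the report.

    The three keyword flags are hypothetical stand-ins for checks that
    cannot be derived from the URL alone.
    """
    host = (urlparse(url).hostname or "").lower()
    # Criterion 1: the domain ends with a UK top-level domain.
    if host.endswith(UK_TLDS):
        return True
    # Criteria 2-4: UK-hosted server, UK address on the site, or owner confirmation.
    return hosted_in_uk or has_uk_address or owner_confirms_uk
```

For example, `in_scope("https://www.mygov.scot/")` is true on the domain alone, while a .com site is only in scope if one of the other flags is set.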


The project started in 2005, and at that time a website could be included in the archive only if its owner agreed to its preservation. The method of collection changed in 2013 following The Legal Deposit Libraries (Non-Print Works) Regulations 2013. The change in regulation meant that, from the roughly 20,000 websites collected under the permission method, UKWA moved to preserving millions of records, and the number grows daily, as does the amount of content created. The ephemeral contents of web pages are preserved on very tangible storage nodes in four different locations, which are in constant connection with each other, checking that the material is functional and accessible and, if needed, repairing it and backing it up in all locations. If you would like to know more about the technical aspects of the project, there is also an all-encompassing FAQ page.


There are different methods of collecting websites. An annual domain crawl automatically collects all the websites recognised as UK-based according to the criteria mentioned above, plus websites referred by users: you can refer any website you deem worth collecting to the UKWA using this form. News and social media websites are collected more regularly, as are websites whose content changes frequently and whose changes have an impact on the reader or on society at large, like the mygov.scot website.


Access depends on the method of collection: when the author of a website has given permission for preservation, the website is usually accessible through the link shown on the UKWA search results page. All other websites can be accessed only from one of the 9 terminals available in the participating legal deposit libraries. This is an unfortunate limitation, especially considering the current restrictions due to the pandemic.


The most interesting feature for exploring the huge corpus of the collection is the curated topics and themes section. This requires a lot of hands-on work by the curators of the collections, but the result gives users an easier way to approach various topics. Having had a look at the website after the webinar, I explored the collection on LGBTQ+ websites, as well as the one on the Scottish Independence Referendum. For the latter, all the material preserved in the archive would be completely lost if it were not saved on the UKWA servers. It gives a complete picture of the discourse on Scottish independence as it was happening on the internet during the referendum period. For the former, I want to note that websites reporting controversial opinions on the LGBTQ+ community are also saved, which will prove invaluable to researchers and historians looking back at how these topics were discussed. I thoroughly appreciate that, for the benefit of the user, the UKWA has added a disclaimer mentioning that the collection “may contain websites that may be considered transphobic or homophobic. It is a principle of [their] collection development policy, underpinned by Legal Deposit Regulations, to include, uncensored, everything that is published online in the UK for the benefit of future researchers.”


British Library reading room
Image source: www.bl.uk


In addition to presenting websites ‘as preserved’, UKWA has also made available a wealth of secondary datasets. To access these you will need to use a dedicated website, webarchive.org.uk/shine. There are two main tools that allow a researcher to carry out textual analysis on the 3.5 billion indexed items collected between 1996 and April 2013: a search function, which looks for words in the body of the website, and a trend function. Unfortunately, the search results are not sorted by relevance, so further filtering and more complex queries are needed to find what you are looking for. The trend function takes a more visual approach, showing how frequently a word or set of words appears in the collection of websites over a particular time frame. This has been interesting to use. Given the vastness of the collection, some searches take a while to complete.
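The idea behind a trend view of this kind can be sketched in a few lines. This is not the Shine tool's implementation, which the webinar did not describe: it assumes, for illustration only, that the trend is the percentage of documents in each year that contain the phrase. The function name `phrase_trend` and the sample data are hypothetical.

```python
from collections import defaultdict


def phrase_trend(documents, phrase):
    """Map each year to the percentage of that year's documents containing the phrase.

    `documents` is an iterable of (year, text) pairs; matching is a simple
    case-insensitive substring test, which is an assumption of this sketch.
    """
    totals = defaultdict(int)  # documents seen per year
    hits = defaultdict(int)    # documents per year that contain the phrase
    needle = phrase.lower()
    for year, text in documents:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


# Invented toy data, purely to show the shape of the input.
sample = [
    (2010, "Council announces library cuts"),
    (2010, "New reading group starts"),
    (2011, "Protest against library cuts"),
    (2011, "Library cuts confirmed"),
]
trend = phrase_trend(sample, "library cuts")
```

Normalising by the number of documents per year matters because the archive holds far more material for some years than for others, so raw counts alone would mostly reflect the size of each year's crawl.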


I tried a quick search for occurrences of the expressions “library cuts” and “library funding” in the body of text: it is interesting to note how usage of the former increased towards the end of the period available, with a peak in 2011, while the use of “library funding” decreased over time, reaching its minimum in 2013. Another search I ran looked at the frequency of the terms lesbian, gay, bisexual and transgender in the corpus. I was not surprised to see a peak of occurrences for the words lesbian and gay in 2000, while transgender and bisexual just lurk at the bottom of the graph with a very low percentage of occurrences; this gives an opportunity for reflection in terms of the different representation of people within the LGBTQ+ community. However, most of the occurrences come from amazon.co.uk, so there is certainly more exploration to be done on this topic, which falls outside the scope of this short blog post. These are just a few simple examples, but the possibilities for research are really endless! On top of this, it is also possible to filter searches by postcode, date or website, so just go and play with the tools if you are interested.


This project is already invaluable: through it the legal deposit libraries are preserving a potentially infinite production of online material that would otherwise be lost to the passage of time. Historians, researchers and people with different interests will all be able to consult the recorded websites to gain a better understanding and a clearer picture of many trends and societal changes as they were reported on the internet at the time.

