Module Five - Capture

Overview and Objectives

Overview:

The terms capturing, harvesting, and crawling are often used interchangeably to describe the acquisition process in web archiving. This module takes a deeper look at the process of acquiring content for a web archive, introduces you to some new terms, and gives you a chance to create what might be your first web capture.

This will build on concepts that you were introduced to in Module Four.

There are several readings, some online documentation to skim, and several PowerPoint presentations for you to review.

Objectives:

Become familiar with common capture-related terms such as seed, path, domain, and subdomain.
Understand where acquisition of content fits into the lifecycle of a web archive.
Create your first web capture in the Wayback Machine at the Internet Archive.
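
To make the terms in the first objective concrete, here is a minimal sketch that splits a seed URL into its host, path, domain, and subdomain using Python's standard library. The function name `describe_seed` and the example URL are made up for illustration. The domain guess simply takes the last two labels of the host, which is an assumption that fails for hosts such as `example.co.uk`; real crawlers rely on a public suffix list instead.

```python
from urllib.parse import urlsplit


def describe_seed(seed_url):
    """Break a seed URL into the parts used when talking about crawl scope."""
    parts = urlsplit(seed_url)
    host = parts.hostname or ""
    labels = host.split(".")
    # ASSUMPTION: the registered domain is the last two labels of the host.
    # This is a naive guess and is wrong for suffixes like "co.uk".
    domain_guess = ".".join(labels[-2:])
    subdomain_guess = ".".join(labels[:-2])
    return {
        "seed": seed_url,
        "host": host,
        "domain_guess": domain_guess,
        "subdomain_guess": subdomain_guess,
        "path": parts.path,
    }


if __name__ == "__main__":
    # Hypothetical seed URL, used only to show the output shape.
    for name, value in describe_seed("https://blog.example.org/2020/05/post.html").items():
        print(f"{name}: {value}")
```

In this sketch the seed is the full URL handed to the crawler, the host is `blog.example.org`, the guessed domain is `example.org`, the guessed subdomain is `blog`, and the path is everything after the host.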

Readings

Web Archiving

Archive-It Help Center, Glossary of Archive-It and Web Archiving Terms - https://support.archive-it.org/hc/en-us/articles/208111686-Glossary-of-Archive-It-and-Web-Archiving-Terms
- Review these terms and pay specific attention to Crawl, Crawler, Document, Domain, Host, Scope, Seed, and Sub-domain.
About /robots.txt - https://www.robotstxt.org/robotstxt.html
- Become familiar with the concepts discussed on this page.
- This is how website owners tell crawlers what to crawl or, more often, what not to crawl.
International Internet Preservation Consortium (2020). Session 3A: Main Concepts and Technologies: Capture
- Slides - https://netpreserve.org/download/iipc-training-session-beginners-3a-slides/
- Speaker Notes - https://netpreserve.org/download/iipc-training-session-beginners-3a-notes/
- Review these slides and speaker notes. I suggest reading the notes when you have the slides open in another part of the screen.
Mohr, G., Stack, M., Ranitovic, I., Avery, D., & Kimpton, M. (2004). An Introduction to Heritrix: An open source archival quality web crawler. International Web Archiving Workshop. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.676.6877&rep=rep1&type=pdf
- This article is a tiny bit dated but is the best reference about the beginnings of Heritrix that I could find.
Davis, R. (2011). Saving the Smithsonian's Web. https://siarchives.si.edu/blog/saving-smithsonians-web
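
As a companion to the robots.txt reading above, the following sketch shows how a polite crawler might check a robots.txt file before fetching a URL. It uses Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent, and the `example.org` URLs are all invented for illustration, and the helper name `allowed` is hypothetical. This is not how Heritrix implements the check, only a small demonstration of the concept.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.org/index.html",
        "https://example.org/private/report.html",
    ):
        print(url, "->", "allowed" if allowed(ROBOTS_TXT, "ExampleBot", url) else "disallowed")
```

Note that robots.txt is advisory: it states the site owner's wishes, and it is up to each crawler and each archiving program to decide how to honor them.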

Heritrix

Review the following links to learn more about the Heritrix Crawler.

https://github.com/internetarchive/heritrix3
https://github.com/internetarchive/heritrix3/wiki
https://netpreserveblog.wordpress.com/2019/02/19/a-new-release-of-heritrix-3/

Bonus Video Overview of Heritrix

Fisher, D. (2018). Heritrix Web Crawler. - https://www.youtube.com/watch?v=RmHG0MaFJSI
- This is a video from a peer of yours at Simmons in the School of Library & Information Science. I think they do a great job in the presentation overall.

Additional Optional Readings

Brunelle, J., Ferrante, K., Wilczek, E., Weigle, M., & Nelson, M. (2016). Leveraging Heritrix and the Wayback Machine on a Corporate Intranet: A Case Study on Improving Corporate Archives. D-Lib Magazine, 22(1/2). https://doi.org/10.1045/january2016-brunelle

Archiving Exercise

Web Archiving Exercise - Saving a Webpage at the Internet Archive

This week we are going to save our first webpage in a full web archiving infrastructure.

Again, we turn to the Internet Archive and their Wayback Machine.

If you navigate to https://web.archive.org/ , you will see the "Save Page Now" box on the bottom right side of the page.

This will link you to the "Save Page Now" interface - https://web.archive.org/save

This week we are going to select a webpage and save it to the Internet Archive's collection.

When picking your page to archive, keep in mind that you will be describing and linking to it in this week's discussion post.

One thing to note: if you register for an Internet Archive account and sign in, you will have more options than you will without an account. For this exercise you do not need an account.

Select a webpage you want to archive, add its URL to the box, leave "Save error pages (HTTP Status=4xx, 5xx)" checked, and click "Save Page".

After you click the "Save Page" button leave the page open so that you can see what is happening.

In the discussion this week you will write a paragraph about what you see happening. How many files did you end up collecting that were included in that page? Share any observations you have about the process, and ask any questions you might have. What are the potential uses of this kind of "micro" web archiving service?

You will need to link to the specific capture you initiated in the Wayback Machine.
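
When you copy the link to your capture, it helps to recognize its shape: a Wayback Machine capture URL is the prefix `https://web.archive.org/web/`, a 14-digit timestamp (YYYYMMDDhhmmss, in UTC), and then the original URL. The sketch below assembles a URL of that form. The function name `wayback_capture_url`, the example page, and the example date are hypothetical. The real timestamp is assigned by the Wayback Machine when it saves your page, so use the link it gives you rather than building one yourself.

```python
from datetime import datetime, timezone


def wayback_capture_url(original_url, captured_at):
    """Build a Wayback-style capture URL from a page URL and an aware datetime."""
    # Capture timestamps are written as 14 digits in UTC: YYYYMMDDhhmmss.
    timestamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical capture time and page, used only to show the URL shape.
    when = datetime(2021, 3, 4, 5, 6, 7, tzinfo=timezone.utc)
    print(wayback_capture_url("https://example.org/", when))
```

A link of this form points at one specific capture, which is what the discussion post asks for, rather than at the live page or at the calendar of all captures.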

Exploring Web Archives

Each week we will try to learn about a new web archive, a web archiving tool, or a web archiving service. The goal is to get an introduction to what is happening in the web archiving space, what is being collected, and who is collecting it.

This week we will look at the web archives at the UNT Libraries.

Web Archives

UNT Web Archives (UNT Digital Library Interface) - https://digital.library.unt.edu/explore/collections/UNTWEB/browse/?sort=date_d

UNT Libraries' Web Archives (Wayback Interface) - https://webarchive.library.unt.edu/

CyberCemetery - https://cybercemetery.unt.edu

Cathy Hartman and CyberCemetery - https://www.digitalpreservation.gov/series/pioneers/hartman.html
Hartman, C. N., Hastings, S. K., & Alemneh, D. G. (2004). The Cybercemetery: Prolonging Usable Afterlife. IS&T--the Society for Imaging Science and Technology. https://digital.library.unt.edu/ark:/67531/metadc29310/

UNT Libraries' Archive-It Collections - https://archive-it.org/organizations/1181

Primarily focused on Special Collections

Grant Projects Related to Web Archives

National Digital Information Infrastructure and Preservation Program: Web-at-Risk (2005-2008)

Web-at-Risk: Preserving Our Nation's Cultural Heritage - UNT Digital Library
Seneca, T. (2009). The Web-at-Risk at Three: Overview of an NDIIPP Web Archiving Project. Library Trends, 57(3), 427-441. http://hdl.handle.net/2142/13606

Extending Collection Development Practices to Web Archives (EOTCD) (2009-2013)

Hartman, C. N., Murray, K. R., & Phillips, M. E. (2013). Classification of the End-of-Term Archive: Extending Collection Development Practices to Web Archives. https://digital.library.unt.edu/ark:/67531/metadc152437/
Murray, K. R., & Hartman, C. N. (2012). Classifying the End-of-Term Archive. IS&T--the Society for Imaging Science and Technology Archiving Conference, 2012, Copenhagen, Denmark. https://digital.library.unt.edu/ark:/67531/metadc93305/
Phillips, M. E., & Murray, K. R. (2013). Improving Access to Web Archives through Innovative Analysis of PDF Content. IS&T--the Society for Imaging Science and Technology Archiving Conference, 2013, Washington, D.C., United States. https://digital.library.unt.edu/ark:/67531/metadc155622/

Programmatic Extraction of 'Documents' from Web Archives (2017-2020)

Phillips, M. E., & Caragea, C. (2017). Programmatic Extraction of 'Documents' from Web Archives. https://www.imls.gov/grants/awarded/lg-71-17-0202-17
Fox, N. T., Phillips, M. E., & Tarver, H. (2020). Programmatic Extraction of 'Documents' from Web Archives: Identifying Document Characteristics from Content Selector Interviews. https://digital.library.unt.edu/ark:/67531/metadc1757659/
Patel, K., Caragea, C., Phillips, M. E., & Fox, N. (2020). Identifying Documents In-Scope of a Collection from Web Archives. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 167-176. https://doi.org/10.1145/3383583.3398540
- arXiv version - https://arxiv.org/abs/2009.00611

Discussion

Discussion Post:

In at least one paragraph, discuss what you learned about the capture technologies involved in the web archiving space. What were some of the terms that were new to you this week? What are some things that still need clarity for you?

In at least one paragraph, describe what happened when you used the Wayback Machine's "Save Page Now" mechanism. What webpage did you choose to save? Link to your capture of the webpage in the Wayback Machine. How many files did you end up collecting that were included in that page? Share any observations you have about the process, and ask any questions you might have. What are the potential uses of this kind of "micro" web archiving service?

Finally, in at least one paragraph, discuss the web archive or archived website that you reviewed this week from the UNT Libraries. Were you surprised by anything you found in the websites? Are there things that you would have expected that you didn't see? Discuss any of the related projects or grants that you explored as well.

Class Engagement:

After you have made the discussion post described above, take the time to respond to, comment on, or otherwise engage with at least two of your classmates' posts.

If there are any unanswered questions, feel free to offer an answer or suggestion to the original poster. Did they mention something that made you investigate something further? If so, what was it?
