Module Five - Capture

Overview and Objectives

Overview:

The terms capturing, harvesting, and crawling are often used interchangeably to describe the acquisition process in web archiving. This module takes a deeper look at the process of acquiring content for a web archive, introduces you to some new terms, and gives you a chance to create what might be your first web capture.

This will build on concepts that you were introduced to in Module Four.

There are several readings, some online documentation to skim, and several PowerPoint presentations to review.

Objectives:

  1. Become familiar with common capture-related terms such as seed, path, domain, and subdomain.
  2. Understand where acquisition of content fits into the lifecycle of a web archive.
  3. Create your first web capture in the Wayback Machine at the Internet Archive.
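
To make the terms in the first objective concrete, here is a minimal Python sketch that splits a seed URL into its host, domain, subdomain, and path. The function name `describe_seed` is hypothetical, and the split assumes the registered domain is simply the last two labels of the host name. That assumption holds for unt.edu but not for suffixes such as .co.uk, so treat this as an illustration rather than how a crawler actually decides scope.

```python
from urllib.parse import urlsplit


def describe_seed(seed_url):
    """Break a seed URL into the pieces named in the objectives.

    Assumption: the domain is the last two labels of the host name.
    This is a simplification and is wrong for suffixes like .co.uk.
    """
    parts = urlsplit(seed_url)
    host = parts.hostname or ""
    labels = host.split(".")
    return {
        "seed": seed_url,                    # the URL a crawl starts from
        "host": host,                        # full host name
        "domain": ".".join(labels[-2:]),     # e.g. unt.edu
        "subdomain": ".".join(labels[:-2]),  # everything left of the domain
        "path": parts.path,                  # location within the site
    }


info = describe_seed(
    "https://digital.library.unt.edu/explore/collections/UNTWEB/browse/?sort=date_d"
)
for name, value in info.items():
    print(f"{name}: {value}")
```

The seed here is one of the UNT links from later in this module. A crawl scoped to the domain would accept any host ending in unt.edu, while a crawl scoped to the path would stay under /explore/collections/UNTWEB/browse/.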

Readings

Web Archiving

Heritrix

Review the following links to learn more about the Heritrix Crawler.

Bonus Video Overview of Heritrix

  • Fisher, D. (2018). Heritrix Web Crawler. - https://www.youtube.com/watch?v=RmHG0MaFJSI
    • This is a video from a peer of yours at Simmons in the School of Library & Information Science. I think they do a great job in the presentation overall.

Additional Optional Readings

  • Brunelle, J., Ferrante, K., Wilczek, E., Weigle, M., & Nelson, M. (2016). Leveraging Heritrix and the Wayback Machine on a Corporate Intranet: A Case Study on Improving Corporate Archives. D-Lib Magazine, 22(1/2). https://doi.org/10.1045/january2016-brunelle

Archiving Exercise

Web Archiving Exercise - Saving a Webpage at the Internet Archive

This week we are going to save our first webpage in a full web archiving infrastructure.

Again we turn to the Internet Archive and their Wayback Machine.

If you navigate to https://web.archive.org/, you will see the "Save Page Now" box in the bottom right of the page.

This will link you to the "Save Page Now" interface - https://web.archive.org/save

This week we are going to select a webpage and save it to the Internet Archive's collection.

When picking your page to archive, keep in mind that you will be describing and linking to it in this week's discussion post.

Note that if you register for an Internet Archive account and sign in, you will have more options than you will without an account. For this exercise you do not need an account.

Select a webpage you want to archive, add its URL to the box, leave "Save error pages (HTTP Status=4xx, 5xx)" checked, and click "Save Page".

After you click the "Save Page" button, leave the page open so that you can see what is happening.

In the discussion this week you will write a paragraph about what you see happening. How many files were collected as part of that page? Share any observations you have about the process, and ask any questions you might have. What are the potential uses of this kind of "micro" web archiving service?

You will need to link to the specific capture you initiated in the Wayback Machine.
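
A link to a specific capture generally has the form https://web.archive.org/web/ followed by a 14-digit timestamp (YYYYMMDDhhmmss), a slash, and the original URL. The sketch below checks that a link points at one capture rather than at the calendar view, and pulls out the capture time. The function name `parse_capture_link` is hypothetical, and the example link uses a made-up timestamp for illustration, not a capture that is known to exist.

```python
import re
from datetime import datetime

# Matches links of the form /web/<14 digits>/<original URL>.
CAPTURE_PATTERN = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_capture_link(link):
    """Return (capture time, original URL), or None if the link is not
    a link to one specific capture."""
    match = CAPTURE_PATTERN.match(link)
    if match is None:
        return None
    timestamp, original_url = match.groups()
    try:
        captured_at = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    except ValueError:
        # Fourteen digits that do not form a valid date and time.
        return None
    return captured_at, original_url


# Hypothetical capture link; the timestamp is invented for this example.
example = "https://web.archive.org/web/20240101120000/https://www.unt.edu/"
print(parse_capture_link(example))

# The calendar view uses * in place of a timestamp, so it is rejected.
print(parse_capture_link("https://web.archive.org/web/*/https://www.unt.edu/"))
```

If the link you plan to post returns `None` here, it is probably pointing at the calendar view or the live site rather than at your capture.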

Exploring Web Archives

Each week we will try to learn about a new web archive, a web archiving tool, or a web archiving service. The goal is to get an introduction to what is happening in the web archiving space, what is being collected, and who is collecting it.

This week we will look at the web archives at the UNT Libraries.

Web Archives

UNT Web Archives (UNT Digital Library Interface) - https://digital.library.unt.edu/explore/collections/UNTWEB/browse/?sort=date_d

UNT Libraries' Web Archives (Wayback Interface) - https://webarchive.library.unt.edu/

CyberCemetery - https://cybercemetery.unt.edu

UNT Libraries' Archive-It Collections - https://archive-it.org/organizations/1181

  • Primarily focused on Special Collections

Grant Project related to Web Archives

National Digital Information Infrastructure and Preservation Program: Web-at-Risk (2005-2008)

  • Web-at-Risk: Preserving Our Nation's Cultural Heritage - UNT Digital Library
  • Seneca, T. (2009). The Web-at-Risk at Three: Overview of an NDIIPP Web Archiving Project. Library Trends, 57(3), 427-441. http://hdl.handle.net/2142/13606

Expanding Collection Development Practices to Web Archives (EOTCD) (2009-2013)

Programmatic Extraction of 'Documents' from Web Archives (2017-2020)

Discussion

Discussion Post:

In at least one paragraph, discuss what you learned about the capture technologies involved in the web archiving space. What were some of the terms that were new to you this week? What are some things that still need clarity for you?

In at least one paragraph, describe what happened when you used the Wayback Machine's "Save Page Now" mechanism. What webpage did you choose to save? Link to your capture of the website in the Wayback Machine. How many files were collected as part of that page? Share any observations you have about the process, and ask any questions you might have. What are the potential uses of this kind of "micro" web archiving service?

Finally, in at least one paragraph, discuss the web archive or archived website that you reviewed this week from the UNT Libraries. Were you surprised by anything you found in the websites? Are there things that you would have expected that you didn't see? Discuss any of the related projects or grants that you explored as well.

Class Engagement:

After you have made the discussion post described above, take the time to respond to, comment on, or engage with at least two of your classmates' posts.

If there are any unanswered questions, feel free to offer an answer or suggestion to the original poster. Did they mention something that made you investigate further? If so, what was it?