Thursday, February 13, 2025

Web Archiving Guide for Librarians & Patrons | How to Preserve Websites & Data

 Learn how to archive websites and preserve digital content with this detailed guide for librarians and patrons. Discover the importance of web archiving, step-by-step instructions for using the Wayback Machine and other tools, and how you can participate in saving valuable online data.



Guide to Web Archiving for Librarians and Patrons

1. Introduction

What is Web Archiving?
Web archiving is the practice of collecting, preserving, and managing web-based information so that it remains accessible to future researchers, historians, journalists, and citizens. It ensures that websites, online documents, data, and other content are not lost when a site is removed, altered, or taken offline.

Why Web Archiving Matters

  • Preserving Public Records: Government agencies and other organizations publish essential data online. Without web archiving, historical records disappear if those agencies remove or change pages.
  • Accountability & Transparency: Journalists, researchers, and the public can track changes over time, ensuring that data is not quietly rewritten or purged.
  • Historical Research: Future scholars rely on these archives to understand past events, policies, and social conditions.
  • Public Access: Web archives allow anyone to view old versions of websites without special permissions or proprietary software.

2. Why It Is Important to Archive Websites

  1. Government Data & Public Policy

    • Federal websites (e.g., https://www.census.gov/ for Census data) can change or remove information without notice. Archiving helps preserve everything from demographic data to environmental statistics.
  2. Local Government & Community Information

    • County or city portals often host meeting minutes, budget documents, and other records. These can disappear if a site is redesigned or if new administrators decide to remove them.
  3. Research & Academic Integrity

    • Universities and research labs frequently post datasets and study results. If grants change or departments merge, these pages can vanish. An archive keeps these resources alive for long-term study.
  4. Journalistic & Investigative Purposes

    • Investigative reporters use historical snapshots of web pages to compare past statements or track the history of government agencies, corporations, or organizations.
  5. Cultural & Social Heritage

    • The internet captures our modern culture: memes, social movements, and community-driven projects. If these records are preserved, future generations can learn from them.

3. How You Can Participate in Web Archiving

  1. Submit URLs to the Internet Archive

    • Most people can help by saving pages on the Wayback Machine. Section 4.1 below provides more details.
  2. Identify Vulnerable Content

    • Look for data sets or web pages that might be at risk (e.g., government sites and project pages from local or smaller agencies that lack robust preservation plans).
  3. Join Data Rescue Efforts

    • Data Liberation Project: Follow announcements and sign up to help identify and archive threatened data.
    • Data Rescue Project: Look for local “Data Rescue” events or join the broader online community to help find, download, and store critical information.
  4. File FOIA Requests

    • If you suspect data has already been removed, you can use MuckRock to file Freedom of Information Act (FOIA) requests.
  5. Volunteer Technical Skills

    • If you have programming or data management skills, you might help with specialized web crawling or bulk data downloads. Projects like Big Local News and the End of Term Archive often welcome coders, data analysts, and other volunteers.
  6. Spread the Word

    • Encourage your community to save pages of interest. The more people who know how to archive, the less likely vital data will vanish.

4. Key Tools & Platforms

4.1 Internet Archive’s Wayback Machine

  • URL: https://archive.org/web
  • Purpose: Captures website snapshots (“crawls”) for long-term preservation and public access.
  • How to Save a Webpage:
    1. Go to https://archive.org/web.
    2. In the “Save Page Now” box, paste the URL of the page you want to archive.
    3. Click SAVE PAGE.
    4. Wait for the snapshot to process; the Wayback Machine will give you a permanent archived link.
  • Submitting Lists in Bulk
    • Use the Wayback Machine’s Google Sheets submission tool for large batches of URLs. Create a spreadsheet of URLs, then submit them all at once.
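
For readers comfortable with a little scripting, the manual "Save Page Now" steps above can be sketched in Python. This is a minimal illustration, not an official client: it assumes the public address https://web.archive.org/save/ accepts a page URL appended to it, and the function names and User-Agent string are made up for this example. Rate limits and responses may change, so treat it as a starting point.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: appending a page URL to this address requests a capture.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_url(page_url: str) -> str:
    """Return the Save Page Now address for a full http(s) URL."""
    page_url = page_url.strip()
    if not page_url.startswith(("http://", "https://")):
        raise ValueError(f"Expected a full http(s) URL, got: {page_url!r}")
    # Leave normal URL punctuation alone; escape characters such as spaces.
    return SAVE_ENDPOINT + quote(page_url, safe=":/?&=%+,;@~")


def submit_capture(page_url: str, timeout: float = 60.0) -> str:
    """Request a capture and return the final URL the server sent us to.

    This performs a real network request, so call it sparingly.
    """
    request = Request(
        build_save_url(page_url),
        headers={"User-Agent": "library-archiving-demo"},  # hypothetical label
    )
    with urlopen(request, timeout=timeout) as response:
        return response.geturl()


if __name__ == "__main__":
    # Only builds the address; no request is sent here.
    print(build_save_url("https://www.census.gov/"))
```

Calling submit_capture("https://www.census.gov/") would send the actual request. For more than a handful of pages, the Google Sheets tool mentioned above is likely the gentler option.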

4.2 End of Term (EOT) Archive

4.3 Big Local News

  • URL: https://biglocalnews.org
  • Purpose: Helps local newsrooms collect and analyze public data.
  • Participation:
    • Contact Big Local News if you have local datasets or want to volunteer data analysis skills.

4.4 Data Liberation Project & MuckRock

  • Data Liberation Project

  • MuckRock

    • URL: https://www.muckrock.com
    • Overview: Non-profit collaborative news site that facilitates FOIA requests and hosts a massive repository of government documents.
    • DocumentCloud: Part of the MuckRock Foundation; used by thousands of newsrooms to organize, annotate, and publish primary source documents.
    • How to Participate:
      • Suggest FOIA requests for missing or altered data sets.
      • Check out MuckRock’s training on transparency and investigative journalism.

4.5 Library Innovation Lab

  • URL: https://lil.law.harvard.edu/
  • Overview: A software and design lab at the Harvard Law School Library dedicated to building open knowledge projects.
  • Director: Jack Cushman.

5. Step-by-Step: Teaching Patrons How to Archive a Website

  1. Identify the Website

    • Encourage patrons to choose a page containing potentially at-risk info—e.g., local government meeting minutes, federal datasets, or specialized research.
  2. Use the Wayback Machine

    • Visit https://archive.org/web.
    • Paste the URL into the “Save Page Now” field.
    • Click SAVE PAGE to capture a snapshot.
  3. Verify the Snapshot

    • Once archived, verify the page’s text, images, and download links (if any) are captured. Some dynamic content might not be fully captured; advanced tools can help.
  4. Document the Archive

    • Store the archived URL in a shared spreadsheet or library resource guide. Record the date, the original URL, and the archived link.
  5. Contribute to Collaborative Efforts

    • If relevant, share the archived URL with the End of Term Archive or the Data Rescue Project (or a similar initiative). The content then becomes part of a larger collection and can be coordinated with projects like Big Local News.
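
The "Document the Archive" step can also be kept as a simple CSV file instead of (or alongside) a shared spreadsheet. The sketch below records the three fields named in that step: the date, the original URL, and the archived link. The column names, function names, and the sample archived link are illustrative assumptions, not a required format.

```python
import csv
import io
from datetime import date

# Hypothetical column names matching the fields listed in step 4.
FIELDNAMES = ["date_archived", "original_url", "archived_url"]


def start_log(handle) -> None:
    """Write the header row to a new log file."""
    csv.DictWriter(handle, fieldnames=FIELDNAMES).writeheader()


def append_record(handle, original_url, archived_url, archived_on=None):
    """Write one row to the log and return it as a dict."""
    row = {
        "date_archived": (archived_on or date.today()).isoformat(),
        "original_url": original_url,
        "archived_url": archived_url,
    }
    csv.DictWriter(handle, fieldnames=FIELDNAMES).writerow(row)
    return row


if __name__ == "__main__":
    # An in-memory buffer stands in for a real file opened with newline="".
    buffer = io.StringIO()
    start_log(buffer)
    append_record(
        buffer,
        "https://www.census.gov/",
        # Made-up example link, shown only to illustrate the layout.
        "https://web.archive.org/web/20250213000000/https://www.census.gov/",
        archived_on=date(2025, 2, 13),
    )
    print(buffer.getvalue())
```

A CSV file like this opens directly in Google Sheets or Excel, so it can be merged into a shared spreadsheet later.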


6. Tips for Going Beyond Basic Archiving

  • Bulk Archiving: Use Wayback Machine’s Google Sheets tool or specialized crawlers like Webrecorder (for interactive pages).
  • Local Data Preservation: Encourage patrons to check local municipality or county sites. Sometimes, local data is even more vulnerable to loss than federal data.
  • Collaborate With Other Institutions: Universities, public libraries, and historical societies often have digital preservation or IT departments that can help manage large-scale archiving.
  • Digital Tools & Scripting: Patrons with coding skills may explore open-source crawling tools like ArchiveBot or Heritrix.
  • Advocacy & Policy: Teach patrons that archiving is also about awareness—encourage them to support policies that require better government transparency and data retention.
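
As a small starting point for the scripting tip above, a patron could check whether a page already has a capture before submitting it. The sketch below assumes the Internet Archive's availability lookup at https://archive.org/wayback/available returns JSON with an "archived_snapshots" object containing a "closest" entry; the function names are hypothetical, and the response layout should be confirmed against the current documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: this lookup address and its JSON layout match current behavior.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url: str) -> str:
    """Return the lookup address for a single page URL."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(payload: dict):
    """Return the closest snapshot link from a parsed response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest", {})
    if closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timeout: float = 30.0):
    """Query the service over the network and return a snapshot link or None."""
    with urlopen(build_availability_url(page_url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))


if __name__ == "__main__":
    # Only builds the address; lookup() would send the real request.
    print(build_availability_url("https://www.census.gov/"))
```

A result of None from lookup() suggests the page has no capture yet and may be worth saving. A returned link should still be opened and checked by hand, since dynamic content may be missing.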

7. Recommended Links and Resources

Below are the direct links (all publicly available) from presenters and attendees mentioned in the session notes:

Subscription/Contact Links:


8. Key People & Their Roles

  • Mark Graham

    • Role: Director of the Wayback Machine at the Internet Archive
    • Focus: Archiving the web daily, ensuring it’s publicly accessible.
    • URL: https://archive.org/web
  • Sarah Cohen

    • Role: Works with Big Local News, trains local journalists
    • Focus: Data analysis for investigative stories
    • URL: https://biglocalnews.org
  • Jack Cushman

    • Role: Director, Library Innovation Lab at Harvard Law School
    • Focus: Merging library principles with software, design, and legal innovation
    • URL: https://lil.law.harvard.edu/
  • Lynda Kellam, PhD

    • Role: Secretary of IASSIST, longtime academic data librarian
    • Focus: Data management, government information, and stewardship
    • Related Projects: https://iassistdata.org/
  • Michael Morisy

    • Role: Chief Executive Officer at MuckRock
    • Focus: FOIA requests, transparency, investigative journalism training
    • URL: https://www.muckrock.com

9. Frequently Asked Questions (FAQs)

  1. How can non-US patrons or institutions help?

    • Submit URLs to https://archive.org/web, share local or international data sets, and mirror archives if you have the server capacity.
  2. What if the page has interactive elements like maps or tools?

    • The basic Wayback Machine capture might not include dynamic content. Consider using tools like Webrecorder to capture interactive sessions.
  3. How can I find which datasets need archiving the most?

    • You can check the Data Rescue Tracker (or partner sites), which often lists priority datasets. You can also ask in the Data Liberation Project Slack or MuckRock communities.
  4. Is it legal to archive any webpage?

    • Generally, capturing publicly available web pages for preservation is considered fair use or library/archive practice in many jurisdictions. If you’re unsure, consult your library’s legal guidelines or resources.
  5. Is there a single, comprehensive list of everything being archived?

    • No single list exists because many groups run parallel efforts. However, the Internet Archive is the largest aggregator. The End of Term Archive focuses on federal government websites.
  6. How can we preserve large datasets like Census data?

  7. What if I have hard drive space to donate?

    • You can contact the Data Rescue Project or the Internet Archive to see if they accept mirrored data. Some projects prefer distributed backups.

10. Workshop/Session Notes

  • Keep It Interactive: Encourage patrons to try saving a webpage themselves during your session.
  • Highlight Collaboration: Show them how to share archived URLs with others or how to add them to a public spreadsheet.
  • Question & Answer Time: Collect questions in a shared document (e.g., Google Doc) so everyone can benefit from the discussion.
  • Follow-Up: After the session, provide a read-only version of your collaborative document with all resources, archived links, and Q&A for future reference.

Conclusion

Web archiving is a powerful way to protect our collective digital heritage. By learning to capture at-risk websites and data, librarians and patrons can ensure vital information remains accessible to researchers, journalists, and the public for decades. Remember:

Every URL you save helps preserve the historical record.
