
The Wayback Machine Scraper
This repository contains wayback-machine-scraper, a command-line utility for scraping or downloading website data as it appears in archive.org's Wayback Machine.
It crawls through historical snapshots of a website and saves the snapshots to disk.
This can be useful when you're trying to scrape a site whose anti-scraping measures make direct crawling impossible or prohibitively slow.
It's also useful if you want to scrape a website as it appeared at some point in the past or to scrape information that changes over time.
The command-line utility is highly configurable in terms of what it scrapes, but it only saves the raw, unparsed content of the pages on the site.
If you're interested in parsing data from the pages that are crawled then you might want to check out scrapy-wayback-machine instead.
It's a downloader middleware that handles all of the tricky parts and passes normal response objects to your Scrapy spiders with archive timestamp information attached.
The middleware is very unobtrusive and should work seamlessly with existing Scrapy middlewares, extensions, and spiders.
It's what wayback-machine-scraper uses behind the scenes and it offers more flexibility for advanced use cases.
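For readers who want to try the middleware route, a minimal sketch of wiring it into a Scrapy project's settings.py is shown below. The middleware class path, priority value, and time-range setting name are assumptions based on the package name; verify them against the scrapy-wayback-machine documentation before relying on them.

```python
# settings.py -- a minimal sketch of enabling the middleware in a Scrapy project.
# The class path, priority, and setting name below are assumptions; check the
# scrapy-wayback-machine documentation for the exact values.

DOWNLOADER_MIDDLEWARES = {
    # Low priority number so snapshot requests are rewritten early.
    'scrapy_wayback_machine.WaybackMachineMiddleware': 5,
}

# Only crawl snapshots taken within this range (YYYYMMDD-style timestamps).
WAYBACK_MACHINE_TIME_RANGE = (20080101, 20090101)
```

With settings like these, a spider continues to receive ordinary response objects for each archived snapshot, with the archive timestamp information attached as described above.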
Installation
The package can be installed using pip.
pip install wayback-machine-scraper
Command-Line Interface
Writing a custom Scrapy spider and using the WaybackMachine middleware is the preferred way to use this project, but a command-line interface for basic mirroring is also included.
The usage information can be printed by running wayback-machine-scraper -h.
usage: wayback-machine-scraper [-h] [-o DIRECTORY] [-f TIMESTAMP]