
The Wayback Machine Scraper
This repository contains wayback-machine-scraper, a command-line utility for scraping or downloading website data as it appears in archive.org's Wayback Machine.
It crawls through historical snapshots of a website and saves the snapshots to disk.
This can be useful when you're trying to scrape a site whose anti-scraping measures make direct crawling impossible or prohibitively slow.
It's also useful if you want to scrape a website as it appeared at some point in the past or to scrape information that changes over time.
The command-line utility is highly configurable in terms of what it scrapes, but it only saves the raw, unparsed content of the pages on the site.
If you're interested in parsing data from the pages that are crawled then you might want to check out scrapy-wayback-machine instead.
It's a downloader middleware that handles all of the tricky parts and passes normal response objects to your Scrapy spiders with archive timestamp information attached.
The middleware is very unobtrusive and should work seamlessly with existing Scrapy middlewares, extensions, and spiders.
It's what wayback-machine-scraper uses behind the scenes and it offers more flexibility for advanced use cases.
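For readers who want to try the middleware route, a minimal sketch of wiring it into a Scrapy project's settings.py is shown below. The middleware class path, priority value, and time-range setting name are assumptions based on the package name; verify them against the scrapy-wayback-machine documentation before relying on them.

```python
# settings.py -- a minimal sketch of enabling the middleware in a Scrapy project.
# The class path, priority, and setting name below are assumptions; check the
# scrapy-wayback-machine documentation for the exact values.

DOWNLOADER_MIDDLEWARES = {
    # Low priority number so snapshot requests are rewritten early.
    'scrapy_wayback_machine.WaybackMachineMiddleware': 5,
}

# Only crawl snapshots taken within this range (YYYYMMDD-style timestamps).
WAYBACK_MACHINE_TIME_RANGE = (20080101, 20090101)
```

With settings like these, a spider continues to receive ordinary response objects for each archived snapshot, with the archive timestamp information attached as described above.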
Installation
The package can be installed using pip.
pip install wayback-machine-scraper
Command-Line Interface
Writing a custom Scrapy spider and using the WaybackMachine middleware is the preferred way to use this project, but a command-line interface for basic mirroring is also included.
The usage information can be printed by running wayback-machine-scraper -h.
usage: wayback-machine-scraper [-h] [-o DIRECTORY] [-f TIMESTAMP]