Scrapy Webarchive
scrapy-webarchive
is a plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.
Features
- Save web crawls in WACZ format (multiple storages supported; local and cloud).
- Crawl against WACZ format archives.
- Integrate seamlessly with Scrapy’s spider request and response cycle.
Limitations
- WACZ supports saving images but this module does not yet integrate with Scrapy's image/file pipeline for retrieving images/files from the WACZ. Future support for this feature is planned.
Source Code: https://github.com/q-m/scrapy-webarchive
Credits
This package started as a fork of https://github.com/internetarchive/scrapy-warcio. The idea of turning its functionality into an extension as well as actually writing the WARC files is based on it.