Scrapy Webarchive

scrapy-webarchive is a plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.

Features

Save web crawls in WACZ format, with multiple storage backends supported (local and cloud).
Crawl against WACZ format archives.
Integrate seamlessly with Scrapy’s spider request and response cycle.
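
To make the WACZ format mentioned above more concrete: a WACZ file is a ZIP package that bundles WARC data with an index and metadata. The sketch below uses only the Python standard library, not scrapy-webarchive's own API. It builds a tiny placeholder package following the directory layout described in the WACZ specification and lists its WARC members. The member file names and the helper functions (`build_placeholder_wacz`, `list_warc_members`) are hypothetical and chosen for illustration, and the members hold placeholder bytes rather than real records.

```python
import io
import json
import zipfile


def build_placeholder_wacz() -> bytes:
    """Build an in-memory ZIP laid out like a WACZ package (placeholder data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # datapackage.json describes the package; the content here is a stub.
        archive.writestr(
            "datapackage.json",
            json.dumps({"profile": "data-package", "resources": []}),
        )
        # WARC data lives under archive/, indexes under indexes/,
        # and the page list under pages/. File names are hypothetical.
        archive.writestr("archive/example.warc.gz", b"placeholder")
        archive.writestr("indexes/index.cdx", b"")
        archive.writestr("pages/pages.jsonl", b"")
    return buffer.getvalue()


def list_warc_members(wacz_bytes: bytes) -> list[str]:
    """Return the names of WARC files stored in a WACZ-style ZIP package."""
    with zipfile.ZipFile(io.BytesIO(wacz_bytes)) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith("archive/") and name.endswith((".warc", ".warc.gz"))
        ]


if __name__ == "__main__":
    print(list_warc_members(build_placeholder_wacz()))
```

Because a WACZ is an ordinary ZIP file, any exported archive can be inspected with generic tooling in the same way, independent of the storage it was written to.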

Limitations

WACZ supports saving images, but this module does not yet integrate with Scrapy's image/file pipeline, so images and files cannot currently be retrieved from a WACZ through it. Support for this feature is planned.

Source Code: https://github.com/q-m/scrapy-webarchive

Credits

This package started as a fork of https://github.com/internetarchive/scrapy-warcio. The idea of turning its functionality into an extension, as well as the approach to actually writing the WARC files, is based on that project.