Usage

Exporting

Exporting a WACZ archive

To archive the requests and responses of a crawl job, enable the WaczExporter extension:

EXTENSIONS = {
    "scrapy_webarchive.extensions.WaczExporter": 543,
}

This extension also requires you to set the export location using the SW_EXPORT_URI setting:

SW_EXPORT_URI = "s3://scrapy-webarchive/"

Running a crawl job with these settings creates a new WACZ file at the export location.
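
To check what an export contains, it can help to list the members of the resulting file. The sketch below relies only on the fact that a WACZ file is a ZIP archive and uses Python's standard zipfile module. The helper name list_wacz_entries and the member names in the stand-in archive are assumptions for illustration; they follow the general WACZ layout and are not a statement about the exact files the exporter writes.

```python
import io
import zipfile


def list_wacz_entries(wacz_file):
    """Return the sorted member names of a WACZ (ZIP) archive.

    `wacz_file` may be a local path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz_file) as archive:
        return sorted(archive.namelist())


# Build a tiny stand-in archive in memory so the sketch runs without a crawl.
# The member names are illustrative placeholders, not real exporter output.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("datapackage.json", "{}")
    archive.writestr("archive/data.warc.gz", b"")
    archive.writestr("indexes/index.cdx", "")
buffer.seek(0)

entries = list_wacz_entries(buffer)
for name in entries:
    print(name)
```

For an export written to remote storage such as S3, the file would first need to be downloaded; list_wacz_entries("archive.wacz") then works on the local copy.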

Crawling

There are two ways to crawl against a WACZ archive. Choose the strategy that fits your crawl job and follow the corresponding instructions below.

Lookup in a WACZ archive

One way to crawl against a WACZ archive is to use the WaczMiddleware downloader middleware. Instead of fetching the live resource, the middleware retrieves it from the archive and recreates a response using the archived data.

To use the downloader middleware, enable it in the settings like so:

DOWNLOADER_MIDDLEWARES = {
    "scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}

Then define the location of the WACZ archive with the SW_WACZ_SOURCE_URI setting:

SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True

Iterating a WACZ archive

When enabled, the WaczCrawlMiddleware spider middleware bypasses the default behaviour of the spider and replaces the crawl with an iteration over all the entries in the WACZ archive index. As with the previous strategy, each response is then recreated using the data from the archive.

To use this strategy, enable both middlewares in the spider settings like so:

DOWNLOADER_MIDDLEWARES = {
    "scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
    "scrapy_webarchive.spidermiddlewares.WaczCrawlMiddleware": 543,
}

Then define the location of the WACZ archive with the SW_WACZ_SOURCE_URI setting and set SW_WACZ_CRAWL to True:

SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
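
Since the two strategies differ only in whether the spider middleware is enabled, the settings can be built from one place. The following is a minimal sketch, not part of scrapy-webarchive: the helper name wacz_crawl_settings and its iterate parameter are hypothetical, and the setting names and values simply mirror the snippets shown above.

```python
def wacz_crawl_settings(source_uri, iterate=False):
    """Build a settings dict for crawling against a WACZ archive.

    Hypothetical helper: it only assembles the settings shown in this
    section and calls nothing in scrapy-webarchive itself.
    """
    settings = {
        # Both strategies recreate responses through the downloader middleware.
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
        },
        "SW_WACZ_SOURCE_URI": source_uri,
        # Mirrors the value used in both snippets in this section.
        "SW_WACZ_CRAWL": True,
    }
    if iterate:
        # Only the iteration strategy adds the spider middleware.
        settings["SPIDER_MIDDLEWARES"] = {
            "scrapy_webarchive.spidermiddlewares.WaczCrawlMiddleware": 543,
        }
    return settings


lookup = wacz_crawl_settings("s3://scrapy-webarchive/archive.wacz")
iteration = wacz_crawl_settings("s3://scrapy-webarchive/archive.wacz", iterate=True)
print(sorted(iteration))
```

The returned dict could, for example, be assigned to a spider's custom_settings attribute so that each spider picks its own strategy.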