Settings

scrapy-webarchive makes use of the following settings, in addition to Scrapy's settings. Note that all the settings are prefixed with SW_.

Extensions

SW_EXPORT_URI

# Either configure the directory where the output should be uploaded to
SW_EXPORT_URI = "s3://scrapy-webarchive/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}/"

# OR add the file name for full control of the output
SW_EXPORT_URI = "s3://scrapy-webarchive/output.wacz"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/output-{timestamp}.wacz"
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}-{timestamp}.wacz"

This is the output path of the WACZ file. Multiple variables can be added that allow dynamic generation of the output path.

Supported variables: spider, year, month, day and timestamp.
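
The variable substitution described above can be sketched with plain string formatting. This is only an illustration, not the library's own code: the helper name expand_export_uri, the zero-padded month and day, and the timestamp format are all assumptions made for the example.

```python
from datetime import datetime, timezone


def expand_export_uri(template: str, spider_name: str, now: datetime) -> str:
    """Fill the supported placeholders in an SW_EXPORT_URI-style template.

    Hypothetical helper: the exact formats the extension uses for
    month, day and timestamp are assumed here, not taken from its source.
    """
    return template.format(
        spider=spider_name,
        year=now.strftime("%Y"),
        month=now.strftime("%m"),  # assumed zero-padded
        day=now.strftime("%d"),  # assumed zero-padded
        timestamp=now.strftime("%Y%m%d%H%M%S"),  # assumed format
    )


now = datetime(2024, 5, 17, 9, 30, 0, tzinfo=timezone.utc)
uri = expand_export_uri(
    "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}-{timestamp}.wacz",
    "quotes",
    now,
)
print(uri)  # s3://scrapy-webarchive/2024/05/17/quotes-20240517093000.wacz
```

Templates that omit some variables work the same way, since unused keyword arguments are simply ignored by str.format.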

SW_WACZ_TITLE

This setting defines the title of the WACZ used in the datapackage.json, which is generated during WACZ creation. It defaults to the spider name if it is not configured.

SW_WACZ_DESCRIPTION

This setting defines the description of the WACZ used in the datapackage.json, which is generated during WACZ creation. Defaults to:

This is the web archive generated by a scrapy-webarchive extension for the spider. It is mainly for scraping purposes as it does not contain any js/css data. Though it can be replayed as bare HTML if the site does not depend on JavaScript.
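
As a minimal sketch, both metadata settings could be overridden in a project's settings.py as follows. The values are placeholders chosen for illustration only.

```python
# settings.py
# Placeholder values; replace them with text that describes your own crawl.
SW_WACZ_TITLE = "Quotes archive"
SW_WACZ_DESCRIPTION = "Responses collected by the quotes spider."
```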

Downloader middleware and spider middleware

SW_WACZ_SOURCE_URI

SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"

# Allows multiple sources, comma separated.
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz,/path/to/archive.wacz"

This setting defines the location of the WACZ file, or files, that should be used as the source for the crawl job.
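
To illustrate the comma-separated format, the sketch below splits such a value into individual URIs. The function name split_source_uris is hypothetical, and tolerating whitespace around the commas is an assumption of this example rather than documented behavior of the middleware.

```python
from typing import List


def split_source_uris(value: str) -> List[str]:
    """Split a comma-separated SW_WACZ_SOURCE_URI value into single URIs.

    Hypothetical helper: stripping whitespace and skipping empty parts
    are choices made for this sketch.
    """
    return [part.strip() for part in value.split(",") if part.strip()]


uris = split_source_uris(
    "s3://scrapy-webarchive/archive.wacz,/path/to/archive.wacz"
)
print(uris)  # ['s3://scrapy-webarchive/archive.wacz', '/path/to/archive.wacz']
```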

SW_WACZ_CRAWL

SW_WACZ_CRAWL = True

When enabled, the spider's original start_requests are ignored and all responses found in the WACZ are yielded instead.

SW_WACZ_TIMEOUT

SW_WACZ_TIMEOUT = 60

Timeout passed as a transport parameter when retrieving the WACZ from the location defined in SW_WACZ_SOURCE_URI.