
Settings

scrapy-webarchive uses the following settings in addition to Scrapy's own. All of them are prefixed with SW_.

Extensions

SW_EXPORT_URI

# Either configure the directory where the output should be uploaded to
SW_EXPORT_URI = "s3://scrapy-webarchive/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}/"

# OR add the file name for full control of the output
SW_EXPORT_URI = "s3://scrapy-webarchive/output.wacz"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/output-{timestamp}.wacz"
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}-{timestamp}.wacz"

# Local (No scheme assumes it is "file://")
SW_EXPORT_URI = "file:///path/to/output/{spider}/"
SW_EXPORT_URI = "/path/to/output/{spider}/"

This setting defines the output path of the WACZ file. The path can contain variables, which are filled in to generate the output location dynamically.

Supported variables: spider, year, month, day and timestamp.
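To illustrate how such a template could resolve, here is a minimal sketch using plain Python string formatting. The helper name `expand_export_uri`, the zero-padded date parts, and the `%Y%m%d%H%M%S` timestamp layout are assumptions made for the example; the values scrapy-webarchive actually substitutes may be formatted differently.

```python
from datetime import datetime, timezone


def expand_export_uri(template: str, spider_name: str, now: datetime) -> str:
    """Fill the supported placeholders in an SW_EXPORT_URI-style template.

    Hypothetical helper: the formats below are assumed, not taken from
    scrapy-webarchive itself.
    """
    return template.format(
        spider=spider_name,
        year=now.strftime("%Y"),
        month=now.strftime("%m"),  # assumed zero-padded
        day=now.strftime("%d"),  # assumed zero-padded
        timestamp=now.strftime("%Y%m%d%H%M%S"),  # assumed layout
    )


now = datetime(2024, 1, 31, 12, 0, 0, tzinfo=timezone.utc)
uri = expand_export_uri(
    "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}-{timestamp}.wacz",
    "quotes",
    now,
)
print(uri)
```

A template without any placeholders, such as "s3://scrapy-webarchive/output.wacz", passes through this helper unchanged.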

SW_WACZ_TITLE

This setting defines the title of the WACZ used in the datapackage.json, which is generated during WACZ creation. It defaults to the spider name if it is not configured.

SW_WACZ_DESCRIPTION

This setting defines the description of the WACZ used in the datapackage.json, which is generated during WACZ creation. Defaults to:

This is the web archive generated by a scrapy-webarchive extension for the spider. It is mainly for scraping purposes as it does not contain any js/css data. Though it can be replayed as bare HTML if the site does not depend on JavaScript.

Downloader middleware and spider middleware

SW_WACZ_SOURCE_URI

⚠️ Scraping against a remote source currently only supports AWS S3.

# "file://" must be added explicitly; unlike SW_EXPORT_URI, no scheme is assumed when it is missing.
SW_WACZ_SOURCE_URI = "file:///Users/username/Documents/archive.wacz"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"

# Allows multiple sources, comma-separated.
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz,file:///Users/username/Documents/archive.wacz"

This setting defines the location of the WACZ file that should be used as a source for the crawl job.
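As a rough sketch of what the rules above imply for a settings value, the snippet below splits a comma-separated string and rejects entries without an explicit "file" or "s3" scheme. The function name `split_source_uris` and the error message are hypothetical; this is not the library's own parsing code, only an illustration of the documented constraints.

```python
from urllib.parse import urlparse

# Schemes mentioned in this section: local files and AWS S3.
SUPPORTED_SCHEMES = {"file", "s3"}


def split_source_uris(value: str) -> list[str]:
    """Split an SW_WACZ_SOURCE_URI-style value and check each scheme.

    Hypothetical helper written for illustration only.
    """
    uris = [part.strip() for part in value.split(",") if part.strip()]
    for uri in uris:
        scheme = urlparse(uri).scheme
        if scheme not in SUPPORTED_SCHEMES:
            # A bare path such as "/tmp/archive.wacz" has an empty scheme.
            raise ValueError(f"Unsupported or missing scheme in {uri!r}")
    return uris


sources = split_source_uris(
    "s3://scrapy-webarchive/archive.wacz,file:///Users/username/Documents/archive.wacz"
)
print(sources)
```

Validating the value before a crawl starts surfaces a missing "file://" prefix early, rather than partway through a job.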

SW_WACZ_CRAWL

SW_WACZ_CRAWL = True

When enabled, the spider's original start_requests are ignored and all responses found in the WACZ are yielded instead.