
Settings

scrapy-webarchive makes use of the following settings, in addition to Scrapy's settings. Note that all the settings are prefixed with SW_.

Extensions

SW_EXPORT_URI

settings.py
# Either configure the directory where the output should be uploaded to
SW_EXPORT_URI = "s3://scrapy-webarchive/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}/"

# OR add the file name for full control of the output
SW_EXPORT_URI = "s3://scrapy-webarchive/output.wacz"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/output-{timestamp}.wacz"
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}-{timestamp}.wacz"

# Local (no scheme is assumed to be "file://")
SW_EXPORT_URI = "file:///path/to/output/{spider}/"
SW_EXPORT_URI = "/path/to/output/{spider}/"

This is the output path of the WACZ file. Variables can be used in the path to generate it dynamically.

Supported variables: spider, year, month, day and timestamp.
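
For instance (an illustrative sketch: a spider named quotes run on 1 January 2025; the exact timestamp format is determined by the extension), a templated URI could resolve as follows:

settings.py
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}-{timestamp}.wacz"
# could resolve to something like:
# s3://scrapy-webarchive/2025/01/01/quotes-20250101120000.wacz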

SW_WACZ_TITLE

This setting defines the title of the WACZ, used in the datapackage.json that is generated during WACZ creation. If not configured, it defaults to the spider name.
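
For example (the title value here is illustrative):

settings.py
SW_WACZ_TITLE = "Quotes archive"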

SW_WACZ_DESCRIPTION

This setting defines the description of the WACZ, used in the datapackage.json that is generated during WACZ creation. If not configured, it defaults to:

This is the web archive generated by a scrapy-webarchive extension for the spider. It is mainly for scraping purposes as it does not contain any js/css data. Though it can be replayed as bare HTML if the site does not depend on JavaScript.
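
To override the default, set a custom description (the value here is illustrative):

settings.py
SW_WACZ_DESCRIPTION = "Web archive of quotes.toscrape.com, generated for offline re-scraping."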

Downloader middleware and spider middleware

SW_WACZ_SOURCE_URI

⚠️ Scraping against a remote source currently only supports AWS S3.

settings.py
# "file://" must be explicitly added, unlike SW_EXPORT_URI where it makes an assumption if no scheme is added.
SW_WACZ_SOURCE_URI = "file:///Users/username/Documents/archive.wacz"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"

# Multiple sources are allowed, comma-separated.
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz,file:///Users/username/Documents/archive.wacz"

This setting defines the location of the WACZ file that should be used as a source for the crawl job.

SW_WACZ_CRAWL

settings.py
SW_WACZ_CRAWL = True

When enabled, the original start_requests are ignored and all responses found in the WACZ are yielded instead. For more information, see Iterating a WACZ archive index.

SW_WACZ_LOOKUP_STRATEGY

settings.py
SW_WACZ_LOOKUP_STRATEGY = "after"

A setting that can be used in combination with SW_EXPORT_URI and SW_WACZ_LOOKUP_TARGET to automatically resolve the most relevant archive for scraping.

The supported strategies are after and before. Custom strategies can also be implemented; see Custom strategies.

SW_WACZ_LOOKUP_TARGET

This setting is used in combination with SW_WACZ_LOOKUP_STRATEGY to determine the most relevant archive for scraping based on the specified time. The value must be an ISO 8601 formatted timestamp (YYYY-MM-DDTHH:MM:SS), representing the preferred point in time for file selection.

settings.py
SW_WACZ_LOOKUP_TARGET = "2025-01-01T00:00:00"

How It Works

When SW_EXPORT_URI is set, SW_WACZ_LOOKUP_TARGET is used to locate the file closest in time to the given timestamp. The strategy defined in SW_WACZ_LOOKUP_STRATEGY determines how files are selected relative to that timestamp. All three settings are required to enable this feature (see the combined example after the scenarios below).

⚠️ When SW_WACZ_SOURCE_URI is set, these settings won't have any effect.

Example Scenarios

  • before strategy: Selects the file with the closest last modified date before or exactly at the given timestamp.
  • after strategy: Selects the file with the closest last modified date after or exactly at the given timestamp.
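
As an illustrative sketch, the following combination selects the export file whose last modified date is closest to, and not later than, midnight on 1 January 2025:

settings.py
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/"
SW_WACZ_LOOKUP_STRATEGY = "before"
SW_WACZ_LOOKUP_TARGET = "2025-01-01T00:00:00"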