Settings
scrapy-webarchive makes use of the following settings, in addition to Scrapy's settings. Note that all the settings are prefixed with SW_.
Extensions
SW_EXPORT_URI
# Either configure the directory where the output should be uploaded to
SW_EXPORT_URI = "s3://scrapy-webarchive/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}/"
# OR add the file name for full control of the output
SW_EXPORT_URI = "s3://scrapy-webarchive/output.wacz"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/output-{timestamp}.wacz"
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}-{timestamp}.wacz"
# Local (No scheme assumes it is "file://")
SW_EXPORT_URI = "file:///path/to/output/{spider}/"
SW_EXPORT_URI = "/path/to/output/{spider}/"
This is the output path of the WACZ file. The path can contain variables, which allows it to be generated dynamically.
Supported variables: spider, year, month, day and timestamp.
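As a rough illustration of how such a template expands, the sketch below fills the supported variables using Python's str.format. It is not the extension's actual implementation: the helper name resolve_export_uri, the zero-padded date parts, and the timestamp format are all assumptions made for this example.

```python
from datetime import datetime, timezone


def resolve_export_uri(template: str, spider_name: str, now: datetime) -> str:
    """Hypothetical helper: fill the variables supported by SW_EXPORT_URI."""
    return template.format(
        spider=spider_name,
        year=now.strftime("%Y"),
        month=now.strftime("%m"),  # zero-padding is an assumption
        day=now.strftime("%d"),
        timestamp=now.strftime("%Y%m%d%H%M%S"),  # assumed timestamp format
    )


now = datetime(2024, 5, 17, 9, 30, 0, tzinfo=timezone.utc)
uri = resolve_export_uri(
    "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}-{timestamp}.wacz",
    "quotes",
    now,
)
print(uri)
```

Templates that use only some of the variables, such as "/path/to/output/{spider}/", work the same way because str.format ignores unused keyword arguments.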
SW_WACZ_TITLE
This setting defines the title of the WACZ used in the datapackage.json, which is generated during the WACZ creation. It will default to the spider name if it is not configured.
SW_WACZ_DESCRIPTION
This setting defines the description of the WACZ used in the datapackage.json, which is generated during the WACZ creation. Defaults to:
This is the web archive generated by a scrapy-webarchive extension for the spider. It is mainly for scraping purposes as it does not contain any js/css data. Though it can be replayed as bare HTML if the site does not depend on JavaScript.
Downloader middleware and spider middleware
SW_WACZ_SOURCE_URI
⚠️ Scraping against a remote source currently only supports AWS S3.
# "file://" must be explicitly added, unlike SW_EXPORT_URI where it makes an assumption if no scheme is added.
SW_WACZ_SOURCE_URI = "file:///Users/username/Documents/archive.wacz"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
# Allows multiple sources, comma separated.
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz,file:///Users/username/Documents/archive.wacz"
This setting defines the location of the WACZ file that should be used as a source for the crawl job.
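To make the comma-separated format concrete, here is a minimal sketch of how such a value could be split into individual sources and their schemes. The function split_sources is hypothetical and only illustrates the format; it is not how the middleware itself parses the setting.

```python
from urllib.parse import urlparse


def split_sources(value: str) -> list[tuple[str, str]]:
    """Hypothetical helper: return (scheme, uri) pairs from a comma-separated value."""
    pairs = []
    for uri in value.split(","):
        uri = uri.strip()
        if not uri:
            continue  # skip empty entries such as a trailing comma
        pairs.append((urlparse(uri).scheme, uri))
    return pairs


sources = split_sources(
    "s3://scrapy-webarchive/archive.wacz,file:///Users/username/Documents/archive.wacz"
)
for scheme, uri in sources:
    print(scheme, uri)
```

An entry without a scheme yields an empty scheme string here, which matches the note above that "file://" must be written out explicitly for this setting.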
SW_WACZ_CRAWL
Set this to ignore the spider's original start_requests and yield all responses found in the WACZ instead. For more information, see Iterating a WACZ archive index.
SW_WACZ_LOOKUP_STRATEGY
A setting that can be used in combination with SW_EXPORT_URI and SW_WACZ_LOOKUP_TARGET to automatically resolve the most relevant archive for scraping.
Supported strategies include after and before. Custom strategies can also be implemented; see Custom strategies.
SW_WACZ_LOOKUP_TARGET
This setting is used in combination with SW_WACZ_LOOKUP_STRATEGY to determine the most relevant archive for scraping based on the specified time. The value must be an ISO 8601 formatted timestamp (YYYY-MM-DDTHH:MM:SS), representing the preferred point in time for file selection.
How It Works
When SW_EXPORT_URI is set, SW_WACZ_LOOKUP_TARGET helps locate the closest matching file based on time.
The strategy defined in SW_WACZ_LOOKUP_STRATEGY determines how files are selected relative to this timestamp.
All three settings are required to enable this feature.
⚠️ When SW_WACZ_SOURCE_URI is set, these settings won't have any effect.
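A settings fragment combining the three might look like the sketch below. The bucket path and the target date are placeholders, and passing the strategy as the plain string "before" is an assumption based on the strategy names listed above. The final line is only a sanity check that the target is in the documented ISO 8601 format.

```python
from datetime import datetime

# All three settings are needed for the lookup to be enabled.
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/"
SW_WACZ_LOOKUP_STRATEGY = "before"  # assumed to be given as a plain string
SW_WACZ_LOOKUP_TARGET = "2024-01-01T00:00:00"  # YYYY-MM-DDTHH:MM:SS

# Optional sanity check: raises ValueError if the timestamp is malformed.
parsed_target = datetime.fromisoformat(SW_WACZ_LOOKUP_TARGET)
```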
Example Scenarios
before strategy: Selects the file with the closest last modified date before or exactly at the given timestamp.
after strategy: Selects the file with the closest last modified date after or exactly at the given timestamp.
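The two strategies can be sketched as a small selection function over (file name, last modified) pairs. This is an illustrative reading of the descriptions above, not the library's code; the function pick_archive and the sample file names are hypothetical.

```python
from datetime import datetime
from typing import Optional


def pick_archive(
    files: list[tuple[str, datetime]], target: datetime, strategy: str
) -> Optional[str]:
    """Hypothetical helper: choose a file name by last modified date."""
    if strategy == "before":
        # Latest file modified at or before the target.
        candidates = [(modified, name) for name, modified in files if modified <= target]
        return max(candidates)[1] if candidates else None
    if strategy == "after":
        # Earliest file modified at or after the target.
        candidates = [(modified, name) for name, modified in files if modified >= target]
        return min(candidates)[1] if candidates else None
    raise ValueError(f"unknown strategy: {strategy}")


files = [
    ("a.wacz", datetime(2024, 1, 1)),
    ("b.wacz", datetime(2024, 2, 1)),
    ("c.wacz", datetime(2024, 3, 1)),
]
target = datetime(2024, 2, 15)
print(pick_archive(files, target, "before"))
print(pick_archive(files, target, "after"))
```

With a target of 2024-02-15, "before" selects b.wacz and "after" selects c.wacz. A target that exactly matches a file's last modified date selects that file under either strategy.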