Settings
scrapy-webarchive makes use of the following settings, in addition to Scrapy's settings. Note that all the settings are prefixed with SW_.
Extensions
SW_EXPORT_URI
# Either configure the directory where the output should be uploaded to
SW_EXPORT_URI = "s3://scrapy-webarchive/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}/"
# OR add the file name for full control of the output
SW_EXPORT_URI = "s3://scrapy-webarchive/output.wacz"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/output-{timestamp}.wacz"
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/{spider}-{timestamp}.wacz"
# Local (No scheme assumes it is "file://")
SW_EXPORT_URI = "file:///path/to/output/{spider}/"
SW_EXPORT_URI = "/path/to/output/{spider}/"
This is the output path of the WACZ file. Multiple variables can be added that allow dynamic generation of the output path.
Supported variables: spider, year, month, day and timestamp.
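To illustrate how such a template could resolve, here is a minimal sketch. The helper expand_export_uri is hypothetical and not part of scrapy-webarchive. The zero-padded date parts and the timestamp format are assumptions, so the extension's actual output may differ.

```python
from datetime import datetime


def expand_export_uri(template: str, spider_name: str, now: datetime) -> str:
    """Hypothetical helper: fill the supported variables into a URI template."""
    values = {
        "spider": spider_name,
        # Assumption: date parts are zero-padded.
        "year": now.strftime("%Y"),
        "month": now.strftime("%m"),
        "day": now.strftime("%d"),
        # Assumption: the real timestamp format may be different.
        "timestamp": now.strftime("%Y%m%d%H%M%S"),
    }
    return template.format(**values)


now = datetime(2024, 5, 17, 9, 30, 0)
print(expand_export_uri("s3://scrapy-webarchive/{year}/{month}/{day}/{spider}/", "quotes", now))
```

Templates that use only some of the variables resolve the same way, since unused values are simply ignored.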
SW_WACZ_TITLE
This setting defines the title of the WACZ used in the datapackage.json, which is generated during the WACZ creation. It defaults to the spider name if it is not configured.
SW_WACZ_DESCRIPTION
This setting defines the description of the WACZ used in the datapackage.json, which is generated during the WACZ creation. If it is not configured, it defaults to:
This is the web archive generated by a scrapy-webarchive extension for the <spider name> spider. It is mainly for scraping purposes as it does not contain any js/css data. Though it can be replayed as bare HTML if the site does not depend on JavaScript.
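Both metadata settings can be overridden in the project settings. The values below are purely illustrative:

```python
# settings.py (example values, chosen only for illustration)
SW_WACZ_TITLE = "Quotes archive"
SW_WACZ_DESCRIPTION = "HTML-only crawl of an example quotes site."
```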
Downloader middleware and spider middleware
SW_WACZ_SOURCE_URI
⚠️ Scraping against a remote source currently only supports AWS S3.
# "file://" must be explicitly added, unlike SW_EXPORT_URI where it makes an assumption if no scheme is added.
SW_WACZ_SOURCE_URI = "file:///Users/username/Documents/archive.wacz"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
# Allows multiple sources, comma separated.
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz,file:///Users/username/Documents/archive.wacz"
This setting defines the location of the WACZ file that should be used as a source for the crawl job.
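As a rough sketch of what a comma-separated value with explicit schemes means in practice, the snippet below splits such a value and rejects entries without a scheme. The function split_source_uris is hypothetical and only mirrors the rules described here. It is not the library's own parsing code.

```python
from urllib.parse import urlparse


def split_source_uris(setting_value: str) -> list[str]:
    """Hypothetical helper: split a comma-separated SW_WACZ_SOURCE_URI value."""
    uris = [part.strip() for part in setting_value.split(",") if part.strip()]
    for uri in uris:
        # Unlike SW_EXPORT_URI, a scheme such as "file://" or "s3://" is required.
        if not urlparse(uri).scheme:
            raise ValueError(f"Missing scheme in source URI: {uri!r}")
    return uris


sources = split_source_uris(
    "s3://scrapy-webarchive/archive.wacz,file:///Users/username/Documents/archive.wacz"
)
print(sources)
```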
SW_WACZ_CRAWL
SW_WACZ_CRAWL = True
When enabled, the original start_requests are ignored and all responses found in the WACZ are yielded instead.
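Put together, a settings fragment for crawling straight from an archive might look like the following. It assumes the scrapy-webarchive middlewares are already enabled in the project, and the bucket path is the same example value used above.

```python
# settings.py (sketch; assumes the scrapy-webarchive middlewares are enabled)
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True  # skip start_requests and yield the archived responses
```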