Advanced usage

Crawling

Skipping specific requests

The wacz_crawl_skip flag is applied to requests that should be ignored by the crawler. When this flag is present, the middleware intercepts the request and prevents it from being processed further, skipping both download and parsing. This is useful in scenarios where the request should not be collected during a scraping session. Usage:

yield Request(url, callback=cb_func, flags=["wacz_crawl_skip"])

When this happens, the statistic webarchive/crawl_skip is increased.

Disallowing archived URLs

If the spider has the attribute archive_disallow_regexp, all requests returned from the spider that match this regular expression are ignored. For example, suppose a product page was returned in start_requests, but the product has since disappeared and its URL now redirects to the category page. Disallowing the category page avoids crawling the whole category, which would take much longer and could lead to unknown URLs (e.g. the spider's requested pagination size could differ from the website default).

When this happens, the statistic wacz/crawl_skip/disallowed is increased.
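
As a rough illustration of the product-page scenario above, the snippet below checks a few URLs against a disallow pattern using only the standard library. The pattern, the URLs, and the is_disallowed helper are hypothetical, and the use of re.search is an assumption about how the expression is applied, not a description of the middleware's code. On a real spider, the pattern would be set as the archive_disallow_regexp attribute.

```python
import re

# Hypothetical pattern: block category listings, keep product pages.
ARCHIVE_DISALLOW_REGEXP = r"/category/"


def is_disallowed(url: str, pattern: str = ARCHIVE_DISALLOW_REGEXP) -> bool:
    """Return True when the URL matches the disallow pattern.

    Assumption: the pattern is applied with re.search, so it may match
    anywhere in the URL.
    """
    return re.search(pattern, url) is not None


# Hypothetical URLs: a product page and the category page it redirects to.
urls = [
    "https://shop.example/product/123",
    "https://shop.example/category/shoes?page=2",
]
skipped = [url for url in urls if is_disallowed(url)]
```

Testing a pattern like this against a handful of known URLs before running the spider is a cheap way to confirm that it blocks the redirect target without also blocking the pages you want.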

Iterating a WACZ archive index

When using a WACZ file that was not generated by your own spiders, you may not have a spider suited to crawling it. To crawl such a WACZ, you need to tailor a spider to that specific file, which means building the spider differently from how you would for a live resource.

To bypass the spider's default behaviour, the WaczCrawlMiddleware spider middleware, when enabled, replaces the crawl with an iteration over all entries in the WACZ archive index.

Configuration

To use this strategy, enable both the spider middleware and the downloader middleware in the spider settings:

settings.py

DOWNLOADER_MIDDLEWARES = {
    "scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
    "scrapy_webarchive.spidermiddlewares.WaczCrawlMiddleware": 543,
}

Then define the location of the WACZ archive with the SW_WACZ_SOURCE_URI setting and set SW_WACZ_CRAWL to True:

settings.py

SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True

Controlling the crawl

Not all URLs will be interesting for the crawl, since your WACZ will most likely contain static files such as fonts, JavaScript (from the website and from external sources), stylesheets, etc. To improve the spider's performance by skipping irrelevant request/response entries, you can set the archive_regex attribute on your spider:

my_wacz_spider.py

from scrapy.spiders import Spider


class MyWaczSpider(Spider):
    name = "myspider"
    archive_regex = r"^/tag/[\w-]+/$"

If the spider has an archive_regex attribute, only response URLs matching this regular expression are presented in start_requests. To visualise this, the spider above will only crawl the CDXJ records marked with > below:

com,toscrape,quotes)/favicon.ico 20241007081411465 {...}
com,gstatic,fonts)/s/raleway/v34/1ptug8zys_skggpnyc0it4ttdfa.woff2 {...}
com,googleapis,fonts)/css?family=raleway%3A400%2C700 20241007081525229 {...}
com,toscrape,quotes)/static/bootstrap.min.css 20241007081525202 {...}
com,toscrape,quotes)/static/main.css 20241007081525074 {...}
> com,toscrape,quotes)/tag/books/ 20241007081513898 {...}
> com,toscrape,quotes)/tag/friends/ 20241007081520928 {...}
> com,toscrape,quotes)/tag/friendship/ 20241007081519648 {...}
> com,toscrape,quotes)/tag/humor/ 20241007081512594 {...}
> com,toscrape,quotes)/tag/inspirational/ 20241007081506990 {...}
> com,toscrape,quotes)/tag/life/ 20241007081510349 {...}
> com,toscrape,quotes)/tag/love/ 20241007081503814 {...}
> com,toscrape,quotes)/tag/reading/ 20241007081516781 {...}
> com,toscrape,quotes)/tag/simile/ 20241007081524944 {...}
> com,toscrape,quotes)/tag/truth/ 20241007081523804 {...}
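
The selection above can be reproduced with a short standard-library sketch. It takes a few of the CDXJ lines shown, extracts the path from each SURT key, and applies the pattern from the spider. The surt_path helper is hypothetical, and matching against the path alone is an assumption made for illustration; it does not describe how the middleware itself applies archive_regex.

```python
import re

# Pattern taken from the spider example above.
ARCHIVE_REGEX = r"^/tag/[\w-]+/$"

# A few of the index records listed above.
cdxj_lines = [
    "com,toscrape,quotes)/favicon.ico 20241007081411465 {...}",
    "com,toscrape,quotes)/static/main.css 20241007081525074 {...}",
    "com,toscrape,quotes)/tag/books/ 20241007081513898 {...}",
    "com,toscrape,quotes)/tag/love/ 20241007081503814 {...}",
]


def surt_path(line: str) -> str:
    """Return the path part of a CDXJ line's SURT key (the text after ')')."""
    surt_key = line.split(" ", 1)[0]
    return surt_key.split(")", 1)[1]


# Keep only the records whose path matches the pattern.
selected = [line for line in cdxj_lines if re.search(ARCHIVE_REGEX, surt_path(line))]
selected_paths = [surt_path(line) for line in selected]
```

Here only the two /tag/ records are kept, while the favicon and stylesheet entries are dropped.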

Requests and Responses

Special Keys in Request.meta

The Request.meta attribute in Scrapy allows you to store arbitrary data for use during the crawling process. While you can store any custom data in this attribute, Scrapy and its built-in extensions recognize certain special keys. Additionally, the scrapy-webarchive extension introduces its own special key for managing metadata. Below is a description of the key used by scrapy-webarchive:

webarchive_warc

This key stores the result of a WACZ crawl or export. The data associated with this key is read-only and is not used to control Scrapy's behavior. The value of this key can be accessed using the constant WEBARCHIVE_META_KEY, but direct usage of this constant is discouraged. Instead, you should use the provided class method to instantiate a metadata object, as shown in the example below:

my_wacz_spider.py

from scrapy.spiders import Spider
from scrapy_webarchive.models import WarcMetadata


class MyWaczSpider(Spider):
    name = "myspider"

    def parse_function(self, response):
        # Instantiate a WarcMetadata object from the response
        warc_meta = WarcMetadata.from_response(response)

        # Extract the attributes to attach while parsing a page/item
        if warc_meta:
            yield {
                'warc_record_id': warc_meta.record_id,
                'wacz_uri': warc_meta.wacz_uri,
            }

Extending

Custom strategies

You can extend scrapy-webarchive by adding your own custom file lookup strategy. This allows you to define a custom way to select files based on available file information (URI and last modified timestamp).

To create a new strategy, define a class that implements the FileLookupStrategy interface and register it with StrategyRegistry.

strategies.py

from typing import List, Optional

from scrapy_webarchive.models import FileInfo
from scrapy_webarchive.strategies import StrategyRegistry


@StrategyRegistry.register("custom")
class CustomStrategy:
    def find(self, files: List[FileInfo], target_time: float) -> Optional[str]:
        # Logic goes here; it should return a single URI or None.
        # For examples, see scrapy_webarchive.strategies.
        raise NotImplementedError
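
As a sketch of what the lookup logic might look like, the example below selects the most recently modified file at or before the target time. To stay self-contained it defines a hypothetical stand-in for FileInfo; the field names uri and last_modified are assumptions based on the description above (URI and last modified timestamp), and LatestBeforeStrategy is an illustrative name, not a strategy shipped with the library. The registration decorator is omitted so the snippet runs without scrapy-webarchive installed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileInfo:
    """Hypothetical stand-in for scrapy_webarchive.models.FileInfo."""

    uri: str
    last_modified: float  # assumed to be a Unix timestamp


class LatestBeforeStrategy:
    """Pick the most recently modified file at or before target_time."""

    def find(self, files: List[FileInfo], target_time: float) -> Optional[str]:
        candidates = [f for f in files if f.last_modified <= target_time]
        if not candidates:
            return None  # nothing old enough to qualify
        return max(candidates, key=lambda f: f.last_modified).uri


# Hypothetical archive listing with simplified timestamps.
files = [
    FileInfo("s3://bucket/a.wacz", 100.0),
    FileInfo("s3://bucket/b.wacz", 200.0),
    FileInfo("s3://bucket/c.wacz", 300.0),
]
chosen = LatestBeforeStrategy().find(files, target_time=250.0)
```

Returning None when no file qualifies matches the Optional[str] return type in the signature above and leaves the caller to handle a missing archive.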

Once registered, you can use your strategy by setting SW_WACZ_LOOKUP_STRATEGY in your Scrapy settings:

settings.py

SW_WACZ_LOOKUP_STRATEGY = "custom"

If you define it inside a strategies.py module within your Scrapy project, it will be discovered automatically. Alternatively, you can import it somewhere in your project, such as in your settings.py or middlewares.py:

settings.py

import my_project.custom_strategies