Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs: describe usage for different middlewares and extension
settings: add prefix for namespaced settings
Wesley van Lee committed Oct 14, 2024
1 parent 8af1209, commit 8a58867
Showing 9 changed files with 104 additions and 31 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,37 +1,45 @@ | ||
# Settings | ||
|
||
`scrapy-webarchive` makes use of the following settings, in addition to Scrapy's settings: | ||
`scrapy-webarchive` makes use of the following settings, in addition to Scrapy's settings. Note that all the settings are prefixed with `SW_`. | ||
|
||
## Extensions | ||
|
||
### `ARCHIVE_EXPORT_URI` | ||
### `SW_EXPORT_URI` | ||
|
||
```python | ||
ARCHIVE_EXPORT_URI = "s3://scrapy-webarchive/" | ||
ARCHIVE_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/" | ||
SW_EXPORT_URI = "s3://scrapy-webarchive/" | ||
SW_EXPORT_URI = "s3://scrapy-webarchive/{year}/{month}/{day}/" | ||
``` | ||
|
||
This is the output path of the WACZ file. The path may contain variables, which allows it to be generated dynamically. | ||
|
||
Supported variables: `year`, `month`, `day` and `timestamp`. | ||
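A minimal sketch of how such placeholders could be expanded, assuming plain `str.format` substitution. The helper name `render_export_uri` and the `timestamp` format are hypothetical and only serve as illustration; the extension's actual implementation may differ.

```python
from datetime import datetime, timezone


def render_export_uri(template: str, now: datetime) -> str:
    """Fill the supported placeholders in an export URI template (illustrative only)."""
    return template.format(
        year=now.strftime("%Y"),
        month=now.strftime("%m"),
        day=now.strftime("%d"),
        # Assumed format; the real extension may format the timestamp differently.
        timestamp=now.strftime("%Y%m%d%H%M%S"),
    )


now = datetime(2024, 10, 14, 12, 0, 0, tzinfo=timezone.utc)
print(render_export_uri("s3://scrapy-webarchive/{year}/{month}/{day}/", now))
# prints: s3://scrapy-webarchive/2024/10/14/
```

A template without placeholders, such as `"s3://scrapy-webarchive/"`, passes through this helper unchanged.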
|
||
## Downloader middleware | ||
## Downloader middleware and spider middleware | ||
|
||
### `WACZ_SOURCE_URL` | ||
### `SW_WACZ_SOURCE_URL` | ||
|
||
```python | ||
WACZ_SOURCE_URL = "s3://scrapy-webarchive/archive.wacz" | ||
SW_WACZ_SOURCE_URL = "s3://scrapy-webarchive/archive.wacz" | ||
|
||
# Allows multiple sources, comma separated. | ||
WACZ_SOURCE_URL = "s3://scrapy-webarchive/archive.wacz,/path/to/archive.wacz" | ||
SW_WACZ_SOURCE_URL = "s3://scrapy-webarchive/archive.wacz,/path/to/archive.wacz" | ||
``` | ||
|
||
This setting defines the location of the WACZ file that should be used as a source for the crawl job. | ||
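To illustrate the comma-separated form, here is a small sketch of how such a value could be split into individual locations. The helper `split_wacz_sources` is hypothetical and not part of `scrapy-webarchive`; how the library parses the setting internally is an assumption.

```python
def split_wacz_sources(value: str) -> list[str]:
    """Split a comma-separated SW_WACZ_SOURCE_URL value into individual locations."""
    # Strip surrounding whitespace and drop empty entries (e.g. from a trailing comma).
    return [part.strip() for part in value.split(",") if part.strip()]


sources = split_wacz_sources("s3://scrapy-webarchive/archive.wacz,/path/to/archive.wacz")
print(sources)
# prints: ['s3://scrapy-webarchive/archive.wacz', '/path/to/archive.wacz']
```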
|
||
### `WACZ_CRAWL` | ||
### `SW_WACZ_CRAWL` | ||
|
||
```python | ||
WACZ_CRAWL = True | ||
SW_WACZ_CRAWL = True | ||
``` | ||
|
||
When enabled, the spider's original `start_requests` are ignored and all responses found in the WACZ archive are yielded instead. | ||
|
||
### `SW_WACZ_TIMEOUT` | ||
|
||
```python | ||
SW_WACZ_TIMEOUT = 60 | ||
``` | ||
|
||
Timeout passed to the transport when retrieving the WACZ file defined by `SW_WACZ_SOURCE_URL` from its location. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
# Usage | ||
|
||
## Exporting | ||
|
||
### Exporting a WACZ archive | ||
|
||
To archive the requests/responses during a crawl job you need to enable the `WaczExporter` extension. | ||
|
||
```python | ||
EXTENSIONS = { | ||
"scrapy_webarchive.extensions.WaczExporter": 543, | ||
} | ||
``` | ||
|
||
This extension also requires you to set the export location using the `SW_EXPORT_URI` setting. | ||
|
||
```python | ||
SW_EXPORT_URI = "s3://scrapy-webarchive/" | ||
``` | ||
|
||
Running a crawl job using these settings will result in a newly created WACZ file. | ||
|
||
## Crawling | ||
|
||
There are two ways to crawl against a WACZ archive. Choose the strategy you want to use for your crawl job and follow the instructions below. Using both strategies at the same time is not allowed. | ||
|
||
### Lookup in a WACZ archive | ||
|
||
One of the ways to crawl against a WACZ archive is to use the `WaczMiddleware` downloader middleware. Instead of fetching the live resource, the middleware retrieves it from the archive and recreates a response from the archived data. | ||
|
||
To use the downloader middleware, enable it in the settings like so: | ||
|
||
```python | ||
DOWNLOADER_MIDDLEWARES = { | ||
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543, | ||
} | ||
``` | ||
|
||
Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URL` setting: | ||
|
||
```python | ||
SW_WACZ_SOURCE_URL = "s3://scrapy-webarchive/archive.wacz" | ||
``` | ||
|
||
### Iterating a WACZ archive | ||
|
||
When enabled, the `WaczCrawlMiddleware` spider middleware bypasses the spider's default behaviour and replaces the crawl with an iteration over all entries in the WACZ archive. | ||
|
||
To use the spider middleware, enable it in the settings like so: | ||
|
||
```python | ||
SPIDER_MIDDLEWARES = { | ||
"scrapy_webarchive.middleware.WaczCrawlMiddleware": 532, | ||
} | ||
``` | ||
|
||
Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URL` setting and enable `SW_WACZ_CRAWL`: | ||
|
||
```python | ||
SW_WACZ_SOURCE_URL = "s3://scrapy-webarchive/archive.wacz" | ||
SW_WACZ_CRAWL = True | ||
``` |
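Because the two crawl strategies must not be combined, a project could guard against enabling both by accident. The following is a sketch only: `check_single_strategy` is a hypothetical helper, not something `scrapy-webarchive` provides, and it works on a plain dict of settings. The class paths are copied from the settings examples in this document.

```python
# Class paths as they appear in the settings examples of this document.
DOWNLOADER_MW = "scrapy_webarchive.downloadermiddlewares.WaczMiddleware"
SPIDER_MW = "scrapy_webarchive.middleware.WaczCrawlMiddleware"


def check_single_strategy(settings: dict) -> None:
    """Raise ValueError if both WACZ crawl strategies are enabled (hypothetical helper)."""
    # In Scrapy, a middleware mapped to None is disabled, so only count non-None entries.
    lookup = settings.get("DOWNLOADER_MIDDLEWARES", {}).get(DOWNLOADER_MW) is not None
    iterate = settings.get("SPIDER_MIDDLEWARES", {}).get(SPIDER_MW) is not None
    if lookup and iterate:
        raise ValueError("Enable either WaczMiddleware or WaczCrawlMiddleware, not both.")


# Only the lookup strategy is enabled, so this passes silently.
check_single_strategy({"DOWNLOADER_MIDDLEWARES": {DOWNLOADER_MW: 543}})
```

Calling the helper with both middlewares enabled raises a `ValueError`, which makes the misconfiguration visible before the crawl starts.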