Last task of radiozamaneh.com_persian takes too long #1207

Open
benoit74 opened this issue Nov 14, 2024 · 1 comment
Labels
Bug Something isn't working

Comments

@benoit74
Contributor

Recipe URL

https://farm.openzim.org/recipes/radiozamaneh.com_persian

Task URL

https://farm.openzim.org/pipeline/38678175-a393-4a75-a172-0022cd2f863f

Details

Every page seems to follow the same pattern:

{"timestamp":"2024-11-14T11:49:21.669Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.radiozamaneh.com/701411/"}}
{"timestamp":"2024-11-14T11:49:21.670Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":2368,"total":6360,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-11-14T11:49:21.669Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.radiozamaneh.com\\/701411\\/\",\"added\":\"2024-11-13T12:55:31.241Z\",\"depth\":3}"]}}
{"timestamp":"2024-11-14T11:49:22.123Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.radiozamaneh.com/701411/","workerid":0}}
{"timestamp":"2024-11-14T11:49:24.084Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":1,"frameUrls":["https://www.radiozamaneh.com/701411/"],"page":"https://www.radiozamaneh.com/701411/","workerid":0}}
{"timestamp":"2024-11-14T11:49:24.084Z","logLevel":"info","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://www.radiozamaneh.com/701411/","page":"https://www.radiozamaneh.com/701411/","workerid":0}}
{"timestamp":"2024-11-14T11:50:40.531Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://www.radiozamaneh.info/wp-json/wp/v2/comments","errorText":"net::ERR_CONNECTION_TIMED_OUT","type":"Fetch","status":0,"page":"https://www.radiozamaneh.com/701411/","workerid":0}}
{"timestamp":"2024-11-14T11:50:40.531Z","logLevel":"info","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://www.radiozamaneh.com/701411/","page":"https://www.radiozamaneh.com/701411/","workerid":0}}
{"timestamp":"2024-11-14T11:50:40.531Z","logLevel":"info","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"https://www.radiozamaneh.com/701411/","workerid":0}}
{"timestamp":"2024-11-14T11:50:41.533Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://www.radiozamaneh.com/701411/","workerid":0}}

That is, every page hits ERR_CONNECTION_TIMED_OUT on https://www.radiozamaneh.info/wp-json/wp/v2/comments. The request takes more than one minute to time out (about 76 seconds between "Run Script Started" and "Run Script Finished" in the log above), so the crawl is very slow.
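To put a number on the slowdown, here is a minimal sketch that measures the time spent in behaviors for a page and extrapolates to the rest of the crawl. The log lines are copied from the excerpt above but trimmed to the fields used. The function names are hypothetical, and the estimate assumes a single worker and that every remaining page behaves like this one.

```python
import json
from datetime import datetime

# Two entries from the log above, trimmed to the fields this sketch reads.
LOG_LINES = [
    '{"timestamp":"2024-11-14T11:49:24.084Z","context":"behavior",'
    '"message":"Run Script Started",'
    '"details":{"page":"https://www.radiozamaneh.com/701411/"}}',
    '{"timestamp":"2024-11-14T11:50:40.531Z","context":"behavior",'
    '"message":"Run Script Finished",'
    '"details":{"page":"https://www.radiozamaneh.com/701411/"}}',
]


def parse_ts(value):
    """Parse the crawler's ISO timestamps (the trailing 'Z' means UTC)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def behavior_durations(lines):
    """Return {page: seconds between 'Run Script Started' and 'Finished'}."""
    started = {}
    durations = {}
    for line in lines:
        entry = json.loads(line)
        page = entry.get("details", {}).get("page")
        ts = parse_ts(entry["timestamp"])
        if entry["message"] == "Run Script Started":
            started[page] = ts
        elif entry["message"] == "Run Script Finished" and page in started:
            durations[page] = (ts - started.pop(page)).total_seconds()
    return durations


def estimate_remaining_hours(crawled, total, seconds_per_page):
    """Rough estimate: remaining pages times per-page cost, one worker."""
    return (total - crawled) * seconds_per_page / 3600


if __name__ == "__main__":
    durations = behavior_durations(LOG_LINES)
    for page, seconds in durations.items():
        print(f"{page}: {seconds:.3f} s in behaviors")
    # "crawled" and "total" come from the "Crawl statistics" entry above.
    per_page = next(iter(durations.values()))
    hours = estimate_remaining_hours(2368, 6360, per_page)
    print(f"Estimated time for the remaining pages: {hours:.1f} h")
```

With the figures from the "Crawl statistics" entry (2368 crawled out of 6360), this gives an estimate on the order of several days for the remaining pages, if the assumptions hold.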

It is important to note that this URL is not a page but a sub-resource of the current page. It can hence only be blocked with --blockRules, which is not yet exposed by the scraper (see openzim/zimit#433).
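For illustration only: block rules in Browsertrix Crawler match requests by URL regex, so a rule for this case would presumably need a pattern along the following lines. The pattern and the `is_blocked` helper are hypothetical and have not been tested against the crawler; the sketch only checks that such a regex matches the failing sub-resource without matching the pages being crawled.

```python
import re

# Hypothetical pattern: targets the WordPress JSON API on the .info host,
# which is where the timed-out comments request goes.
BLOCK_PATTERN = re.compile(r"^https?://www\.radiozamaneh\.info/wp-json/")


def is_blocked(url):
    """Return True if the URL would match the hypothetical block pattern."""
    return BLOCK_PATTERN.search(url) is not None


if __name__ == "__main__":
    urls = [
        "https://www.radiozamaneh.info/wp-json/wp/v2/comments",  # failing request
        "https://www.radiozamaneh.com/701411/",  # page being crawled
    ]
    for url in urls:
        print(url, "->", "blocked" if is_blocked(url) else "allowed")
```

Anchoring the pattern on the `.info` host keeps it from touching pages on `www.radiozamaneh.com`, which is the domain actually being crawled.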

@benoit74 benoit74 added the Bug Something isn't working label Nov 14, 2024
@benoit74
Contributor Author

Note: I cancelled the task and disabled the recipe for now.
