CheerioCrawler not working "allow running single crawler instance multiple times" #2637

distributev · 2024-08-26T22:47:17Z


Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/cheerio (CheerioCrawler)

Issue description

I believe the following is expected to work, but it does not:

allow running single crawler instance multiple times
#1844

If I call run() in a loop, the first iteration works fine, but every subsequent iteration finishes immediately without processing any requests and logs:

2024-08-26T22:00:05.502Z INFO CheerioCrawler: Starting the crawler.
2024-08-26T22:00:05.576Z INFO CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
2024-08-26T22:00:05.783Z INFO CheerioCrawler: Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":451}

Code sample

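// `crawler` and `urls` are created earlier (not shown in this report).
for (let i = 0; i < 100; i++) {
  console.time(`RUN (${i}) crawler.run`);

  // Reuse the same crawler instance on every iteration.
  await crawler.run(urls);

  // Wait one second between runs.
  await new Promise((resolve) => setTimeout(resolve, 1000));

  console.timeLog(`RUN (${i}) crawler.run`);
}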



### Package version

3.11.1

### Node.js version

20

### Operating system

_No response_

### Apify platform

- [X] Tick me if you encountered this issue on the Apify platform

### I have tested this on the `next` release

_No response_

### Other context

_No response_

B4nan (Maintainer) · 2024-08-27T07:07:05Z

As you can see in the PR you mentioned, this behavior is covered by tests, and those are still passing. My guess is that it is something specific to your setup, so please provide a complete reproduction.

Also note that this is generally a problem of reusing the same default storages. You can disable storage persistence for your crawler to get around it. Disabling persistence also lets you run crawlers in parallel, which is otherwise not possible. I suspect you are already doing that and left it out of the reproduction.

https://crawlee.dev/api/core/interface/ConfigurationOptions#persistStorage
