33 changes: 33 additions & 0 deletions docs/guides/avoid_blocking.mdx
@@ -4,6 +4,8 @@ title: Avoid getting blocked
description: How to avoid getting blocked when scraping
---

import ApiLink from '@site/src/components/ApiLink';

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';
@@ -18,6 +20,8 @@ A scraper might get blocked for numerous reasons. Let's narrow it down to the tw

A browser fingerprint is a collection of browser attributes and significant features that can reveal whether our browser is a bot or a real user. Moreover, most browsers have unique features that allow a website to track the browser even across different IP addresses. This is the main reason why scrapers should change browser fingerprints while doing browser-based scraping. Doing so should significantly reduce blocking.

The two are not handled separately. In Crawlee a <ApiLink to="core/class/Session">`Session`</ApiLink> ties an IP, a cookie jar, and a fingerprint together into one consistent identity, and the <ApiLink to="core/class/SessionPool">`SessionPool`</ApiLink> rotates those identities as a unit — so a fresh fingerprint always arrives with a fresh IP. This guide covers the fingerprint half; see the [session management guide](./session-management) for how to control the rotation, and the [proxy management guide](./proxy-management) for the IP half.
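
To make the "rotates as a unit" idea concrete, here is a minimal, framework-free sketch. It does not use Crawlee's actual `Session` or `SessionPool` classes: `createIdentity`, `IdentityPool`, and the proxy URLs are hypothetical names invented for illustration only.

```js
// Hypothetical sketch, not Crawlee's implementation: one identity bundles
// a proxy URL, a cookie jar, and a fingerprint hint.
function createIdentity(id, proxyUrl, fingerprint) {
    return { id, proxyUrl, cookies: new Map(), fingerprint, retired: false };
}

class IdentityPool {
    constructor(identities) {
        this.identities = identities;
        this.cursor = 0;
    }

    // Return the next usable identity, skipping retired ones.
    next() {
        const usable = this.identities.filter((identity) => !identity.retired);
        if (usable.length === 0) throw new Error('No usable identities left');
        const identity = usable[this.cursor % usable.length];
        this.cursor += 1;
        return identity;
    }

    // Retiring an identity drops its IP, cookies, and fingerprint together.
    retire(identity) {
        identity.retired = true;
    }
}

const pool = new IdentityPool([
    createIdentity('a', 'http://proxy-a.example:8000', { browser: 'firefox', platform: 'windows', device: 'desktop' }),
    createIdentity('b', 'http://proxy-b.example:8000', { browser: 'chrome', platform: 'linux', device: 'desktop' }),
]);

const first = pool.next();
pool.retire(first); // for example, after the site blocked this identity
const second = pool.next();
```

Because the proxy, cookies, and fingerprint live on the same object, retiring one identity can never leave a stale fingerprint paired with a new IP.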

## Using browser fingerprints

Changing browser fingerprints can be a tedious job. Luckily, Crawlee provides this feature with zero configuration: fingerprints are enabled by default in `PlaywrightCrawler` and `PuppeteerCrawler`. Whenever we build a scraper with one of these crawlers, fingerprints are generated for the default browser and operating system out of the box.
@@ -56,6 +60,35 @@ On the contrary, sometimes we want to entirely disable the usage of browser fing
</TabItem>
</Tabs>

## Fingerprints for HTTP crawlers

Every session carries a lightweight fingerprint hint — a `browser`, `platform`, and `device` triple — that the request's HTTP client receives and applies on a best-effort basis.
By default each session is given a realistic, randomized fingerprint (the host operating system as `platform`, with a plausible `browser`/`device` for it), and it rotates with the session just like the IP and cookies do.

How much of the hint is used depends on the client. The [`impit`](impit-http-client) HTTP client maps the session's `browser` hint to a matching TLS and HTTP impersonation profile,
so the connection's low-level signature lines up with the headers being sent.

The *same* hint also drives browser crawlers, where it seeds the generated browser fingerprint. The hint only fixes the broad strokes — the browser family, operating system, and device — so a session presents a coherent profile, but it does not make the two backends produce byte-identical fingerprints: `impit` and a real browser will still differ in the finer details (a slightly different user-agent string, for example).
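
As a rough illustration of what "best-effort" can mean, the sketch below resolves a hint to a profile only when the client supports the requested browser, and otherwise leaves the request unmodified. Everything here is an assumption made for the example: `SUPPORTED_BROWSERS`, `resolveProfile`, and the shape of the returned object are hypothetical and do not describe how `impit` or Crawlee actually implement the mapping.

```js
// Hypothetical sketch: the supported set and profile shape are assumptions,
// not the real impit or Crawlee API.
const SUPPORTED_BROWSERS = new Set(['chrome', 'firefox']);

function resolveProfile(hint) {
    // Best effort: with no hint or an unsupported browser, apply nothing.
    if (!hint || !SUPPORTED_BROWSERS.has(hint.browser)) {
        return undefined;
    }
    // Only the broad strokes are carried over: browser, platform, device.
    return { impersonate: hint.browser, platform: hint.platform, device: hint.device };
}

const firefoxProfile = resolveProfile({ browser: 'firefox', platform: 'windows', device: 'desktop' });
const unknownProfile = resolveProfile({ browser: 'netscape', platform: 'windows', device: 'desktop' });
```

Returning `undefined` instead of throwing keeps an unsupported hint from failing the request, which is the behavior "best-effort" suggests.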

You can pin the fingerprint explicitly through `sessionOptions` when you need a specific profile:

```js
import { CheerioCrawler, SessionPool } from 'crawlee';
import { ImpitHttpClient } from '@crawlee/impit-client';

const crawler = new CheerioCrawler({
    httpClient: new ImpitHttpClient(),
    sessionPool: new SessionPool({
        sessionOptions: {
            fingerprint: { browser: 'firefox', platform: 'windows', device: 'desktop' },
        },
    }),
    requestHandler: async ({ $ }) => {
        // requests impersonate desktop Firefox on Windows
    },
});
```

## Camoufox

For some protections, our integrated solutions are not enough; the Cloudflare challenge is one example. For such pages, you can try [Camoufox](https://camoufox.com/), a custom stealthy build of Firefox for web scraping. It might not get you through the challenge automatically, but with our `handleCloudflareChallengeHook` post-navigation hook, it should be able to successfully mimic the required user action and get you through it. The hook also reloads the page after the challenge clears and propagates the fresh response back into the crawling context.
2 changes: 0 additions & 2 deletions docs/guides/impit-http-client/impit-http-client.mdx
@@ -11,8 +11,6 @@ import CheerioCrawlerSource from '!!raw-loader!./cheerio-crawler.ts';
import HttpCrawlerSource from '!!raw-loader!./http-crawler.ts';
import AdvancedConfigSource from '!!raw-loader!./advanced-config.ts';

## Introduction

The `ImpitHttpClient` is an HTTP client implementation based on the [Impit](https://github.com/apify/impit) library. It enables browser impersonation for HTTP requests, helping you bypass bot detection systems without running an actual browser.

:::info Successor to got-scraping