feat: connect to remote browser services #3545

Draft
l2ysho wants to merge 41 commits into v4 from 1822-connect-to-remote-browser-services

l2ysho commented Mar 31, 2026
Contributor

This is still WIP, but I will leave comments on the places that are worth looking at.

l2ysho and others added 15 commits March 18, 2026 14:46
… support

# Task 1: Type Definitions & LaunchContext `isRemote` Flag

## Goal

Add the foundational types and the `isRemote` flag that all other remote browser tasks depend on.

## Dependencies

None — this is the foundation task.

## Scope

### 1. Add `isRemote` to `LaunchContext`

**File:** `packages/browser-pool/src/launch-context.ts`

- Add `isRemote?: boolean` to the `LaunchContextOptions` interface (alongside `id`, `browserPlugin`, etc.)
- Add a public readonly `isRemote: boolean` property to the `LaunchContext` class
- Set it from constructor options, defaulting to `false`
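
As a rough illustration of the three bullets above, the flag could be wired as follows. This is a minimal sketch with hypothetical, trimmed-down names (`LaunchContextOptionsSketch`, `LaunchContextSketch`); the real `LaunchContext` carries many more fields and generics.

```typescript
// Hypothetical, simplified shapes used only for illustration.
interface LaunchContextOptionsSketch {
    id?: string;
    isRemote?: boolean;
}

class LaunchContextSketch {
    readonly id?: string;
    readonly isRemote: boolean;

    constructor(options: LaunchContextOptionsSketch = {}) {
        this.id = options.id;
        // Local launches stay the default unless the caller opts in.
        this.isRemote = options.isRemote ?? false;
    }
}

const localContext = new LaunchContextSketch();
const remoteContext = new LaunchContextSketch({ isRemote: true });
```

Using `??` rather than `||` keeps the intent explicit: only a missing value falls back to `false`.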

### 2. Define connect option types on PlaywrightPlugin

**File:** `packages/browser-pool/src/playwright/playwright-plugin.ts`

Add the following type to the plugin file (or a co-located types file):

```typescript
// Mirrors browserType.connectOverCDP(endpointURL, options)
interface PlaywrightConnectOverCDPOptions {
    endpointURL: string;
    options?: Parameters<BrowserType['connectOverCDP']>[1];
}

// Mirrors browserType.connect(wsEndpoint, options)
interface PlaywrightConnectOptions {
    wsEndpoint: string;
    options?: Parameters<BrowserType['connect']>[1];
}
```

Use the existing `Parameters` utility type pattern (see how `SafeParameters` is used elsewhere in the codebase) — do NOT redefine Playwright's types manually.
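
To show what the `Parameters` extraction buys, here is a self-contained sketch against a stand-in interface. `FakeBrowserType` and `FakeConnectSettings` are assumptions made for the example and are not Playwright's real signatures.

```typescript
// Stand-in for a third-party API; not Playwright's actual types.
interface FakeConnectSettings {
    timeout?: number;
    headers?: Record<string, string>;
}

interface FakeBrowserType {
    connect(wsEndpoint: string, options?: FakeConnectSettings): Promise<void>;
}

// Index 1 selects the second parameter, so the alias follows upstream changes
// instead of duplicating the option shape by hand.
type ExtractedConnectSettings = Parameters<FakeBrowserType['connect']>[1];

interface ConnectOptionsSketch {
    wsEndpoint: string;
    options?: ExtractedConnectSettings;
}

const sampleConnectOptions: ConnectOptionsSketch = {
    wsEndpoint: 'ws://localhost:3000',
    options: { timeout: 5000 },
};
```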

### 3. Define connect option types on PuppeteerPlugin

**File:** `packages/browser-pool/src/puppeteer/puppeteer-plugin.ts`

```typescript
// Mirrors puppeteer.connect({ browserWSEndpoint, ...rest })
// Flat object matching Puppeteer's ConnectOptions
type PuppeteerConnectOverCDPOptions = Parameters<typeof puppeteer.connect>[0];
```

Use the `Parameters` pattern to extract the type from Puppeteer's `connect` method.

### 4. Add connect option fields to `BrowserPluginOptions`

**File:** `packages/browser-pool/src/abstract-classes/browser-plugin.ts`

This is a design choice — the PRD says connect options live on the plugin subclass, not on `LaunchContext`. Add the fields to the plugin options type so they flow through the constructor:

- `PlaywrightPlugin` options should accept `connectOptions?` and `connectOverCDPOptions?`
- `PuppeteerPlugin` options should accept `connectOverCDPOptions?`

These can be added to subclass-specific option types rather than the base `BrowserPluginOptions`.

### 5. Add connect option fields to launcher-level interfaces

**File:** `packages/playwright-crawler/src/internals/playwright-launcher.ts`

Add to `PlaywrightLaunchContext`:
```typescript
connectOptions?: PlaywrightConnectOptions;
connectOverCDPOptions?: PlaywrightConnectOverCDPOptions;
```

**File:** `packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts`

Add to `PuppeteerLaunchContext`:
```typescript
connectOverCDPOptions?: PuppeteerConnectOverCDPOptions;
```

This enables IDE autocomplete when users configure `launchContext` on the crawler.

### 6. Export new types

**File:** `packages/browser-pool/src/index.ts`

Export the new connect option types so they're available to consumers.

## Key Files

| File | Change |
|------|--------|
| `packages/browser-pool/src/launch-context.ts` | Add `isRemote` option + property |
| `packages/browser-pool/src/playwright/playwright-plugin.ts` | Add connect option types |
| `packages/browser-pool/src/puppeteer/puppeteer-plugin.ts` | Add connect option type |
| `packages/playwright-crawler/src/internals/playwright-launcher.ts` | Add connect options to `PlaywrightLaunchContext` |
| `packages/puppeteer-crawler/src/internals/puppeteer-launcher.ts` | Add connect options to `PuppeteerLaunchContext` |
| `packages/browser-pool/src/index.ts` | Export new types |
| `packages/browser-crawler/src/internals/browser-launcher.ts` | May need connect options on `BrowserLaunchContext` base |

## Acceptance Criteria

- [x] `LaunchContext` has `isRemote` boolean property, defaults to `false`
- [x] Connect option types are defined using library `Parameters` extraction (not manual redefinition)
- [x] `PlaywrightLaunchContext` shows `connectOptions` and `connectOverCDPOptions` in IDE autocomplete
- [x] `PuppeteerLaunchContext` shows `connectOverCDPOptions` in IDE autocomplete
- [x] New types are exported from `@crawlee/browser-pool`
- [x] TypeScript compiles with no errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…and `connectOverCDP()`

# Task 2: PlaywrightPlugin Remote Connection Routing

## Goal

Make `PlaywrightPlugin._launch()` branch to `connect()` or `connectOverCDP()` when remote connection options are present, instead of calling `launch()`.

## Dependencies

- Task 1 (types and `isRemote` flag)

## Scope

### 1. Store connect options on the plugin instance

**File:** `packages/browser-pool/src/playwright/playwright-plugin.ts`

- Accept `connectOptions` and `connectOverCDPOptions` in the constructor options
- Store them as instance properties
- **Validation:** If both `connectOptions` AND `connectOverCDPOptions` are provided, throw an error immediately in the constructor:
  ```
  Cannot set both 'connectOptions' and 'connectOverCDPOptions' — pick one protocol.
  ```
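
A minimal sketch of that fail-fast check, assuming simplified option shapes. The helper name `assertSingleProtocol` is hypothetical (the task places the check in the plugin constructor), and the error wording is paraphrased.

```typescript
// Simplified stand-ins for the plugin's connect option fields.
interface PluginConnectFields {
    connectOptions?: { wsEndpoint: string };
    connectOverCDPOptions?: { endpointURL: string };
}

// Hypothetical helper; in the PR this logic lives in the constructor.
function assertSingleProtocol(options: PluginConnectFields): void {
    if (options.connectOptions && options.connectOverCDPOptions) {
        throw new Error("Cannot set both 'connectOptions' and 'connectOverCDPOptions'; pick one protocol.");
    }
}

// A single protocol passes silently.
assertSingleProtocol({ connectOptions: { wsEndpoint: 'ws://localhost:3000' } });
```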

### 2. Branch in `_launch()` for remote connections

**File:** `packages/browser-pool/src/playwright/playwright-plugin.ts`

In the existing `_launch()` method (currently lines 22-102), add branching logic **before** the existing local launch code:

```typescript
protected async _launch(launchContext: LaunchContext<...>): Promise<Browser> {
    // Remote CDP connection
    if (this.connectOverCDPOptions) {
        const { endpointURL, options } = this.connectOverCDPOptions;
        const browser = await browserType.connectOverCDP(endpointURL, options);
        return browser;
    }

    // Remote Playwright WebSocket connection
    if (this.connectOptions) {
        const { wsEndpoint, options } = this.connectOptions;
        const browser = await browserType.connect(wsEndpoint, options);
        return browser;
    }

    // Existing local launch logic...
}
```

**Reference:** See `StagehandPlugin._launch()` at `packages/stagehand-crawler/src/internals/stagehand-plugin.ts:102-107` for the CDP connection pattern:
```typescript
const cdpUrl = await stagehand.connectURL();
const browser = await chromium.connectOverCDP(cdpUrl);
```

### 3. Set `isRemote` on LaunchContext

**File:** `packages/browser-pool/src/playwright/playwright-plugin.ts`

In `createLaunchContext()` (or wherever the plugin creates the LaunchContext), pass `isRemote: true` when connect options are present. This can be done by overriding `createLaunchContext()` in the subclass, or by passing it through the options.

Check how the base `BrowserPlugin.createLaunchContext()` works (at `packages/browser-pool/src/abstract-classes/browser-plugin.ts:149-174`) and determine the best insertion point.

## Key Design Decisions

- **No new abstract method:** The routing happens inside `_launch()` via internal branching, not a new `_connect()` method. This keeps the abstract interface unchanged and doesn't affect custom plugins like StagehandPlugin.
- **`browser.close()` for cleanup:** Remote browsers are closed the same way as local browsers — via `browser.close()`. No special disconnect handling.
- **No proxy server setup for remote:** The remote branch skips the local proxy server setup that exists in the current `_launch()` code.

## Key Files

| File | Change |
|------|--------|
| `packages/browser-pool/src/playwright/playwright-plugin.ts` | Constructor stores options, `_launch()` branches for remote |

## Acceptance Criteria

- [x] `PlaywrightPlugin` accepts `connectOptions` in constructor and calls `browserType.connect()` with `wsEndpoint` and `options`
- [x] `PlaywrightPlugin` accepts `connectOverCDPOptions` in constructor and calls `browserType.connectOverCDP()` with `endpointURL` and `options`
- [x] Setting both `connectOptions` and `connectOverCDPOptions` throws an error
- [x] `launchContext.isRemote` is `true` when connect options are present
- [x] Remote branch skips local proxy server setup and persistent context logic
- [x] TypeScript compiles with no errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nnect()`

# Task 3: PuppeteerPlugin Remote Connection Routing

## Goal

Make `PuppeteerPlugin._launch()` branch to `puppeteer.connect()` when remote connection options (CDP) are present, instead of calling `puppeteer.launch()`.

## Dependencies

- Task 1 (types and `isRemote` flag)

## Scope

### 1. Store connect options on the plugin instance

**File:** `packages/browser-pool/src/puppeteer/puppeteer-plugin.ts`

- Accept `connectOverCDPOptions` in the constructor options
- Store as an instance property
- Puppeteer only supports CDP — there is no `connectOptions` field (Playwright-only)

### 2. Branch in `_launch()` for remote connections

**File:** `packages/browser-pool/src/puppeteer/puppeteer-plugin.ts`

In the existing `_launch()` method (currently lines 22-203), add branching logic **before** the existing local launch code:

```typescript
protected async _launch(launchContext: LaunchContext<...>): Promise<Browser> {
    // Remote CDP connection
    if (this.connectOverCDPOptions) {
        const browser = await puppeteer.connect(this.connectOverCDPOptions);
        // Wrap with the same Proxy handler for newPage() interception
        // (see existing code at lines 138-200)
        return wrappedBrowser;
    }

    // Existing local launch logic...
}
```

**Important:** Puppeteer's `connect()` takes a flat options object: `puppeteer.connect({ browserWSEndpoint, ...rest })`. This is different from Playwright's two-argument pattern. The type should match Puppeteer's `ConnectOptions`.

### 3. Handle the `newPage()` Proxy wrapper for remote

The existing `_launch()` wraps the browser in a `Proxy` that intercepts `newPage()` calls to support `useIncognitoPages` (lines 138-200). This proxy wrapper should also be applied to remote browsers so that incognito context creation works correctly.
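
The interception idea can be sketched independently of Puppeteer. Everything below (`FakeBrowser`, `FakeContext`, `wrapWithIncognitoPages`) is a stand-in written for illustration and is not the plugin's actual code.

```typescript
// Minimal stand-ins for the browser objects; not Puppeteer's real types.
interface FakePage { contextId: string }
interface FakeContext { newPage(): Promise<FakePage> }
interface FakeBrowser {
    newPage(): Promise<FakePage>;
    createBrowserContext(): Promise<FakeContext>;
    version(): string;
}

function wrapWithIncognitoPages(browser: FakeBrowser, useIncognitoPages: boolean): FakeBrowser {
    return new Proxy(browser, {
        get(target, property, receiver) {
            if (property === 'newPage' && useIncognitoPages) {
                // Route every new page through a fresh browser context.
                return async () => {
                    const context = await target.createBrowserContext();
                    return context.newPage();
                };
            }
            // Everything else passes through to the underlying browser.
            const value = Reflect.get(target, property, receiver);
            return typeof value === 'function' ? value.bind(target) : value;
        },
    });
}

let contextCounter = 0;
const fakeBrowser: FakeBrowser = {
    newPage: async () => ({ contextId: 'default' }),
    createBrowserContext: async () => {
        const contextId = `incognito-${++contextCounter}`;
        return { newPage: async () => ({ contextId }) };
    },
    version: () => 'fake/1.0',
};
```

Because the wrapper only depends on the browser object, the same function can be applied whether the browser came from a local launch or a remote connection.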

### 4. Set `isRemote` on LaunchContext

Same pattern as Task 2 — pass `isRemote: true` when `connectOverCDPOptions` is present.

## Key Design Decisions

- **Flat options object:** Puppeteer's `connect()` API takes a single options object (not `endpointURL, options` like Playwright). The `connectOverCDPOptions` type matches this flat shape directly.
- **`browser.close()` for cleanup:** Same as Playwright — remote browsers closed via `browser.close()`, not `browser.disconnect()`.
- **`newPage()` proxy still needed:** The Proxy wrapper that intercepts `newPage()` to create incognito contexts must still wrap remote browsers.

## Key Files

| File | Change |
|------|--------|
| `packages/browser-pool/src/puppeteer/puppeteer-plugin.ts` | Constructor stores options, `_launch()` branches for remote |

## Acceptance Criteria

- [x] `PuppeteerPlugin` accepts `connectOverCDPOptions` in constructor and calls `puppeteer.connect()` with the options object
- [x] The `newPage()` Proxy wrapper is applied to remote browsers (for incognito support)
- [x] `launchContext.isRemote` is `true` when connect options are present
- [x] Remote branch skips user data directory setup, headless handling, and other local-only logic
- [x] TypeScript compiles with no errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nection logging

## Goal

Make `BrowserPlugin.launch()` skip proxy injection and webdriver hiding when `launchContext.isRemote` is `true`, since these operations modify `launchOptions` which are not used for remote connections.

## Dependencies

- Task 1 (`isRemote` flag on LaunchContext)

## Scope

### 1. Skip `_addProxyToLaunchOptions()` for remote

**File:** `packages/browser-pool/src/abstract-classes/browser-plugin.ts`

In the `launch()` method, the call to `_addProxyToLaunchOptions()` is now gated on `!isRemote`:

```typescript
if (launchContext.proxyUrl && !launchContext.isRemote) {
    await this._addProxyToLaunchOptions(launchContext);
}
```

### 2. Skip `_mergeArgsToHideWebdriver()` for remote

```typescript
if (!launchContext.isRemote && this._isChromiumBasedBrowser(launchContext)) {
    this._mergeArgsToHideWebdriver(launchContext);
}
```

### 3. No changes to `_addProxyToLaunchOptions()` or `_mergeArgsToHideWebdriver()` themselves

The methods remain unchanged — the skip logic lives in the calling `launch()` method.

## Key Design Decisions

- **Skip at call site, not in the methods**
- **`proxyUrl` + remote triggers a warning:** Handled in Task 6 (Warnings)
- **Fingerprinting hooks are unchanged**

## Additional

- Fixed `isRemote` not being passed through base class `createLaunchContext()`
- Added info-level logs for remote connections in base class and both plugins

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ht overloads

Playwright: change PlaywrightConnectOverCDPOptions and PlaywrightConnectOptions
from type aliases (all-optional fields) to interfaces with required `wsEndpoint`.
Use the non-deprecated two-argument overloads in _launch().

Puppeteer: add runtime guard that throws if neither `browserWSEndpoint` nor
`browserURL` is provided in connectOverCDPOptions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tions

# Task 5: `useIncognitoPages` Defaults to `true` for Remote

## Goal

When remote connection options are present and `useIncognitoPages` was not explicitly set by the user, default it to `true` and log an info message. If the user explicitly sets `false`, log a warning.

## Dependencies

- Task 2 (PlaywrightPlugin stores connect options)
- Task 3 (PuppeteerPlugin stores connect options)

## Scope

### 1. Preserve `undefined` vs `false` in base constructor

The base `BrowserPlugin` constructor currently collapses an unset `useIncognitoPages` to `false`, which loses the "not set" signal. The subclass therefore checks `options.useIncognitoPages` directly (which preserves `undefined`) and overrides the value after `super()`.

### 2. Override default in PlaywrightPlugin constructor

After the `super()` call, if connect options are present:

- `undefined` → set to `true`, info log
- `false` → warning log
- `true` → no extra log
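
The three cases above amount to a small decision function. The sketch below is an assumption-laden illustration: `resolveUseIncognitoPages` and its return shape are hypothetical, and the log messages are placeholders rather than the PR's wording.

```typescript
type IncognitoLogLevel = 'info' | 'warning';

interface ResolvedIncognito {
    useIncognitoPages: boolean;
    log?: { level: IncognitoLogLevel; message: string };
}

// `requested` is the raw user option, read before any default is applied.
function resolveUseIncognitoPages(requested: boolean | undefined, isRemote: boolean): ResolvedIncognito {
    if (!isRemote) {
        // Existing local behavior: an unset option means false.
        return { useIncognitoPages: requested ?? false };
    }
    if (requested === undefined) {
        return {
            useIncognitoPages: true,
            log: { level: 'info', message: 'Remote connection: defaulting useIncognitoPages to true.' },
        };
    }
    if (requested === false) {
        return {
            useIncognitoPages: false,
            log: { level: 'warning', message: 'useIncognitoPages is false on a remote connection.' },
        };
    }
    return { useIncognitoPages: true };
}
```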

### 3. Override default in PuppeteerPlugin constructor

Same logic, checking `connectOverCDPOptions`.

## Key Design Decisions

- **Info vs warning:** Defaulting to `true` is an info message (expected behavior). Explicit `false` is a warning (user should understand implications).
- **`useIncognitoPages: false` + `connect()` is not special-cased:** The warning covers this case — no additional error or fallback.
- **Uses existing `this.log`:** All logging uses the inherited `BrowserPlugin.log` logger.

## Acceptance Criteria

- [x] When `connectOptions` or `connectOverCDPOptions` is set and `useIncognitoPages` is not provided → defaults to `true`, info message logged
- [x] When `connectOptions` or `connectOverCDPOptions` is set and `useIncognitoPages: false` → stays `false`, warning logged
- [x] When `connectOptions` or `connectOverCDPOptions` is set and `useIncognitoPages: true` → stays `true`, no extra log
- [x] When no connect options are set → existing behavior unchanged
- [x] Base constructor preserves `undefined` vs `false` distinction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename PlaywrightConnectOverCDPOptions.wsEndpoint → endpointURL to match
  Playwright's own terminology and avoid field conflict with inherited
  ConnectOverCDPOptions.endpointURL
- Wrap connectOverCDP() and connect() failures with BrowserLaunchError
  including sanitized endpoint URL (credentials stripped) and actionable
  guidance
- Move endpoint validation to constructors (fail fast) — Playwright validates
  endpointURL and wsEndpoint are non-empty, Puppeteer validates
  browserWSEndpoint || browserURL
- Add _sanitizeEndpointForLog() to both plugins to strip credentials from
  URLs before including them in error messages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ions

- Close BrowserContext on page close when useIncognitoPages is true.
  Previously contexts were only cleaned up when an anonymized proxy was
  active, causing context accumulation on remote browsers without proxy.
- Clean up targetcreated listener on remote browser disconnect via
  browser.once('disconnected') handler to prevent listener leaks.
- Guard anonymizeProxySugar call with proxyUrl check — skip the async
  call entirely when no proxy is configured (common for remote browsers).
- Conditionally omit proxyServer from context options when no proxy is
  set, instead of passing { proxyServer: undefined }.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ket connections

- Add comments in both plugin constructors explaining why
  options.useIncognitoPages is checked instead of this.useIncognitoPages
  (super() collapses undefined to false, losing the "not set" signal).
- Strengthen warning for Playwright connectOptions (WebSocket) +
  useIncognitoPages: false — connect() returns a browser with no default
  context, which is more severe than just sharing cookies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove spurious launchOptions warning that always fired due to
framework-injected defaults, and share log instances in launchers.

PRD Task 6: Warnings for Ignored & Conflicting Options
- proxyUrl + remote → warning in base BrowserPlugin.launch()
- useChrome + remote → warning in launcher constructors
- executablePath + remote → warning in launcher constructors
- useIncognitoPages: false + remote → handled by Task 5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PRD Task 7: Unit Tests
- Connection routing (Playwright CDP/WS/local, Puppeteer CDP/local)
- Validation (mutual exclusion, missing endpoints)
- isRemote correctness for all plugin variants
- Proxy/webdriver skipping for remote, applied for local
- useIncognitoPages defaults (true for remote, false for local)
- Warnings (proxyUrl, useIncognitoPages: false, CDP vs WS variants)
- 40 tests, all mocked (no real browser instances)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…texts

When useIncognitoPages is true (default for remote) and proxyUrl is set,
the newPage handler was passing proxyServer to createBrowserContext even
for remote connections. For credentialed proxies this also spun up a
localhost tunnel unreachable by the remote browser.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Examples for Browserbase, Browserless, Rebrowser, and Steel using Playwright and Puppeteer.
l2ysho changed the title from "1822 connect to remote browser services" to "feat: connect to remote browser services" on Apr 21, 2026
l2ysho added 3 commits April 27, 2026 11:33
…oteBrowser config

Add a unified API for connecting crawlers to remote browser services
  (Browserbase, Browserless, Steel, Rebrowser). Users can either pass a
  RemoteBrowserConfig object or extend RemoteBrowserProvider with typed
  connect()/release() lifecycle methods.

  - Add RemoteBrowserProvider abstract class with generic TContext
  - Add RemoteBrowserConfig interface (endpoint + release + type)
  - Wire remoteBrowser through BrowserPlugin, PlaywrightPlugin, PuppeteerPlugin
  - Auto-call release() on browser close/crash/pool destroy
  - Skip fingerprinting, proxy injection, and webdriver stealth for remote browsers
  - Skip session-based browser retirement for remote browsers (isRemote guard)
  - Default useIncognitoPages to true for remote connections
  - Add 30+ unit tests for both config and provider patterns
  - Update all temp-examples to use RemoteBrowserProvider
… overflow

Remote browser services enforce concurrent session limits. During browser
  retirement transitions, the pool could briefly exceed the limit by launching
  a new browser before the retired one fully closed.

  - Add maxOpenBrowsers to RemoteBrowserConfig and RemoteBrowserProvider
  - BrowserCrawler reads it from the plugin and applies it to the pool
  - Gate new tasks via _isTaskReadyFunction (same pattern as maxConcurrency)
  - Add hasFreeBrowserSlot() and hasActiveBrowserWithFreeCapacity() to BrowserPool
  - Only activates when maxOpenBrowsers is set (remote browsers); local browsers unaffected
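
The slot check can be pictured with a small counter. This is only a sketch of the idea: `BrowserSlotTracker`, `browserOpened()` and `browserClosed()` are hypothetical names, and the real pool also consults `hasActiveBrowserWithFreeCapacity()`, which is not modeled here.

```typescript
class BrowserSlotTracker {
    private openBrowsers = 0;
    private readonly maxOpenBrowsers: number | undefined;

    constructor(maxOpenBrowsers?: number) {
        this.maxOpenBrowsers = maxOpenBrowsers;
    }

    hasFreeBrowserSlot(): boolean {
        // No limit configured (the local case): never block new tasks.
        return this.maxOpenBrowsers === undefined || this.openBrowsers < this.maxOpenBrowsers;
    }

    browserOpened(): void {
        this.openBrowsers += 1;
    }

    browserClosed(): void {
        this.openBrowsers = Math.max(0, this.openBrowsers - 1);
    }
}

// Hypothetical readiness gate in the spirit of `_isTaskReadyFunction`.
function isTaskReady(tracker: BrowserSlotTracker): boolean {
    return tracker.hasFreeBrowserSlot();
}
```

A retired browser keeps occupying its slot until `browserClosed()` runs, which is what prevents the brief overflow described above.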
Comment thread packages/browser-crawler/src/internals/browser-crawler.ts Outdated
* }
* ```
*/
export abstract class RemoteBrowserProvider<TContext extends Record<string, unknown> = Record<string, unknown>> {

Contributor Author

Browserless is a good example of a minimal implementation:

temp-examples/examples/browserless-playwright.ts

```typescript
const endpointUrl = `wss://production-sfo.browserless.io?token=${token}`;

class BrowserlessProvider extends RemoteBrowserProvider {
    async connect() {
        return { url: endpointUrl };
    }
}
```

Contributor Author

On the other hand, Browserbase needs a full implementation:

temp-examples/examples/browserbase-playwright.ts

```typescript
class BrowserbaseWsProvider extends RemoteBrowserProvider<{ id: string }> {
    type = 'websocket' as const;
    maxOpenBrowsers = 2;

    // Create a session through the Browserbase API and return its endpoint.
    async connect() {
        const response = await fetch('https://api.browserbase.com/v1/sessions', {
            method: 'POST',
            headers: { 'x-bb-api-key': apiKey, 'Content-Type': 'application/json' },
            body: JSON.stringify({ projectId }),
        });

        if (!response.ok) {
            throw new Error(`Failed to create Browserbase session: ${response.status} ${response.statusText}`);
        }

        const session = await response.json();

        const url = `wss://connect.browserbase.com?apiKey=${apiKey}&sessionId=${session.id}`;
        return { url, context: { id: session.id } };
    }

    // Ask Browserbase to release the session created in connect().
    async release({ id }: { id: string }) {
        await fetch(`https://api.browserbase.com/v1/sessions/${id}`, {
            method: 'POST',
            headers: { 'x-bb-api-key': apiKey, 'Content-Type': 'application/json' },
            body: JSON.stringify({ status: 'REQUEST_RELEASE' }),
        });
    }
}
```

Member

Should we have a separate package (`@crawlee/remote` or similar) with these implementations for major remote browser providers?

So our users can do

```typescript
import { BrowserbaseBackend } from '@crawlee/remote';
import { PlaywrightCrawler } from '@crawlee/playwright';

new PlaywrightCrawler({
    launchContext: {
        remoteBrowser: new BrowserbaseBackend({ url: '', token: '' /* ... */ }),
    },
});
```

IMO this could be really useful. Definitely not necessary to add in this PR, though; it looks pretty self-contained.

Contributor Author

Somewhere in the process we dropped this idea, but I am open to creating an issue. In general (at least for the services I have worked with), the setup is similar from service to service. Maybe `class BrowserbaseWsProvider extends RemoteBrowserProvider` is also too much boilerplate for what it really does.

*
* @param _context The same `context` object returned by {@link connect}.
*/
async release(_context: TContext): Promise<void> {}

Contributor Author

I am aware that this is not the best name for it

l2ysho added 4 commits April 30, 2026 15:43
… cookie sharing

Remote CDP browsers (both Puppeteer and Playwright) now default useIncognitoPages
  to false, matching local behavior. For Playwright CDP, the browser's default context
  is wrapped in PlaywrightBrowserWithPersistentContext so pages share cookies — the same
  mechanism used locally with launchPersistentContext().

  Playwright WebSocket still defaults to true since connect() returns a browser with
  no default context to wrap.

  The wrapper passes the real Browser as parentBrowser so close() also closes the
  WebSocket transport and disconnected events are forwarded to the pool.
Comment on lines +87 to +91
if (isWebSocket) {
    this.useIncognitoPages = true;
    this.log.info(
        'Remote Playwright WebSocket connection detected — defaulting useIncognitoPages to true.',
    );

Contributor Author

A traditional WS connection does not return a context (as CDP does), so there is no way to pass this context to another browser instance.

Comment on lines +257 to +264
* When useIncognitoPages is false and we have a CDP-connected browser,
* wrap its default context in PlaywrightBrowser so that all pages share
* a single context (matching local persistent-context behavior).
*
* Playwright's browser.newPage() always creates a new context, so without
* this wrapper, pages would never share cookies even with useIncognitoPages: false.
*/
private _maybeWrapWithSharedContext(

Contributor Author

Locally this is handled by `launchPersistentContext()`, but here we have to grab the context from the remote browser.

The proxyUrl from Crawlee's ProxyConfiguration is now passed to
  RemoteBrowserProvider.connect({ proxyUrl }) and RemoteBrowserConfig.endpoint({ proxyUrl }),
  letting provider implementations forward it to the remote service's proxy API
  (e.g. Browserless externalProxyServer, Browserbase external proxies).

  Also adds a userDataDir warning for remote connections, matching the existing
  proxyUrl warning pattern.
@janbuchar

Contributor

I guess it shouldn't be a problem, but can you wire up https://gologin.com/docs/api-reference/sdks/nodejs-sdk into crawlee with this?

barjin left a comment

Member

Thank you @l2ysho! (and sorry for putting the review away for this long 😅 )

The implementation seems mostly solid to me. During testing, I came across a few rough edges I'd like to discuss ⬇️

Comment on lines +169 to +171
const connectOptionsPresent = !!(launchContext.connectOptions || launchContext.connectOverCDPOptions);

if (connectOptionsPresent && (launchContext.useChrome || launchContext.launchOptions?.executablePath)) {

Member

Perhaps we should enforce mutual exclusivity between launchOptions and connectOptions on the type level?

As a user, I don't know if PlaywrightCrawler will launch a local dark-theme browser, or will connect to the remote one (or a combination of these)?

Member

Btw, it seems that different launchOptions are ignored when using connectOptions and connectOverCDPOptions (and some are ignored in both).

Maybe I'm being too rough, but I'd just enforce having at most one of launchOptions | connectOptions | connectOverCDPOptions for now. We can always relax this if the need ever arises.
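
One way the "at most one of" constraint could be expressed at the type level is a union whose members mark the other keys as `never`. The sketch below uses simplified, hypothetical option shapes (`LaunchOptsSketch`, `ConnectOptsSketch`, `CdpOptsSketch`), not the real Crawlee types.

```typescript
// Simplified stand-ins for the three option bags.
interface LaunchOptsSketch { headless?: boolean }
interface ConnectOptsSketch { wsEndpoint: string }
interface CdpOptsSketch { endpointURL: string }

// Each member allows exactly one source and forbids the other two keys.
type BrowserSourceSketch =
    | { launchOptions?: LaunchOptsSketch; connectOptions?: never; connectOverCDPOptions?: never }
    | { launchOptions?: never; connectOptions: ConnectOptsSketch; connectOverCDPOptions?: never }
    | { launchOptions?: never; connectOptions?: never; connectOverCDPOptions: CdpOptsSketch };

function describeSource(source: BrowserSourceSketch): 'local' | 'playwright' | 'cdp' {
    if (source.connectOverCDPOptions) return 'cdp';
    if (source.connectOptions) return 'playwright';
    return 'local';
}

const localOnly: BrowserSourceSketch = { launchOptions: { headless: true } };
const cdpOnly: BrowserSourceSketch = { connectOverCDPOptions: { endpointURL: 'http://localhost:9222' } };

// Rejected by the compiler, because no union member allows both keys:
// const mixed: BrowserSourceSketch = { launchOptions: {}, connectOptions: { wsEndpoint: 'ws://localhost:3000' } };
```

A runtime check would still be needed for plain JavaScript callers, since the union only protects TypeScript users.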

Contributor Author

True, I am afraid there will be more funny stuff after my last 6 rewrites of this :] Let's throw when `launchOptions` is used with remote, and make `connectOptions` | `connectOverCDPOptions` | `remoteBrowser` mutually exclusive (also throwing).

Comment on lines +115 to +120
/**
* Connection type to use. `'cdp'` uses `browserType.connectOverCDP()`,
* `'websocket'` uses `browserType.connect()`.
* @default 'cdp'
*/
type?: 'cdp' | 'websocket';

Member

"Options for connecting to a remote browser via WebSocket."

Both Playwright's internal "client-server" protocol (no public name that I know of) and CDP are based on WebSocket (both are higher-level protocols).

Perhaps we can make this into `type?: 'cdp' | 'playwright'`?

Contributor Author

I guess I will drop that `websocket` option completely; it is specific to Playwright (and obsolete?). cc @janbuchar

Member

Au contraire - according to the docs, the Playwright-specific protocol offers more detailed control over the browser. Especially since this is inside playwright-launcher.ts, we should imo keep it.

Member

nvm, I see the commit message now:

"Callers who genuinely need Playwright's connect() can still use connectOptions directly."

Sounds reasonable 👍

* }
* ```
*/
export interface RemoteBrowserConfig {

Member

Is there something we can express with RemoteBrowserConfig, that we cannot do with the other interfaces?

The RemoteBrowserConfig interface is IIUC halfway between the connectOptions object (simple) and the RemoteBrowserProvider class (for advanced use cases)... and I'm not convinced we need all three.

edit: perhaps we can keep using the ...Config internally, but export only ...Provider?

Comment on lines +251 to +261
protected _sanitizeEndpointForLog(endpoint: string): string {
    try {
        const url = new URL(endpoint);
        if (url.username || url.password) {
            url.username = '***';
            url.password = '***';
        }
        return url.toString();
    } catch {
        return '<invalid URL>';
    }

Member

Some providers (e.g., Browserless) authenticate using a token in the query params (docs), so we'd leak the secret anyway.

It feels like this should be the logger's responsibility, and we shouldn't deal with this here. How about, e.g., logging just the hostname in the error messages?
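
A hostname-only variant along those lines might look like the sketch below. The helper name `endpointHostnameForLog` is hypothetical; it relies only on the standard `URL` class.

```typescript
// Hypothetical helper: keep only the hostname so that credentials, ports,
// paths and query-string tokens never reach the logs.
function endpointHostnameForLog(endpoint: string): string {
    try {
        return new URL(endpoint).hostname;
    } catch {
        return '<invalid URL>';
    }
}

console.log(endpointHostnameForLog('wss://production-sfo.browserless.io?token=secret'));
```

The trade-off is that two sessions on the same host become indistinguishable in the logs, which may or may not matter for debugging.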

this.log.warning(
    'Both remoteBrowser and connectOverCDPOptions/connectOptions are set. ' +
        'remoteBrowser is ignored when explicit connect options are provided.',
);

Member

So... which one is right? 😄

* Takes precedence over `connectOverCDPOptions` / `connectOptions` if both are set.

It seems that the JSDoc comment there is wrong and this warning message is right; see the example:

import { PlaywrightCrawler } from './packages/playwright-crawler/src/index.js';
import { chromium, firefox } from 'playwright';

const firefoxServer = await firefox.launchServer();

const crawler = new PlaywrightCrawler({
    launchContext: {
        remoteBrowser: {
            connect: async () => {
                const chromiumServer = await chromium.launchServer();
                return { url: chromiumServer.wsEndpoint() };
            },
            type: 'websocket',
            release: async () => {}
        },
        connectOptions: {
            wsEndpoint: firefoxServer.wsEndpoint()
        }
    },
    requestHandler: async ({ page, log }) => {
        log.info(await page.evaluate(() => navigator.userAgent));
    },
});

await crawler.run(['https://crawlee.dev/js']);

This prints Firefox's user agent, i.e., remoteBrowser is dropped and connectOptions prevails.

One more reason to enforce the mutual exclusivity on the type level 😄
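One way the type-level exclusivity might be expressed is a union in which each branch marks the other options as `never`. This is a sketch under assumptions: `RemoteBrowserLike` and `ConnectionMode` are hypothetical names, and the option shapes are simplified from the ones discussed in this PR.

```typescript
// Simplified stand-in for the remoteBrowser config shape used in the example above.
interface RemoteBrowserLike {
    connect: () => Promise<{ url: string }>;
    release?: () => Promise<void>;
}

// Each branch allows exactly one connection mode; the last branch allows none (local launch).
type ConnectionMode =
    | { remoteBrowser: RemoteBrowserLike; connectOptions?: never; connectOverCDPOptions?: never }
    | { remoteBrowser?: never; connectOptions: { wsEndpoint: string }; connectOverCDPOptions?: never }
    | { remoteBrowser?: never; connectOptions?: never; connectOverCDPOptions: { endpointURL: string } }
    | { remoteBrowser?: never; connectOptions?: never; connectOverCDPOptions?: never };

// Accepted: a single mode.
const single: ConnectionMode = { connectOptions: { wsEndpoint: 'ws://localhost:3000' } };

// Rejected by the compiler: two modes at once.
// @ts-expect-error remoteBrowser and connectOptions are mutually exclusive
const both: ConnectionMode = { remoteBrowser: { connect: async () => ({ url: 'ws://localhost:3001' }) }, connectOptions: { wsEndpoint: 'ws://localhost:3000' } };
```

Plain JavaScript callers bypass this check, so a runtime guard would still be needed alongside it.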

Member

Please let's also add a docs article about how to set up the remote browser-enabled crawler (short page with examples, main limitations, etc. would imo suffice for now).

Contributor Author

Done; the temp-examples are also dropped.

l2ysho added 15 commits May 27, 2026 13:11
Drop the _maybeWrapWithSharedContext workaround that emulated
launchPersistentContext semantics over connectOverCDP. Remote
connections now always run with useIncognitoPages: true; explicit
false is overridden with a warning pointing users at SessionPool
for cross-request state sharing.

Also remove the now-unused parentBrowser plumbing from
PlaywrightBrowser, which only existed to keep the underlying
CDP Browser alive while the wrapper was active.
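The force-incognito rule from the commit message above can be sketched as a small pure function. The name `resolveUseIncognitoPages`, the warning text, and the assumed local default of `false` are illustrative, not the code in this PR.

```typescript
// Hypothetical helper: decide the effective useIncognitoPages value.
function resolveUseIncognitoPages(
    isRemote: boolean,
    requested: boolean | undefined,
    warn: (message: string) => void,
): boolean {
    // Local launches keep the user's choice (assumed default: false).
    if (!isRemote) return requested ?? false;

    // Remote connections always use incognito pages; an explicit false is overridden.
    if (requested === false) {
        warn('useIncognitoPages: false is ignored for remote browsers. Use SessionPool to share state across requests.');
    }
    return true;
}

// Example: an explicit false on a remote connection yields true plus one warning.
const warnings: string[] = [];
const effective = resolveUseIncognitoPages(true, false, (message) => warnings.push(message));
console.log(effective, warnings.length);
```
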
Adds a vitest-based integration suite under test/integration that
exercises Crawlee end-to-end against a real Browserless instance.
The first test verifies the force-incognito behavior for remote
Playwright CDP connections: two requests landing on the same browser
do not share cookies even when retireBrowserAfterPageCount is high
and saveResponseCookies is disabled.

Gated on CRAWLEE_DIFFICULT_TESTS so `pnpm test` skips the suite by
default — `pnpm test:integration` and `pnpm test:full` set the flag.
The suite expects Browserless and httpbin running on a shared Docker
network; `pnpm test:integration:services:up` spins them up locally,
and a new GitHub Actions workflow provides them as service containers.

Also sets core-js-pure: false in pnpm-workspace.yaml allowBuilds to
match prior skip-by-default behavior under pnpm 11.

Replace the warn-and-silently-drop path with a constructor throw in both
PlaywrightPlugin and PuppeteerPlugin when more than one of remoteBrowser,
connectOptions, or connectOverCDPOptions is set. Fixes the doc/impl
mismatch where JSDoc claimed remoteBrowser "Takes precedence" but the
implementation actually dropped it.
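A rough sketch of the kind of constructor-time check this commit message describes. The function name `assertSingleConnectionMode` and the error wording are assumptions; only the three option names come from the PR.

```typescript
// Option names taken from the PR; values are left opaque for this sketch.
interface ConnectionOptions {
    remoteBrowser?: unknown;
    connectOptions?: unknown;
    connectOverCDPOptions?: unknown;
}

const CONNECTION_KEYS = ['remoteBrowser', 'connectOptions', 'connectOverCDPOptions'] as const;

// Hypothetical guard: throw as soon as more than one connection mode is configured.
function assertSingleConnectionMode(options: ConnectionOptions): void {
    const provided = CONNECTION_KEYS.filter((key) => options[key] !== undefined);
    if (provided.length > 1) {
        throw new Error(`Only one of ${CONNECTION_KEYS.join(', ')} may be set, but got: ${provided.join(', ')}.`);
    }
}
```

Throwing instead of warning means a misconfiguration such as the Firefox/Chromium example earlier in this thread fails at construction time rather than silently connecting to the wrong browser.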
PlaywrightLauncher and PuppeteerLauncher now throw if launchOptions is
set alongside connectOptions, connectOverCDPOptions, or remoteBrowser.
The launcher is the right layer for this check — at the plugin level the
launcher always injects defaults (executablePath) into launchOptions, so
the plugin cannot distinguish user-set from framework-default. Removes
the now-unreachable executablePath warning and consolidates the useChrome
warning behind the unified hasRemote flag.

CDP is also a WebSocket protocol, so 'websocket' was a misleading label.
Rename to 'playwright', which names the actual transport (Playwright's
client-server protocol exposed via browserType.connect()). Updated:
RemoteBrowserConfig.type, PlaywrightRemoteBrowserConfig.type,
RemoteBrowserProvider.type, the playwright-plugin branch, the puppeteer
"not supported" error message, the connect log line, and all tests.

The RemoteBrowserConfig / RemoteBrowserProvider abstraction is built for
remote browser services (Browserless, Browserbase, Steel), which all
speak CDP. The 'websocket'/'playwright' branch (browserType.connect())
had no real provider behind it, and naming it 'websocket' was misleading
(CDP also rides WebSocket). Rather than commit to a name that BiDi will
make obsolete anyway, drop the field entirely. Callers who genuinely
need Playwright's connect() can still use connectOptions directly.

Removes:
- RemoteBrowserConfig.type and RemoteBrowserProvider.type
- PlaywrightRemoteBrowserConfig and PuppeteerRemoteBrowserConfig
  (now-empty interface extensions)
- The 'playwright' branch in PlaywrightPlugin._launch
- The "Puppeteer does not support 'playwright'" throw + tests
- 5 type-related test cases

The crawler-level 'headless' shortcut synthesized a launchContext.launchOptions
object, which then tripped the launcher's mutual-exclusion check against
remoteBrowser. Warn and skip the mutation instead — remote services control
headless mode anyway. Mirrors the existing useChrome warning in the launcher.

Replace apify/workflows/pnpm-install@main with a direct pnpm install
call without --loglevel error and without the pnpm store cache, to
surface the actual error behind the 8-min silent hang on Node 24.

Revert once root cause is identified.