58 changes: 56 additions & 2 deletions apps/site/docs/en/api.mdx
@@ -25,6 +25,58 @@ In Playwright and Puppeteer, there are some common parameters:
- `forceSameTabNavigation: boolean`: If true, page navigation is restricted to the current tab. (Default: true)
- `waitForNavigationTimeout: number`: The timeout for waiting for navigation to finish. (Default: 5000ms; set to 0 to skip waiting for navigation)

These Agents also support the following advanced configuration parameters:

- `modelConfig: () => IModelConfig`: Optional. Custom model configuration function. Allows you to dynamically configure different models through code instead of environment variables. This is particularly useful when you need to use different models for different AI tasks (such as VQA, planning, grounding, etc.).

**Example:**
```typescript
const agent = new PuppeteerAgent(page, {
modelConfig: () => ({
MIDSCENE_MODEL_NAME: 'qwen3-vl-plus',
MIDSCENE_MODEL_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
MIDSCENE_MODEL_API_KEY: 'sk-...',
MIDSCENE_LOCATOR_MODE: 'qwen3-vl'
})
});
```

- `createOpenAIClient: (config) => OpenAI`: Optional. Custom OpenAI client factory function. Allows you to create custom OpenAI client instances for integrating observability tools (such as LangSmith, LangFuse) or using custom OpenAI-compatible clients.

**Parameter Description:**
- `config.modelName: string` - Model name
- `config.openaiApiKey?: string` - API key
- `config.openaiBaseURL?: string` - API endpoint URL
- `config.intent: string` - AI task type ('VQA' | 'planning' | 'grounding' | 'default')
- `config.vlMode?: string` - Visual language model mode
- Other configuration parameters...

**Example (LangSmith Integration):**
```typescript
import OpenAI from 'openai';
import { wrapOpenAI } from 'langsmith/wrappers';

const agent = new PuppeteerAgent(page, {
createOpenAIClient: (config) => {
const openai = new OpenAI({
apiKey: config.openaiApiKey,
baseURL: config.openaiBaseURL,
});

// Wrap with LangSmith for planning tasks
if (config.intent === 'planning') {
return wrapOpenAI(openai, {
metadata: { task: 'planning' }
});
}

return openai;
}
});
```

**Note:** `createOpenAIClient` overrides the behavior of the `MIDSCENE_LANGSMITH_DEBUG` environment variable. If you provide a custom client factory function, you need to handle the integration of LangSmith or other observability tools yourself.

In Puppeteer, there is also a parameter:

- `waitForNetworkIdleTimeout: number`: The timeout for waiting for network idle between actions. (Default: 2000ms; set to 0 to skip waiting for network idle)
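Putting the common and Puppeteer-only parameters together, agent construction looks roughly like this (a minimal sketch; the values shown are the documented defaults):

```typescript
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();

const agent = new PuppeteerAgent(page, {
  forceSameTabNavigation: true,    // keep navigation in the current tab (default: true)
  waitForNavigationTimeout: 5000,  // ms; set to 0 to skip waiting for navigation
  waitForNetworkIdleTimeout: 2000, // ms; Puppeteer only; set to 0 to skip the wait
});
```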
@@ -854,9 +906,11 @@ You can override environment variables at runtime by calling the `overrideAIConf
import { overrideAIConfig } from '@midscene/web/puppeteer'; // or another Agent

overrideAIConfig({
OPENAI_BASE_URL: '...',
OPENAI_API_KEY: '...',
MIDSCENE_MODEL_NAME: '...',
MODEL_BASE_URL: '...', // recommended, use new variable name
MODEL_API_KEY: '...', // recommended, use new variable name
// OPENAI_BASE_URL: '...', // deprecated but still compatible
// OPENAI_API_KEY: '...', // deprecated but still compatible
});
```

46 changes: 31 additions & 15 deletions apps/site/docs/en/choose-a-model.mdx
@@ -4,6 +4,22 @@ import TroubleshootingLLMConnectivity from './common/troubleshooting-llm-connect

Choose one of the following models, obtain the API key, complete the configuration, and you are ready to go. Choose the model that is easiest to obtain if you are a beginner.

## Environment Variable Configuration

Starting from version 1.0, Midscene.js recommends using the following new environment variable names:

- `MODEL_API_KEY` - API key (recommended)
- `MODEL_BASE_URL` - API endpoint URL (recommended)

For backward compatibility, the following legacy variable names are still supported:

- `OPENAI_API_KEY` - API key (deprecated but still compatible)
- `OPENAI_BASE_URL` - API endpoint URL (deprecated but still compatible)

When both new and old variables are set, the new variables (`MODEL_*`) will take precedence.

In the configuration examples throughout this document, we will use the new variable names. If you are currently using the old variable names, there is no need to change them immediately; they will continue to work.
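The documented precedence can be pictured as a simple fallback (an illustrative sketch; this is not Midscene's actual resolution code):

```typescript
// The new MODEL_* variable wins; the legacy OPENAI_* one is the fallback.
function resolveApiKey(env: Record<string, string | undefined>): string | undefined {
  return env.MODEL_API_KEY ?? env.OPENAI_API_KEY;
}

// Both set: the new variable takes precedence.
console.log(resolveApiKey({ MODEL_API_KEY: 'sk-new', OPENAI_API_KEY: 'sk-legacy' })); // → sk-new

// Only the legacy variable set: it still works.
console.log(resolveApiKey({ OPENAI_API_KEY: 'sk-legacy' })); // → sk-legacy
```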

## Adapted models for using Midscene.js

Midscene.js supports two types of models: visual-language (VL) models and general-purpose LLMs.
@@ -46,8 +62,8 @@ We recommend the Qwen3-VL series, which clearly outperforms Qwen2.5-VL. Qwen3-VL
Using the Alibaba Cloud `qwen3-vl-plus` model as an example:

```bash
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="......"
MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MODEL_API_KEY="......"
MIDSCENE_MODEL_NAME="qwen3-vl-plus"
MIDSCENE_USE_QWEN3_VL=1 # Note: cannot be set together with MIDSCENE_USE_QWEN_VL
```
@@ -57,8 +73,8 @@ MIDSCENE_USE_QWEN3_VL=1 # Note: cannot be set together with MIDSCENE_USE_QWEN_VL
Using the Alibaba Cloud `qwen-vl-max-latest` model as an example:

```bash
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="......"
MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MODEL_API_KEY="......"
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1 # Note: cannot be set together with MIDSCENE_USE_QWEN3_VL
```
@@ -85,8 +101,8 @@ They perform strongly for visual grounding and assertion in complex scenarios. W
After obtaining an API key from [Volcano Engine](https://volcengine.com), you can use the following configuration:

```bash
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
OPENAI_API_KEY="...."
MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MODEL_API_KEY="...."
MIDSCENE_MODEL_NAME="ep-..." # Inference endpoint ID or model name from Volcano Engine
MIDSCENE_USE_DOUBAO_VISION=1
```
@@ -108,8 +124,8 @@ When using Gemini-2.5-Pro, set `MIDSCENE_USE_GEMINI=1` to enable Gemini-specific
After applying for the API key on [Google Gemini](https://gemini.google.com/), you can use the following config:

```bash
OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
OPENAI_API_KEY="......"
MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MODEL_API_KEY="......"
MIDSCENE_MODEL_NAME="gemini-2.5-pro-preview-05-06"
MIDSCENE_USE_GEMINI=1
```
@@ -130,8 +146,8 @@ With UI-TARS you can use goal-driven prompts, such as "Log in with username foo
You can use the deployed `doubao-1.5-ui-tars` on [Volcano Engine](https://volcengine.com).

```bash
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
OPENAI_API_KEY="...."
MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MODEL_API_KEY="...."
MIDSCENE_MODEL_NAME="ep-2025..." # Inference endpoint ID or model name from Volcano Engine
MIDSCENE_USE_VLM_UI_TARS=DOUBAO
```
@@ -164,8 +180,8 @@ The token cost of GPT-4o is relatively high because Midscene sends DOM informati
**Config**

```bash
OPENAI_API_KEY="......"
OPENAI_BASE_URL="https://custom-endpoint.com/compatible-mode/v1" # Optional, if you want an endpoint other than the default OpenAI one.
MODEL_API_KEY="......"
MODEL_BASE_URL="https://custom-endpoint.com/compatible-mode/v1" # Optional, if you want an endpoint other than the default OpenAI one.
MIDSCENE_MODEL_NAME="gpt-4o-2024-11-20" # Optional. The default is "gpt-4o".
```

@@ -176,7 +192,7 @@ Other models are also supported by Midscene.js. Midscene will use the same promp

1. A multimodal model is required, which means it must support image input.
1. The larger the model, the better it works. However, it needs more GPU or money.
1. Find out how to to call it with an OpenAI SDK compatible endpoint. Usually you should set the `OPENAI_BASE_URL`, `OPENAI_API_KEY` and `MIDSCENE_MODEL_NAME`. Config are described in [Config Model and Provider](./model-provider).
1. Find out how to call it with an OpenAI SDK compatible endpoint. Usually you should set the `MODEL_BASE_URL`, `MODEL_API_KEY` and `MIDSCENE_MODEL_NAME`. Configs are described in [Config Model and Provider](./model-provider).
1. If you find it not working well after changing the model, you can try using some short and clear prompt, or roll back to the previous model. See more details in [Prompting Tips](./prompting-tips).
1. Remember to follow the terms of use of each model and provider.
1. Don't include the `MIDSCENE_USE_VLM_UI_TARS` and `MIDSCENE_USE_QWEN_VL` config unless you know what you are doing.
@@ -185,8 +201,8 @@ Other models are also supported by Midscene.js. Midscene will use the same promp

```bash
MIDSCENE_MODEL_NAME="....."
OPENAI_BASE_URL="......"
OPENAI_API_KEY="......"
MODEL_BASE_URL="......"
MODEL_API_KEY="......"
```

For more details and sample config, see [Config Model and Provider](./model-provider).
34 changes: 19 additions & 15 deletions apps/site/docs/en/model-provider.mdx
@@ -9,12 +9,14 @@ In this article, we will show you how to config AI service provider and how to c
## Configs

### Common configs
These are the most common configs, in which `OPENAI_API_KEY` is required.
These are the most common configs, in which `MODEL_API_KEY` or `OPENAI_API_KEY` is required.

| Name | Description |
|------|-------------|
| `OPENAI_API_KEY` | Required. Your OpenAI API key (e.g. "sk-abcdefghijklmnopqrstuvwxyz") |
| `OPENAI_BASE_URL` | Optional. Custom endpoint URL for API endpoint. Use it to switch to a provider other than OpenAI (e.g. "https://some_service_name.com/v1") |
| `MODEL_API_KEY` | Required (recommended). Your API key (e.g. "sk-abcdefghijklmnopqrstuvwxyz") |
| `MODEL_BASE_URL` | Optional (recommended). Custom API endpoint URL. Use it to switch to a provider other than OpenAI (e.g. "https://some_service_name.com/v1") |
| `OPENAI_API_KEY` | Deprecated but still supported; prefer `MODEL_API_KEY` |
| `OPENAI_BASE_URL` | Deprecated but still supported; prefer `MODEL_BASE_URL` |
| `MIDSCENE_MODEL_NAME` | Optional. Specify a model name other than `gpt-4o` |

Extra configs to use `Qwen 2.5 VL` model:
@@ -69,7 +71,7 @@ Pick one of the following ways to config environment variables.

```bash
# replace by your own
export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
export MODEL_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"

# if you are not using the default OpenAI model, you need to config more params
# export MIDSCENE_MODEL_NAME="..."
@@ -89,7 +91,7 @@ npm install dotenv --save
Create a `.env` file in your project root directory, and add the following content. There is no need to add `export` before each line.

```
OPENAI_API_KEY=sk-abcdefghijklmnopqrstuvwxyz
MODEL_API_KEY=sk-abcdefghijklmnopqrstuvwxyz
```

Import the dotenv module in your script. It will automatically read the environment variables from the `.env` file.
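A minimal way to do that (standard dotenv usage; adapt the import style to your module system):

```typescript
// Importing 'dotenv/config' loads the .env file into process.env as a side effect,
// so it must run before any code that reads MODEL_API_KEY and friends.
import 'dotenv/config';
```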
@@ -110,6 +112,8 @@ import { overrideAIConfig } from "@midscene/web/puppeteer";

overrideAIConfig({
MIDSCENE_MODEL_NAME: "...",
MODEL_BASE_URL: "...", // recommended, use new variable name
MODEL_API_KEY: "...", // recommended, use new variable name
// ...
});
```
@@ -119,8 +123,8 @@ overrideAIConfig({
Configure the environment variables:

```bash
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://endpoint.some_other_provider.com/v1" # config this if you want to use a different endpoint
export MODEL_API_KEY="sk-..."
export MODEL_BASE_URL="https://endpoint.some_other_provider.com/v1" # config this if you want to use a different endpoint
export MIDSCENE_MODEL_NAME="gpt-4o-2024-11-20" # optional, the default is "gpt-4o"
```

@@ -129,8 +133,8 @@ export MIDSCENE_MODEL_NAME="gpt-4o-2024-11-20" # optional, the default is "gpt-4
Configure the environment variables:

```bash
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export MODEL_API_KEY="sk-..."
export MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
export MIDSCENE_USE_QWEN_VL=1
```
@@ -142,8 +146,8 @@ Configure the environment variables:


```bash
export OPENAI_BASE_URL="https://ark-cn-beijing.bytedance.net/api/v3"
export OPENAI_API_KEY="..."
export MODEL_BASE_URL="https://ark-cn-beijing.bytedance.net/api/v3"
export MODEL_API_KEY="..."
export MIDSCENE_MODEL_NAME='ep-...'
export MIDSCENE_USE_DOUBAO_VISION=1
```
@@ -153,17 +157,17 @@ export MIDSCENE_USE_DOUBAO_VISION=1
Configure the environment variables:

```bash
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="http://localhost:1234/v1"
export MODEL_API_KEY="sk-..."
export MODEL_BASE_URL="http://localhost:1234/v1"
export MIDSCENE_MODEL_NAME="ui-tars-72b-sft"
export MIDSCENE_USE_VLM_UI_TARS=1
```

## Example: config request headers (e.g. for OpenRouter)

```bash
export OPENAI_BASE_URL="https://openrouter.ai/api/v1"
export OPENAI_API_KEY="..."
export MODEL_BASE_URL="https://openrouter.ai/api/v1"
export MODEL_API_KEY="..."
export MIDSCENE_MODEL_NAME="..."
export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"defaultHeaders":{"HTTP-Referer":"...","X-Title":"..."}}'
```
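Conceptually, the JSON in `MIDSCENE_OPENAI_INIT_CONFIG_JSON` is parsed and merged into the OpenAI client's constructor options. A rough sketch (illustrative only; not Midscene's actual implementation):

```typescript
// The value you would put in MIDSCENE_OPENAI_INIT_CONFIG_JSON:
const initConfigJson =
  '{"defaultHeaders":{"HTTP-Referer":"https://example.com","X-Title":"my-app"}}';

// Parsed and spread into the client options, so defaultHeaders (and any other
// OpenAI client constructor options) pass through to every request.
const clientOptions = {
  apiKey: 'sk-...',                        // from MODEL_API_KEY
  baseURL: 'https://openrouter.ai/api/v1', // from MODEL_BASE_URL
  ...JSON.parse(initConfigJson),
};

console.log(clientOptions.defaultHeaders['X-Title']); // → my-app
```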
60 changes: 58 additions & 2 deletions apps/site/docs/zh/api.mdx
@@ -25,6 +25,58 @@ Each Agent in Midscene has its own constructor.
- `forceSameTabNavigation: boolean`: If true, page navigation is restricted to the current tab. (Default: true)
- `waitForNavigationTimeout: number`: The timeout for waiting for navigation to finish. (Default: 5000ms; set to 0 to skip waiting)

These Agents also support the following advanced configuration parameters:

- `modelConfig: () => IModelConfig`: Optional. Custom model configuration function. Lets you configure different models dynamically in code instead of through environment variables. This is particularly useful when you need different models for different AI tasks (such as VQA, planning, and grounding).

**Example:**
```typescript
const agent = new PuppeteerAgent(page, {
modelConfig: () => ({
MIDSCENE_MODEL_NAME: 'qwen3-vl-plus',
MIDSCENE_MODEL_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
MIDSCENE_MODEL_API_KEY: 'sk-...',
MIDSCENE_LOCATOR_MODE: 'qwen3-vl'
})
});
```

- `createOpenAIClient: (config) => OpenAI`: Optional. Custom OpenAI client factory function. Lets you create custom OpenAI client instances, for integrating observability tools (such as LangSmith, LangFuse) or using a custom OpenAI-compatible client.

**Parameter description:**
- `config.modelName: string` - Model name
- `config.openaiApiKey?: string` - API key
- `config.openaiBaseURL?: string` - API endpoint URL
- `config.intent: string` - AI task type ('VQA' | 'planning' | 'grounding' | 'default')
- `config.vlMode?: string` - Visual-language model mode
- Other configuration parameters...

**Example (LangSmith integration):**
```typescript
import OpenAI from 'openai';
import { wrapOpenAI } from 'langsmith/wrappers';

const agent = new PuppeteerAgent(page, {
createOpenAIClient: (config) => {
const openai = new OpenAI({
apiKey: config.openaiApiKey,
baseURL: config.openaiBaseURL,
});

// Wrap with LangSmith for planning tasks
if (config.intent === 'planning') {
return wrapOpenAI(openai, {
metadata: { task: 'planning' }
});
}

return openai;
}
});
```

**Note:** `createOpenAIClient` overrides the behavior of the `MIDSCENE_LANGSMITH_DEBUG` environment variable. If you provide a custom client factory function, you need to handle the integration of LangSmith or other observability tools yourself.

In Puppeteer, there is also the following parameter:

- `waitForNetworkIdleTimeout: number`: The timeout for waiting for network idle after each action. (Default: 2000ms; set to 0 to skip waiting)
@@ -863,9 +915,13 @@ console.log(logContent);
import { overrideAIConfig } from '@midscene/web/puppeteer'; // or another Agent

overrideAIConfig({
OPENAI_BASE_URL: '...',
OPENAI_API_KEY: '...',
MODEL_BASE_URL: '...', // recommended: use the new variable name
MODEL_API_KEY: '...', // recommended: use the new variable name
MIDSCENE_MODEL_NAME: '...',

// The legacy variable names are still supported:
// OPENAI_BASE_URL: '...',
// OPENAI_API_KEY: '...',
});
```
