Extract links #655

seanmcguire12 · 2025-04-09T22:42:48Z

why

we want to be able to extract links from a webpage

usage

in your zod schema, add a field with the type z.string().url()

example:

const extraction = await stagehand.page.extract({
  instruction: "extract the link to the 'contact us' page",
  schema: z.object({
    link: z.string().url(),
  }),
});

how it works

if a user has defined a schema where a field type is z.string().url(), the extract handler converts that field type to z.number() and stores the path to the converted field
after this type conversion, we make the inference call to extract. the LLM returns IDs (of type number) in the converted fields
these IDs are used as keys to look up the corresponding URLs in the idToUrl mapping
then, we iteratively replace the IDs with the actual URLs, converting the types back to z.string().url() as we go
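
the schema conversion step described above can be sketched roughly as follows. this is a minimal illustration, not the code in extractHandler: ToySchema is a hypothetical, simplified stand-in for a zod schema, and transformUrlFields and the "*" array-path convention are assumptions made for the example.

```typescript
// Hypothetical, simplified stand-in for a zod schema (NOT the real zod types).
type ToySchema =
  | { kind: "url" }
  | { kind: "string" }
  | { kind: "number" }
  | { kind: "array"; element: ToySchema }
  | { kind: "object"; shape: Record<string, ToySchema> };

// A path is a list of object keys; "*" (an assumed convention) means "every array element".
type Path = string[];

// Recursively swap every url field for a number field and record where it was.
function transformUrlFields(
  schema: ToySchema,
  path: Path = [],
  urlPaths: Path[] = [],
): { schema: ToySchema; urlPaths: Path[] } {
  switch (schema.kind) {
    case "url":
      urlPaths.push(path);
      return { schema: { kind: "number" }, urlPaths };
    case "array": {
      const inner = transformUrlFields(schema.element, path.concat("*"), urlPaths);
      return { schema: { kind: "array", element: inner.schema }, urlPaths };
    }
    case "object": {
      const shape: Record<string, ToySchema> = {};
      for (const [key, child] of Object.entries(schema.shape)) {
        shape[key] = transformUrlFields(child, path.concat(key), urlPaths).schema;
      }
      return { schema: { kind: "object", shape }, urlPaths };
    }
    default:
      // Plain string / number fields are left untouched.
      return { schema, urlPaths };
  }
}

const userSchema: ToySchema = {
  kind: "object",
  shape: {
    title: { kind: "string" },
    link: { kind: "url" },
    related: {
      kind: "array",
      element: { kind: "object", shape: { href: { kind: "url" } } },
    },
  },
};

// transformed.urlPaths is [["link"], ["related", "*", "href"]]
const transformed = transformUrlFields(userSchema);
```

the recorded paths are what later allow the numeric IDs in the LLM output to be located and swapped for URLs.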

what changed

added logic in extractHandler to recurse through the user-defined schema and transform any fields of type z.string().url() to type z.number(), while recording their paths within the schema
added logic to inject the URLs into the extraction result while restoring the original types
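
a rough sketch of the injection step, under stated assumptions: injectUrls, the path format (object keys, with "*" meaning every array element), and the sample data are hypothetical and chosen for illustration; only the idea of looking up numeric IDs in an ID-to-URL mapping and falling back to an empty string comes from this PR.

```typescript
type IdToUrl = Record<string, string>;

// Walk `value` along `path` and replace the numeric ID found at the end
// with its URL. Missing IDs become "" (mirroring the fallback in this PR).
function injectUrls(value: unknown, path: string[], idToUrl: IdToUrl): unknown {
  const head = path[0];
  if (head === undefined) {
    // End of the path: this is where the LLM returned an ID.
    return typeof value === "number" ? idToUrl[String(value)] ?? "" : value;
  }
  const rest = path.slice(1);
  if (head === "*") {
    // "*" is an assumed convention meaning "apply to every array element".
    return Array.isArray(value)
      ? value.map((item) => injectUrls(item, rest, idToUrl))
      : value;
  }
  if (typeof value === "object" && value !== null && head in value) {
    const record = value as Record<string, unknown>;
    return { ...record, [head]: injectUrls(record[head], rest, idToUrl) };
  }
  return value;
}

// Hypothetical data: an LLM result with IDs, plus the paths of converted fields.
const idToUrl: IdToUrl = {
  "12": "https://example.com/contact",
  "40": "https://example.com/a",
};
const rawExtraction = { title: "Home", link: 12, related: [{ href: 40 }, { href: 99 }] };
const urlPaths = [["link"], ["related", "*", "href"]];

// Apply each recorded path in turn; ID 99 has no mapping and becomes "".
const withUrls = urlPaths.reduce<unknown>(
  (acc, p) => injectUrls(acc, p, idToUrl),
  rawExtraction,
);
```

the sketch returns new objects instead of mutating in place, which keeps the raw LLM output available for debugging; an in-place update would work equally well.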

test plan

added evals

changeset-bot · 2025-04-09T22:42:51Z

🦋 Changeset detected

Latest commit: a6a1184

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@browserbasehq/stagehand	Minor

greptile-apps

PR Summary

This PR enhances the extraction functionality by integrating link extraction support and updating the metadata and accessibility tree processing.

Updated /lib/inference.ts to add a new 'url_field' in the metadata schema and return it as 'urlField'.
Modified /lib/prompt.ts to adjust extraction prompts for DOM-based link extraction and instruct returning ID-only link outputs.
Enhanced /lib/a11y/utils.ts to remove redundant StaticText and extract URLs from node properties.
Introduced a recursive ID-to-URL replacement in /lib/handlers/extractHandler.ts.
Extended type definitions in /types/context.ts to support URL mapping.

5 file(s) reviewed, no comment(s)

greptile-apps

PR Summary

(updates since last review)

Added two new evaluation tasks to test URL extraction functionality and implemented the core URL extraction mechanism in the extract handler.

Added extract_single_link.ts and extract_jfk_links.ts evaluation tasks that test extracting URLs using z.string().url() schema type
Implemented schema transformation in extractHandler.ts that converts z.string().url() fields to z.number() during processing
Added URL injection mechanism that replaces numeric IDs with actual URLs in the extraction results
Modified evals.config.json to include the new evaluation tasks under the 'extract' category
Note: URL extraction is implemented in the domExtract method but not in the textExtract method, which is used by default

6 file(s) reviewed, 2 comment(s)

evals/tasks/extract_single_link.ts

greptile-apps · 2025-04-14T21:27:18Z

lib/handlers/extractHandler.ts

+        const mappedUrl = idToUrlMapping[String(fieldValue)];
+        record[key] = mappedUrl ?? ``;
+      }


style: If mappedUrl is undefined, this sets the field to an empty string. Consider returning null or undefined instead, or adding a warning log when a URL mapping is missing.
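
The alternative suggested above could look roughly like the sketch below. It is illustrative only: resolveUrl and its warn parameter are hypothetical names, not part of the PR, and returning null would also require the restored field type to accept null (for example via zod's .nullable()).

```typescript
// Hypothetical helper: look up a URL by ID, returning null and emitting a
// warning when the mapping has no entry, instead of silently returning "".
function resolveUrl(
  id: number,
  idToUrlMapping: Record<string, string>,
  warn: (message: string) => void = console.warn,
): string | null {
  const mappedUrl = idToUrlMapping[String(id)];
  if (mappedUrl === undefined) {
    warn(`no URL found for ID ${id}`);
    return null;
  }
  return mappedUrl;
}

// Collect warnings in an array so the caller can inspect them.
const warnings: string[] = [];
const collectWarning = (message: string): void => {
  warnings.push(message);
};

const found = resolveUrl(7, { "7": "https://example.com/" }, collectWarning);
const missing = resolveUrl(8, { "7": "https://example.com/" }, collectWarning);
```

Here found holds the mapped URL, missing is null, and warnings contains one entry for ID 8.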

darasus · 2025-05-01T13:24:51Z

@seanmcguire12 which models have been used to test this? I've tried o3-mini and gemini-flash but no luck

seanmcguire12 added 11 commits April 8, 2025 14:51

remove redundant static text children

5ddb2e8

prettier

d31329e

changeset

346acd8

add link mapping to TreeResult

716f6a4

changeset

16d6d25

Merge remote-tracking branch 'origin/main' into collapse-duplicate-text-nodes

4f54003

Merge branch 'collapse-duplicate-text-nodes' into add-link-mapping

2e1a8cc

get url field from metadata inference call

0866877

map id->url

e33f23c

update prompt

4be76c3

useTextExtract should be default false

fc82c01

changeset

81a5233

greptile-apps bot reviewed Apr 9, 2025

seanmcguire12 added 3 commits April 9, 2025 15:48

Merge branch 'add-link-mapping' into extract-links

a321a9d

Merge remote-tracking branch 'origin/main' into extract-links

908a8bc

# Conflicts: # lib/a11y/utils.ts

better naming

8c99828

kamath mentioned this pull request Apr 11, 2025

Cannot extract link targets (anymore) #651

Closed

seanmcguire12 marked this pull request as draft April 14, 2025 18:04

seanmcguire12 added 7 commits April 14, 2025 12:13

Merge remote-tracking branch 'origin/main' into extract-links

3f57082

rm changeset

d794ad3

schema patch approach

49b992f

rm urlField from metadata schema

e511502

empty string if no URL found for ID

42ba19d

add eval

87c7ddb

add another eval

a6a1184

seanmcguire12 marked this pull request as ready for review April 14, 2025 21:26

greptile-apps bot reviewed Apr 14, 2025

seanmcguire12 added the extract label (These changes pertain to the extract function) Apr 14, 2025

seanmcguire12 requested a review from kamath April 16, 2025 18:58

seanmcguire12 requested review from miguelg719 and sameelarif April 16, 2025 18:58

miguelg719 approved these changes Apr 16, 2025

seanmcguire12 merged commit 8814af9 into main Apr 16, 2025
26 checks passed

github-actions bot mentioned this pull request Apr 16, 2025

Version Packages #665

Merged
