Extract links #655

Merged
merged 22 commits into from
Apr 16, 2025

Conversation

seanmcguire12
Member

@seanmcguire12 seanmcguire12 commented Apr 9, 2025

why

  • we want to be able to extract links from a webpage

usage

  • in your zod schema, add a field with the type z.string().url()

example:

const extraction = await stagehand.page.extract({
  instruction: "extract the link to the 'contact us' page",
  schema: z.object({
    link: z.string().url(),
  }),
});

how it works

  • if a user has defined a schema where a field type is z.string().url(), the extract handler converts that field type to z.number() and stores the path to the converted field
  • after this type conversion, we make the inference call to extract. the LLM returns IDs (of type number) in the converted fields
  • these IDs are used as keys to look up the corresponding URLs in the idToUrl mapping
  • then, we iteratively replace the IDs with the actual URLs, converting the types back to z.string().url() as we go
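
The replacement step above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `injectUrls`, the path format (an array of keys, with "*" standing for every element of an array), and the sample data are all hypothetical. The fallback to an empty string mirrors the snippet quoted in the review thread further down.

```typescript
// Hypothetical sketch: the real extractHandler operates on zod schemas and
// its own mapping type. Names and path format here are assumptions.
type IdToUrl = Record<string, string>;

// Walk `node` along `path` and swap numeric IDs for URLs at the leaf.
// A "*" segment means "apply the rest of the path to every array element".
function injectUrls(node: unknown, path: string[], idToUrl: IdToUrl): void {
  const head = path[0];
  if (head === undefined || node === null || typeof node !== "object") return;
  const rest = path.slice(1);

  if (head === "*") {
    if (!Array.isArray(node)) return;
    const items: unknown[] = node;
    items.forEach((value, index) => {
      if (rest.length > 0) {
        injectUrls(value, rest, idToUrl);
      } else if (typeof value === "number") {
        // Leaf array of IDs: replace in place, empty string if unmapped.
        items[index] = idToUrl[String(value)] ?? "";
      }
    });
    return;
  }

  const record = node as Record<string, unknown>;
  if (rest.length > 0) {
    injectUrls(record[head], rest, idToUrl);
    return;
  }
  const value = record[head];
  if (typeof value === "number") {
    record[head] = idToUrl[String(value)] ?? "";
  }
}

// Made-up LLM output where URL fields hold element IDs.
const llmResult: Record<string, unknown> = {
  contact: 12,
  related: [{ link: 3 }, { link: 99 }],
};
const sampleMapping: IdToUrl = {
  "12": "https://example.com/contact",
  "3": "https://example.com/about",
};

injectUrls(llmResult, ["contact"], sampleMapping);
injectUrls(llmResult, ["related", "*", "link"], sampleMapping);
```

In this sketch the ID 99 has no entry in the mapping, so that field ends up as an empty string.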

what changed

  • added logic in extractHandler to recurse through the user defined schema, and transform any fields of type z.string().url() to type z.number() while recording their path within the schema
  • added logic to inject the URLs into the extraction result while restoring the original types
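
The schema recursion described above might look roughly like this. To keep the sketch self-contained it does not use zod: `FieldSpec` is a made-up, simplified stand-in for a zod schema, and `swapUrlFields` is a hypothetical name, so this shows only the shape of the traversal (swap URL fields for number fields, record each path), not the actual extractHandler code.

```typescript
// Hypothetical, simplified stand-in for a zod schema. This is NOT the zod API.
type FieldSpec =
  | { kind: "url" }
  | { kind: "number" }
  | { kind: "string" }
  | { kind: "array"; element: FieldSpec }
  | { kind: "object"; shape: Record<string, FieldSpec> };

// Return a copy of `spec` with every "url" field turned into "number",
// appending the path of each converted field to `found`.
function swapUrlFields(spec: FieldSpec, path: string[], found: string[][]): FieldSpec {
  switch (spec.kind) {
    case "url":
      found.push(path);
      return { kind: "number" };
    case "array":
      // "*" marks "every element of this array" in the recorded path.
      return { kind: "array", element: swapUrlFields(spec.element, [...path, "*"], found) };
    case "object": {
      const shape: Record<string, FieldSpec> = {};
      for (const [key, child] of Object.entries(spec.shape)) {
        shape[key] = swapUrlFields(child, [...path, key], found);
      }
      return { kind: "object", shape };
    }
    default:
      return spec; // plain string/number fields are left untouched
  }
}

// Made-up user schema: a title plus a list of objects that each hold a URL.
const userSchema: FieldSpec = {
  kind: "object",
  shape: {
    title: { kind: "string" },
    links: {
      kind: "array",
      element: { kind: "object", shape: { url: { kind: "url" } } },
    },
  },
};

const urlPaths: string[][] = [];
const llmSchema = swapUrlFields(userSchema, [], urlPaths);
```

The recorded paths are what a later injection pass would need in order to find the numeric IDs in the result and restore them to URLs.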

test plan

  • added evals


changeset-bot bot commented Apr 9, 2025

🦋 Changeset detected

Latest commit: a6a1184

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@browserbasehq/stagehand Minor

Contributor

@greptile-apps greptile-apps bot left a comment

PR Summary

This PR enhances the extraction functionality by integrating link extraction support and updating the metadata and accessibility tree processing.

  • Updated /lib/inference.ts to add a new 'url_field' in the metadata schema and return it as 'urlField'.
  • Modified /lib/prompt.ts to adjust extraction prompts for DOM-based link extraction and instruct returning ID-only link outputs.
  • Enhanced /lib/a11y/utils.ts to remove redundant StaticText and extract URLs from node properties.
  • Introduced a recursive ID-to-URL replacement in /lib/handlers/extractHandler.ts.
  • Extended type definitions in /types/context.ts to support URL mapping.

5 file(s) reviewed, no comment(s)

@seanmcguire12 seanmcguire12 marked this pull request as draft April 14, 2025 18:04
@seanmcguire12 seanmcguire12 marked this pull request as ready for review April 14, 2025 21:26
Contributor

@greptile-apps greptile-apps bot left a comment

PR Summary

(updates since last review)

Added two new evaluation tasks to test URL extraction functionality and implemented the core URL extraction mechanism in the extract handler.

  • Added extract_single_link.ts and extract_jfk_links.ts evaluation tasks that test extracting URLs using z.string().url() schema type
  • Implemented schema transformation in extractHandler.ts that converts z.string().url() fields to z.number() during processing
  • Added URL injection mechanism that replaces numeric IDs with actual URLs in the extraction results
  • Modified evals.config.json to include the new evaluation tasks under the 'extract' category
  • Note: URL extraction is implemented in the domExtract method but not in the textExtract method, which is used by default

6 file(s) reviewed, 2 comment(s)

Comment on lines +889 to +891
const mappedUrl = idToUrlMapping[String(fieldValue)];
record[key] = mappedUrl ?? ``;
}

style: If mappedUrl is undefined, this sets the field to an empty string. Consider returning null or undefined instead, or adding a warning log when a URL mapping is missing.
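
One way the suggestion above could look, as a sketch only: `resolveUrl` and its `warn` parameter are hypothetical names, not part of the PR. Returning null instead of an empty string would presumably also require the restored field type to accept null, which this sketch does not address.

```typescript
// Hypothetical helper: look up a URL by element ID, warn when it is missing,
// and return null rather than silently substituting an empty string.
function resolveUrl(
  id: number,
  idToUrl: Record<string, string>,
  warn: (message: string) => void = console.warn,
): string | null {
  const mapped = idToUrl[String(id)];
  if (mapped === undefined) {
    warn(`no URL mapped for element ID ${id}`);
    return null;
  }
  return mapped;
}

// Usage with made-up data; warnings are collected so they can be inspected.
const warnings: string[] = [];
const knownUrls = { "7": "https://example.com/contact" };
const hit = resolveUrl(7, knownUrls, (m) => warnings.push(m));
const miss = resolveUrl(42, knownUrls, (m) => warnings.push(m));
```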

@seanmcguire12 seanmcguire12 added the extract These changes pertain to the extract function label Apr 14, 2025
@seanmcguire12 seanmcguire12 requested a review from kamath April 16, 2025 18:58
@seanmcguire12 seanmcguire12 merged commit 8814af9 into main Apr 16, 2025
26 checks passed
@github-actions github-actions bot mentioned this pull request Apr 16, 2025
darasus commented May 1, 2025

@seanmcguire12 which models have been used to test this? I've tried o3-mini and gemini-flash but no luck

Labels
extract These changes pertain to the extract function
3 participants