tishun commented Oct 20, 2025

First off - massive apologies to @Dag7 - we both seem to have started working on essentially the same issue, and we only realised that when we had to merge his changes in #8963.

Goal

The purpose of this change is to align the implementation of langchainjs with that of langchain, specifically with regard to using Redis as a vector store. The two solutions have drifted apart in some major ways, and some of those differences make them incompatible with each other (one app using langchainjs and another using langchain might have trouble saving to the same Redis store).

Solution

Some of these were addressed by @Dag7 already, but we wanted to provide a more complete implementation, including:

  • API abstractions that make it easier to build up custom queries, possibly addressing problems such as "How to use RedisVectorStoreFilterType?" (#5010)
  • do not expose driver-specific (node-redis) models as part of the API of langchainjs (potentially allowing for a driver change if required)
  • using custom filters does not rely on providing a custom schema; instead, the langchainjs driver would infer the schema from the provided metadata
  • attempt to provide backwards compatibility with the old implementation, instead of providing a new one
  • extend the integration tests with a lot of scenarios that were missing
  • use UUIDs for generating keys, similar to langchain
  • etc.

Please let us know if there is something we can do to improve this solution.

IMHO it would be very good to have this in the 1.0 release; otherwise we would have to change the contract again after releasing the solution provided by @Dag7.

changeset-bot bot commented Oct 20, 2025

⚠️ No Changeset found

Latest commit: 121d7a0

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@tishun tishun mentioned this pull request Oct 20, 2025

Dag7 commented Oct 20, 2025

Hey @tishun 👋
No worries at all 😅

My main goal with this change was to keep it fully backward compatible — for example, I intentionally avoided forcing UUIDs, since many existing users already rely on predictable or custom keys. Forcing UUIDs could break compatibility or make adoption slower for those users.

I really like your approach to making the implementation driver-agnostic — that direction makes a lot of sense. We’ll just need to make sure we add some form of adapter layer since different clients (like ioredis vs node-redis) behave differently in a few areas (pipelines, return types, etc.).

Regarding the schema, I decided to have it explicitly defined at index creation time. That way, the index and the schema are always aligned, and we can validate metadata on insert. If we infer the schema automatically from the first batch of documents, we might not capture all possible fields (since not all are required), which leads to a tricky situation:

  • The index would be created based on an incomplete schema
  • Later inserts could include new fields or types, and we would have to decide whether to re-index, ignore, or error out, all of which have trade-offs.

So defining the schema first felt like the safest and most predictable approach.
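
The concern about inferring from the first batch can be illustrated with a small sketch. Everything here is hypothetical and not part of langchainjs: the `FieldType` names, the two helper functions, and the mapping from JavaScript values to field kinds are assumptions chosen only to show how a later document can drift away from an inferred schema.

```typescript
// Hypothetical field kinds, loosely modelled on search-index field types.
type FieldType = "numeric" | "tag" | "text";
type Metadata = Record<string, unknown>;
type InferredSchema = Map<string, FieldType>;

// Assumption: numbers are numeric, arrays and booleans are tags, the rest is text.
function inferFieldType(value: unknown): FieldType {
  if (typeof value === "number") return "numeric";
  if (Array.isArray(value) || typeof value === "boolean") return "tag";
  return "text";
}

// Build a schema from whatever fields happen to appear in the first batch.
function inferSchema(batch: Metadata[]): InferredSchema {
  const schema: InferredSchema = new Map();
  for (const metadata of batch) {
    for (const [field, value] of Object.entries(metadata)) {
      if (value === undefined || value === null) continue;
      if (!schema.has(field)) schema.set(field, inferFieldType(value));
    }
  }
  return schema;
}

// Report fields of a later document that are unknown to the schema or changed type.
function findSchemaDrift(schema: InferredSchema, metadata: Metadata): string[] {
  const problems: string[] = [];
  for (const [field, value] of Object.entries(metadata)) {
    if (value === undefined || value === null) continue;
    const expected = schema.get(field);
    const actual = inferFieldType(value);
    if (expected === undefined) {
      problems.push(`unknown field: ${field}`);
    } else if (expected !== actual) {
      problems.push(`type change: ${field} (${expected} -> ${actual})`);
    }
  }
  return problems;
}

// The first batch has no "tags" field, so the inferred schema is incomplete.
const firstBatch: Metadata[] = [{ author: "alice", year: 2024 }];
const schema = inferSchema(firstBatch);
const drift = findSchemaDrift(schema, { author: "bob", year: "2025", tags: ["redis"] });
console.log(schema, drift);
```

Whatever `findSchemaDrift` reports still has to be handled by one of the three options above (re-index, ignore, or error out); the sketch only detects the mismatch.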

Let me know what you think.

tishun commented Oct 22, 2025

> My main goal with this change was to keep it fully backward compatible - for example, I intentionally avoided forcing UUIDs, since many existing users already rely on predictable or custom keys. Forcing UUIDs could break compatibility or make adoption slower for those users.

Agreed, backwards compatibility (and predictability) would be sacrificed if we chose to use UUIDs. We would, however, gain compatibility with the Python implementation. To be fair, the two are not really incompatible and collisions are highly unlikely even now, but strictly speaking they generate keys in two different ways. One could argue that this is also a benefit, since it lets us identify which driver added which vectors, so I am not at all adamant about this change. If you recommend that we revert to the old implementation, I see no problem with doing that (specifically in that regard).
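
To make the trade-off concrete, here is a minimal sketch of the key strategies being discussed. The function names and the `prefix + suffix` key format are assumptions for illustration, not the actual langchainjs or langchain key layout, and the sketch uses Node's built-in `randomUUID` rather than the `uuid` package added in this PR.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical predictable key: prefix plus the document's position in the batch.
function predictableKey(prefix: string, index: number): string {
  return `${prefix}${index}`;
}

// Hypothetical UUID key: prefix plus a random version 4 UUID.
function uuidKey(prefix: string): string {
  return `${prefix}${randomUUID()}`;
}

// Caller-supplied keys always win; otherwise fall back to the chosen strategy.
function resolveKeys(
  prefix: string,
  count: number,
  customKeys?: string[],
  useUuid = false
): string[] {
  if (customKeys !== undefined) {
    if (customKeys.length !== count) {
      throw new Error("expected exactly one key per document");
    }
    return customKeys;
  }
  return Array.from({ length: count }, (_, i) =>
    useUuid ? uuidKey(prefix) : predictableKey(prefix, i)
  );
}

console.log(resolveKeys("doc:", 2));
console.log(resolveKeys("doc:", 2, undefined, true));
```

Keeping custom keys as an explicit override is one way to preserve existing behaviour for users who rely on their own keys, whichever default is chosen.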

> I really like your approach to making the implementation driver-agnostic - that direction makes a lot of sense. We’ll just need to make sure we add some form of adapter layer since different clients (like ioredis vs node-redis) behave differently in a few areas (pipelines, return types, etc.).

True, and this is also why I stopped short of a complete solution. The change becomes quite large, and the benefit right now is not entirely visible (Redis intends to support node-redis primarily, and it is also the better driver in terms of functionality, quality and performance). This part of my change was more of a good practice: should a migration from one driver to another become necessary, its impact would be smaller.
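
As a rough sketch of what such an adapter boundary could look like: the `HashClient` interface, the in-memory implementation, and `saveDocument` below are all hypothetical names that do not exist in langchainjs. A real adapter would wrap node-redis or ioredis behind the same interface; an in-memory stand-in is used here so the example stays runnable without a Redis server.

```typescript
// Hypothetical driver-neutral surface that the vector store would depend on.
interface HashClient {
  setHash(key: string, fields: Record<string, string>): Promise<void>;
  getHash(key: string): Promise<Record<string, string>>;
}

// In-memory stand-in used here in place of a real driver adapter.
class InMemoryHashClient implements HashClient {
  private readonly store = new Map<string, Record<string, string>>();

  async setHash(key: string, fields: Record<string, string>): Promise<void> {
    // Merge with existing fields, as a hash write would.
    this.store.set(key, { ...(this.store.get(key) ?? {}), ...fields });
  }

  async getHash(key: string): Promise<Record<string, string>> {
    // Return a copy so callers cannot mutate the stored value.
    return { ...(this.store.get(key) ?? {}) };
  }
}

// Store code depends only on HashClient, so the driver can change behind it.
async function saveDocument(
  client: HashClient,
  key: string,
  content: string,
  metadata: Record<string, string>
): Promise<void> {
  await client.setHash(key, { content, ...metadata });
}
```

Driver differences such as pipelines and return types would be absorbed inside each adapter class, which is where most of the extra work mentioned above would land.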

> So defining the schema first felt like the safest and most predictable approach.

Completely reasonable. Using a custom schema is definitely the more stable solution, and I assume most users would do that. Inferring the schema is a generally lazier approach, but still a valid one if used correctly. I think the two serve different use cases:

  • a simple approach, where the metadata is always the same for all documents and can (safely) be inferred by the driver, which makes usage much simpler
  • an advanced use case, perhaps the one most users would choose, where the metadata schema is defined by the user and the driver follows these definitions strictly

BTW I am not sure it was apparent from my description, but currently both modes are available:

  • legacy filter or legacy metadata field (in the existing vector storage) results in legacy metadata handling
  • missing custom schema results in a custom metadata schema being inferred from the first batch of documents
  • existing custom metadata schema results in it being applied with priority
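
The three cases above can be read as a small decision function. This is only a sketch of the described behaviour, not the code in this PR: the type and field names are hypothetical, and checking the legacy case before the custom schema is an assumption based on the order of the list.

```typescript
type MetadataMode = "legacy" | "custom" | "inferred";

// Hypothetical inputs describing what the caller and the existing index look like.
interface ModeInputs {
  usesLegacyFilter: boolean; // caller passed an old-style filter
  indexHasLegacyMetadataField: boolean; // existing store keeps metadata the old way
  customSchema?: Record<string, string>; // user-defined metadata schema, if any
}

function resolveMetadataMode(inputs: ModeInputs): MetadataMode {
  // Legacy filter or legacy metadata field: keep the legacy metadata handling.
  if (inputs.usesLegacyFilter || inputs.indexHasLegacyMetadataField) {
    return "legacy";
  }
  // An explicit custom schema takes priority over inference.
  if (inputs.customSchema !== undefined) {
    return "custom";
  }
  // No schema given: infer one from the first batch of documents.
  return "inferred";
}

console.log(
  resolveMetadataMode({ usesLegacyFilter: false, indexHasLegacyMetadataField: false })
);
```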

This is also - mostly - how the Python implementation of langchain works.

Does that make sense?

    "format:check": "prettier --config .prettierrc --check \"src\""
  },
  "dependencies": {
    "uuid": "^10.0.0",

Did you run pnpm install after this change? It would probably update the pnpm-lock.yaml that should also be added to the PR.
