-
Notifications
You must be signed in to change notification settings - Fork 348
Feature/add metadata output #150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
PR Summary: Metadata Output FeatureWhat We Fixed1. Metadata Output FeatureAdded 2. ZMQ Linking Build System FixFixed editable install ( Why this was needed:
Root cause:
Fixes applied:
Known Limitations1. AST Chunking Metadata Loss
2. Test Failure
Test Results11/12 tests pass. The one failure is because of that limitation I mentioned. If you could merge the changes, I can continue to fix the ast chunking part in this route. |
Merge Conflict ResolutionThis PR was rebased on the latest Conflict Resolution DecisionBoth PRs (#149 and this PR #150) solve the same problem: displaying chunk source information. However, they use different approaches: PR #149 approach (upstream): text_with_source = "Chunk source:" + source_path + "\n" + node.get_content().replace("\n", " ")
builder.add_text(chunk_text, {"source": chunk_source})
This PR's approach: chunk_metadata = {
"file_path": file_path or source_path,
"file_name": doc.metadata.get("file_name", ""),
"creation_date": ...,
"last_modified_date": ...
}
builder.add_text(chunk["text"], metadata=chunk["metadata"])
Why This Approach is Superior
The conflict was resolved by keeping the structured approach while maintaining compatibility with the existing API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ww2283 Great work on the metadata feature! The structured approach is much cleaner than string embedding.
Before we can fully approve I think a few changes should be made:
- Please resolve the merge conflicts with main (PR #149 was merged). Not too sure if I am missing anything here.
- Consider fixing the AST chunking metadata loss issue
- The ZMQ fixes look solid - good catch on the linking issues
Integration Note: I made a PR for the token limit fixes for #153 that will complement both this PR and #152. If you can see the similarites between the two and see if we need to make any resolves, i think we can move forward. Once these are resolved, we can create a comprehensive solution for Ollama embeddings.
|
Hi @ASuresh0524, thanks for the review! Addressing your feedback: 1. Merge Conflicts (PR #149)✅ Resolved. I merged upstream/main into both PR #150 and #152. The cli.py conflicts were resolved by keeping the structured metadata approach. The remaining conflict is 2.
|
|
Step 3: Fix AST metadata loss (new PR, which I will do it after 152 merging) Test with Ollama batch processing Review overlap in embedding_compute.py Yeah, then we can start working on @ww2283 , let me know whats your idea here, do you think #154 solve all of these? |
45b87ce to
a85d0ad
Compare
|
@yichuan-w not really, I will help to resolve the conflicts from current 154. I'm working on a pr which should land soon (today or tomorrow) that integrates current 154 changes and resolve conflicts while get the ast chunking to work with metadata. |
|
Yeah, great, thanks for your contribution! Let me know when I need to review, and you can join our Slack, we can actively discuss there!! |
|
just to clarify @ww2283 will you be fixing the error on 154 or do you want me to help on that end? but thanks for the help @yichuan-w! |
I have commented in 154 thread. Sorry for the slow reply. |
What does this PR do?
Add
--show-metadataflag to display file metadata in search results and fix ZMQlinking issues in Python extension build system.
Related Issues
Fixes #144
Checklist
uv run pytest)ruff formatandruff check)pre-commit run --all-files)