-
Notifications
You must be signed in to change notification settings - Fork 4
Adding range-pruning-ratio #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds range pruning functionality to the inverted index by introducing a configurable range_pruning_ratio parameter. The feature allows selective inclusion of posting list entries based on their scores to optimize index size and performance.
- Adds
range_pruning_ratiofield toIndexBuilderandCiffToBmpstructs - Implements score-based pruning logic that filters posting list entries below a calculated threshold
- Updates the CIFF to BMP conversion pipeline to support the new pruning parameter
Reviewed Changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/index/inverted_index.rs | Adds range_pruning_ratio field to IndexBuilder and implements pruning logic in posting list processing |
| src/ciff/mod.rs | Adds range_pruning_ratio parameter to CiffToBmp struct and passes it through the conversion pipeline |
Comments suppressed due to low confidence (1)
| compress_range: bool, | ||
| range_pruning_ratio: f32, | ||
| } | ||
|
|
Copilot
AI
Jul 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The range_pruning_ratio field lacks initialization in the struct. This could cause issues when creating new instances without explicitly setting this field.
| impl Default for CiffToBmp { | |
| fn default() -> Self { | |
| Self { | |
| input: None, | |
| output: None, | |
| bsize: None, | |
| compress_range: false, | |
| range_pruning_ratio: 0.0, // Default value for range_pruning_ratio | |
| } | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this nonsense?
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
|
This is a static pruning ratio, right? If I set range_pruning_ratio to 0.1, it keeps the 90% highest scoring postings for each posting list? |
|
Thanks @thefxperson, I have added some documentation in the README. Since I was on it, I decided to add pruning to the forward index as well. |
| compress_range: bool, | ||
| range_pruning_ratio: f32, | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this nonsense?
| } | ||
| } | ||
| pub fn insert_posting_list(&mut self, term_id: u32, posting_list: &Vec<(u32, u32)>) { | ||
| let mut sorted_scores: Vec<u32> = posting_list.iter().map(|&(_, score)| score).collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say you definitely want to first check if the ratio is >0 and only then calculate this because it seems very wasteful if you didn't ask for pruning.
| fwd_pruning_ratio, | ||
| } | ||
| } | ||
| pub fn insert_posting_list(&mut self, term_id: u32, posting_list: &Vec<(u32, u32)>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a posting list in this context, is this the (term, frequency) pairs?
| - **`--range-pruning-ratio <RATIO>`**: Controls the pruning of range max scores in the inverted index. A value of 0.1 means the bottom 10% of range scores will be pruned. Default: 0.0 (no pruning). Higher values result in smaller indexes but may reduce recall. | ||
|
|
||
| - **`--fwd-pruning-ratio <RATIO>`**: Controls the pruning of forward index entries. A value of 0.2 means the bottom 20% of forward index scores will be pruned. Default: 0.0 (no pruning). This reduces memory usage for the forward index at the cost of potential recall loss. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something, so correct me if I'm wrong, but you seem to prune, say, 10% of lowest-scoring postings from each posting list, which is not the same as 10% of lowest-scoring postings in index, which to me this README suggests.
No description provided.