Skip to content

Conversation

@amallia
Copy link
Member

@amallia amallia commented Jul 28, 2025

No description provided.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds range pruning functionality to the inverted index by introducing a configurable range_pruning_ratio parameter. The feature allows selective inclusion of posting list entries based on their scores to optimize index size and performance.

  • Adds range_pruning_ratio field to IndexBuilder and CiffToBmp structs
  • Implements score-based pruning logic that filters posting list entries below a calculated threshold
  • Updates the CIFF to BMP conversion pipeline to support the new pruning parameter

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/index/inverted_index.rs Adds range_pruning_ratio field to IndexBuilder and implements pruning logic in posting list processing
src/ciff/mod.rs Adds range_pruning_ratio parameter to CiffToBmp struct and passes it through the conversion pipeline
Comments suppressed due to low confidence (1)

compress_range: bool,
range_pruning_ratio: f32,
}

Copy link

Copilot AI Jul 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The range_pruning_ratio field lacks initialization in the struct. This could cause issues when creating new instances without explicitly setting this field.

Suggested change
impl Default for CiffToBmp {
fn default() -> Self {
Self {
input: None,
output: None,
bsize: None,
compress_range: false,
range_pruning_ratio: 0.0, // Default value for range_pruning_ratio
}
}
}

Copilot uses AI. Check for mistakes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this nonsense?

@thefxperson
Copy link

This is a static pruning ratio, right? If I set range_pruning_ratio to 0.1, it keeps the 90% highest scoring postings for each posting list?
I haven't ran the code myself, but it LGTM. I'd say add a paragraph in the readme to clarify the functionality, but otherwise it seems like a good addition.

@amallia
Copy link
Member Author

amallia commented Jul 29, 2025

Thanks @thefxperson, I have added some documentation in the README. Since I was on it, I decided to add pruning to the forward index as well.

compress_range: bool,
range_pruning_ratio: f32,
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this nonsense?

}
}
pub fn insert_posting_list(&mut self, term_id: u32, posting_list: &Vec<(u32, u32)>) {
let mut sorted_scores: Vec<u32> = posting_list.iter().map(|&(_, score)| score).collect();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say you definitely want to first check if the ratio is >0 and only then calculate this because it seems very wasteful if you didn't ask for pruning.

fwd_pruning_ratio,
}
}
pub fn insert_posting_list(&mut self, term_id: u32, posting_list: &Vec<(u32, u32)>) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a posting list in this context, is this the (term, frequency) pairs?

Comment on lines +38 to +40
- **`--range-pruning-ratio <RATIO>`**: Controls the pruning of range max scores in the inverted index. A value of 0.1 means the bottom 10% of range scores will be pruned. Default: 0.0 (no pruning). Higher values result in smaller indexes but may reduce recall.

- **`--fwd-pruning-ratio <RATIO>`**: Controls the pruning of forward index entries. A value of 0.2 means the bottom 20% of forward index scores will be pruned. Default: 0.0 (no pruning). This reduces memory usage for the forward index at the cost of potential recall loss.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something, so correct me if I'm wrong, but you seem to prune, say, 10% of lowest-scoring postings from each posting list, which is not the same as 10% of lowest-scoring postings in index, which to me this README suggests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants