-
Notifications
You must be signed in to change notification settings - Fork 169
Adding main segment implementation for API and indexing. #2653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding main segment implementation for API and indexing. #2653
Conversation
…earch-project#2624) Signed-off-by: Arun Ganesh <[email protected]>
log.info("Starting vector profiling for field: {}", fieldInfo.getName()); | ||
SegmentProfilerState.profileVectors(knnVectorValuesSupplier, segmentWriteState, fieldInfo.getName()); | ||
log.info("Completed vector profiling for field: {}", fieldInfo.getName()); | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Should we catch the exception in our SegmentProfileState class instead? We want to keep the business logic here relatively untouched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
|
||
private KNNStats knnStats; | ||
private ClusterService clusterService; | ||
// private IndicesService indicesService; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uncomment or remove this line. Why do we need IndicesService here? It's a pretty heavy object to be using inside KnnPlugin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
private KNNStats knnStats; | ||
private ClusterService clusterService; | ||
// private IndicesService indicesService; | ||
private Environment environment; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need Environment here? What information are we trying to get here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used to access the data directory paths and node-specific items. It's needed to pass to the RestKNNSamplingStatsHandler to locate and read statistics files stored on disk. The environment provides access to the node's data paths where vector statistics files are stored under the indices directory structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this - we can directly follow what's being used in the QuantizationState
Supplier<RepositoriesService> repositoriesServiceSupplier | ||
) { | ||
this.clusterService = clusterService; | ||
// this.indicesService = client.getInstanceFromNode(IndicesService.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not leave commented lines in the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
this.clusterService = clusterService; | ||
// this.indicesService = client.getInstanceFromNode(IndicesService.class); | ||
this.repositoriesServiceSupplier = repositoriesServiceSupplier; | ||
this.environment = environment; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the environment here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used to access the data directory paths and node-specific items. It's needed to pass to the RestKNNSamplingStatsHandler to locate and read statistics files stored on disk. The environment provides access to the node's data paths where vector statistics files are stored under the indices directory structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this - we can directly follow what's being used in the QuantizationState
clusterService, | ||
indexNameExpressionResolver, | ||
this.environment | ||
// indicesService |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
* Rest handler for sampling stats endpoint | ||
*/ | ||
public class RestKNNSamplingStatsHandler extends BaseRestHandler { | ||
// private final IndicesService indicesService; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
* @param indexNameExpressionResolver Resolver for index names | ||
* @param environment OpenSearch environment configuration | ||
*/ | ||
public RestKNNSamplingStatsHandler( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: use lombok for @AllArgsConstructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
Environment environment | ||
// IndicesService indicesService | ||
) { | ||
// this.indicesService = indicesService; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
*/ | ||
@Override | ||
protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) throws IOException { | ||
String indexName = request.param("index"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Let's not hardcode parameter strings here.
-
An OS index can consist of multiple vector fields so getting just the index could return multiple fields. Do we need to query by field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
*/ | ||
public static SegmentProfilerState profileVectors(final Supplier<KNNVectorValues<?>> knnVectorValuesSupplier) throws IOException { | ||
KNNVectorValues<?> vectorValues = knnVectorValuesSupplier.get(); | ||
private static void writeStatsToFile(Path outputFile, List<SummaryStatistics> statistics, String fieldName, int vectorCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: final whenever possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
private static void writeStatsToFile(Path outputFile, List<SummaryStatistics> statistics, String fieldName, int vectorCount) | ||
throws IOException { | ||
// Create parent directories if they don't exist | ||
Files.createDirectories(outputFile.getParent()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based off our conversation with @jmazanec15 and what we've discussed we may want to revisit how we're writing to the segment directory. Can we re-use the implementation used by the QuantizationStateWriter
to persist our state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added changes
// Create parent directories if they don't exist | ||
Files.createDirectories(outputFile.getParent()); | ||
|
||
try (XContentBuilder jsonBuilder = XContentFactory.jsonBuilder()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused here... can we leverage some sort of ObjectMapper
to simply write the SummaryStatistic object to the segment? That way we can avoid the XContentBuilder since we already have the object we need to serialize.
If we need to serialize additional metadata not included in SummaryStatistic we can also wrap it around a wrapper object and serialize that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Faced some difficulty with objectmapper to help serialize. Keeping XContentBuilder approach for now to accomodate for bigger changes. Can change more if necessary.
for (int i = 0; i < statistics.size(); i++) { | ||
SummaryStatistics stats = statistics.get(i); | ||
jsonBuilder.startObject() | ||
.field("dimension", i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't hardcode strings if possible in business logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added variables for string values
* @param fieldName Name of the field being processed | ||
* @return SegmentProfilerState containing collected statistics | ||
*/ | ||
public static SegmentProfilerState profileVectors( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
profileVectors
here returns a SegmentProfilerState
but I'm not seeing it used anywhere in the PR. Are we using the method signature anywhere? Should we change it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using the profilerState to gather the statistics. I believe it should be fine to use for now as that is also used within the RestKNNSamplingStatsHandler
|
||
// Initialize vector values | ||
// Initialize new profiler state and vector values | ||
SegmentProfilerState profilerState = new SegmentProfilerState(new ArrayList<>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to create the SegmentProfilerState
here? Can we just return it if possible (assuming we need to return it)
…#2615) Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: Balasubramanian <[email protected]>
…ensearch-project#2647) * Add multi-vector-support faiss patch to IndexHNSW::search_level_0 Signed-off-by: AnnTian Shao <[email protected]> * Add tests to JNI and KNN Signed-off-by: AnnTian Shao <[email protected]> * Update tests by adding hnsw cagra index binary and remove JNI layer method updateIndexSettings Signed-off-by: AnnTian Shao <[email protected]> * test fixes Signed-off-by: AnnTian Shao <[email protected]> --------- Signed-off-by: AnnTian Shao <[email protected]> Co-authored-by: AnnTian Shao <[email protected]>
…oject#2646) * Combine method and lucene mappers to EngineFieldMapper Signed-off-by: Kunal Kotwani <[email protected]> * Change the default doc values to false, retain old value for flat field Signed-off-by: Kunal Kotwani <[email protected]> * Update flat field mapper checks Signed-off-by: Kunal Kotwani <[email protected]> * Fix the default doc value logic Signed-off-by: Kunal Kotwani <[email protected]> --------- Signed-off-by: Kunal Kotwani <[email protected]>
Signed-off-by: Vikasht34 <[email protected]>
…ject#2652) Fixes a bug that was already fixed in opensearch-project#2494 but was then reverted by accident in a refactor. It makes it so that instead of opening up readers for each transform request, it opens up once per reader. Signed-off-by: John Mazanec <[email protected]>
…ace with binary vectors (opensearch-project#2351) Signed-off-by: Bansi Kasundra <[email protected]>
…earch-project#2403) Signed-off-by: Neetika Singhal <[email protected]>
…opensearch-project#2650) Signed-off-by: Dooyong Kim <[email protected]> Co-authored-by: Dooyong Kim <[email protected]>
…earch-project#2641) Signed-off-by: luyuncheng <[email protected]>
…arch-project#2663) Signed-off-by: Dooyong Kim <[email protected]> Co-authored-by: Dooyong Kim <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
…arch-project#2639) Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: Dooyong Kim <[email protected]> Co-authored-by: Dooyong Kim <[email protected]>
…earch-project#2624) Signed-off-by: Arun Ganesh <[email protected]>
* RestHandler for k-NN index profile API. API provides the ability for a user to get statistical information | ||
* about vector dimensions in specific indices. | ||
*/ | ||
public class RestKNNProfileHandler extends BaseRestHandler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need more time to look into the API implementation for this PR since it's not too closely coupled with our profiling implementation
/** | ||
* Action for profiling KNN vectors in an index | ||
*/ | ||
public class KNNProfileAction extends ActionType<KNNProfileResponse> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we synced offline on this but is it possible to move the API implementation to another PR? I don't think there's
public static SegmentProfilerState getSegmentProfileState(final LeafReader leafReader, String fieldName) throws IOException { | ||
final SegmentProfileKNNCollector tempCollector = new SegmentProfileKNNCollector(); | ||
leafReader.searchNearestVectors(fieldName, new float[0], tempCollector, null); | ||
if (tempCollector.getSegmentProfilerState() == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to throw an exception if we're unable to get the SegmentProfilerState? What if the user has the feature disabled?
// Log the results for each dimension | ||
for (int i = 0; i < shardVectorProfile.size(); i++) { | ||
StatisticalSummaryValues stats = shardVectorProfile.get(i); | ||
log.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've mentioned this before but would it just be easier to use toString
here given we're not really focused on a specific output?
// For each leaf, collect the profile | ||
searcher.getIndexReader().leaves().forEach(leaf -> { | ||
try { | ||
log.info("[KNN] Processing leaf reader for segment: {}", leaf.reader()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we avoid clutter in the logs by reducing these info lines? This doesn't add any value here when debugging as we already have exception handling
* @return List of statistical summaries for each dimension | ||
*/ | ||
// TODO: Write unit tests to ensure that the segment statistic aggregation is correct. | ||
public List<StatisticalSummaryValues> profile(String fieldName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: final
* @param fieldName The name of the vector field to profile | ||
* @return List of statistical summaries for each dimension | ||
*/ | ||
// TODO: Write unit tests to ensure that the segment statistic aggregation is correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should write a integration test as well that covers across multiple segments. Let's leave this towards the end.
log.info("[KNN] Profiling completed for field: {} in shard: {}", fieldName, indexShard.shardId()); | ||
} catch (Exception e) { | ||
log.error( | ||
"[KNN] Critical error during profiling for field: {} in shard: {}: {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does Critical
mean in this sense? Error messages should be as succinct and objective as possible.
I looked at this PR from high level Overview , here are my suggestions that we might need to address |
Signed-off-by: Arun Ganesh <[email protected]>
10fa912
to
f183f6e
Compare
Description
Adding implementation for profiling and statistical analysis of KNN vector segments in OpenSearch. Provides functionality to collect, process, and store statistical information about vector dimensions across different shards and segments.
Example input:
Creating the index
Adding vectors to the index
Flushing the index
Observed example output:
Related Issues
Implements #2687
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check ✔️.