[BUG] Missing support for simple_pattern_split and simple_pattern tokenizers #1444

Closed
mcb-sprout opened this issue Feb 18, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@mcb-sprout

What is the bug?

The client throws an exception when attempting to parse an index whose settings include a simple_pattern_split or simple_pattern tokenizer.

IndexSettings cannot be deserialized from settings that use either of these tokenizers, which prevents them from being used in a CreateIndexRequest. Using the client to make a GetIndexRequest for an index that uses these settings throws the same exception.

Exception thrown:
org.opensearch.client.util.MissingRequiredPropertyException: Missing required property 'Builder.<variant kind>'

How can one reproduce the bug?

Reproduce the bug by deserializing from JSON:

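import java.io.StringReader;
import jakarta.json.stream.JsonParser;
import org.opensearch.client.json.JsonpMapper;
import org.opensearch.client.opensearch.indices.IndexSettings;

String JSON = """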
        {
          "analysis": {
            "tokenizer": {
              "my_pattern_split_tokenizer": {
                "type": "simple_pattern_split",
                "pattern": "-"
              }
            },
            "analyzer": {
              "my_pattern_split_analyzer": {
                "type": "custom",
                "tokenizer": "my_pattern_split_tokenizer"
              }
            }
          }
        }
    """;

// `client` is an already-configured OpenSearchClient
JsonpMapper mapper = client._transport().jsonpMapper();
JsonParser parser = mapper.jsonProvider().createParser(new StringReader(JSON));

IndexSettings settings = IndexSettings._DESERIALIZER.deserialize(parser, mapper);

Reproduce the bug by getting an index which was created using these settings:

GetIndexRequest req = new GetIndexRequest.Builder()
        .index("test-index")
        .build();
GetIndexResponse resp = client.indices().get(req);

What is the expected behavior?

IndexSettings should deserialize from these settings because, according to the documentation, both tokenizers are still supported. The client should also be able to get an index that uses these settings.

What is your host/environment?

macOS Sequoia 15.3

Do you have any additional context?

These settings work when calling OpenSearch directly and appear to be supported by the High Level REST Client. I ran into this issue while migrating to the Java client. These tokenizer types aren't present in TokenizerDefinition.
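To illustrate why a missing entry in TokenizerDefinition would surface as a "variant kind" error, here is a minimal, self-contained sketch of a tagged union keyed on the "type" field. It is a simplified model and not the client's actual code: the class name, the KNOWN_KINDS list, and both helper methods are hypothetical, and the exception type is a stand-in for MissingRequiredPropertyException.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified model of a tagged union whose variant is selected by a "type" field. */
final class TokenizerKindSketch {

    // Hypothetical subset of registered variant kinds. The issue reports that
    // simple_pattern and simple_pattern_split are absent from the real list.
    static final Set<String> KNOWN_KINDS = Set.of("standard", "whitespace", "pattern", "keyword");

    /** Returns the variant kind, or throws when the "type" value is not registered. */
    static String resolveKind(Map<String, String> tokenizer) {
        String type = tokenizer.get("type");
        if (type == null || !KNOWN_KINDS.contains(type)) {
            // Stand-in for the client's MissingRequiredPropertyException.
            throw new IllegalStateException(
                "Missing required property '<variant kind>' (type=" + type + ")");
        }
        return type;
    }

    /** Lists the names of tokenizers whose type would fail to resolve. */
    static List<String> unsupported(Map<String, Map<String, String>> tokenizers) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : tokenizers.entrySet()) {
            try {
                resolveKind(e.getValue());
            } catch (IllegalStateException ex) {
                names.add(e.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> tokenizers = new LinkedHashMap<>();
        tokenizers.put("my_pattern_split_tokenizer",
            Map.of("type", "simple_pattern_split", "pattern", "-"));
        // Prints the tokenizer names this sketch cannot map to a known kind.
        System.out.println(unsupported(tokenizers));
    }
}
```

A check like unsupported(...) could be run over raw settings JSON before handing it to the client, as a stopgap on versions that lack these tokenizer types.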

OpenSearch DSL:

PUT test-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern_split_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "-"
        }
      },
      "analyzer": {
        "my_pattern_split_analyzer": {
          "type": "custom",
          "tokenizer": "my_pattern_split_tokenizer"
        }
      }
    }
  }
}
@Xtansia
Collaborator

Xtansia commented Mar 6, 2025

This was released as part of v2.22.0

Xtansia closed this as completed Mar 6, 2025