
[FEATURE] Machine readable PPL function documentation. #1065

Open
ykmr1224 opened this issue Feb 21, 2025 · 19 comments
Labels
enhancement New feature or request Lang:PPL Pipe Processing Language support

Comments

@ykmr1224
Collaborator

ykmr1224 commented Feb 21, 2025

Is your feature request related to a problem?

To provide contextual help for OpenSearch SQL/PPL while the user is editing their query, I want better-structured documentation from which the description, parameters, and their types can be automatically read and extracted.

What solution would you like?

Ideally the functions are documented in a format such as JSON that can be easily parsed.

What alternatives have you considered?

Extracting that information from the current markdown docs.

Do you have any additional context?
n/a

@ykmr1224 ykmr1224 added enhancement New feature or request untriaged labels Feb 21, 2025
@anasalkouz anasalkouz added the Lang:PPL Pipe Processing Language support label Feb 21, 2025
@anasalkouz
Member

@LantaoJin can you please take a look at this?

@LantaoJin
Member

We could leverage an LLM to create an initial version based on the current markdown docs, but how do we ensure all docs can be auto-converted to JSON format in the future? Asking developers to add a JSON doc manually seems very tricky. How about formatting our existing markdown docs consistently and leveraging a markdown-to-JSON tool (for example https://github.com/njvack/markdown-to-json) as a long-term solution?

@LantaoJin
Member

However, I'm also open to providing just a one-time conversion, because in the short term there probably won't be many new commands and functions added to this repository. But without automated conversion, our developers would need to be particularly careful and manually update the JSON files whenever they change any existing commands or functions.

@LantaoJin
Member

LantaoJin commented Feb 25, 2025

@anasalkouz
Member

Is the Apache Spark documentation machine-readable? If yes, then I am leaning toward using the Apache Spark format so we have consistent documentation for both SQL and PPL.

@ykmr1224
Collaborator Author

The Spark functions doc is generated from the Scaladoc in this file: https://github.com/apache/spark/blob/0184c5bf6670e5bde0f79b2ce64319fce813704f/sql/api/src/main/scala/org/apache/spark/sql/functions.scala

Should we follow the same approach?
I see some functions don't have docs for their arguments, but I think those are the self-explanatory ones.

@perinbehara

perinbehara commented Feb 25, 2025

I do not think the Apache docs have the accurate information that is needed!

We are currently working on autosuggest for all PPL functions. We just need some format from which we can extract these details:

What's required to create the expected JSON format:
1. General description of the function
2. How many syntaxes or usages there are
3. Description of each of the syntaxes
4. Most importantly, the parameters, their types, and the return types

These are expected to be provided in order to help us create that JSON (see the sketch below).

Otherwise, please suggest a better way to do it!
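
A minimal TypeScript sketch of the fields listed above, expressed as one entry per function. The type and field names here are illustrative placeholders, not an agreed schema:

    // Illustrative only: the required fields from the list above as a TS shape.
    interface PplFunctionDocEntry {
      name: string;                  // e.g. "substr"
      description: string;          // general description of the function
      syntaxes: {
        description: string;        // description of this particular usage
        parameters: { name: string; type: string }[];
        returnType: string;
      }[];
    }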

@Swiddis
Contributor

Swiddis commented Feb 25, 2025

Since our existing documentation is rST, I'm inclined toward a system that can already handle rST well. Sphinx is widely used and I've worked with it before. (Spark's docs linked above also use Sphinx, for the record.)

It most commonly outputs HTML, but we can configure it to output doctree XML instead, which should be machine-readable. If we have a specific output format we need, we can also implement a custom Builder. This should limit the migration work from our existing system.

For a more long-term solution, we should move our function documentation to Javadoc comments.

This is easier to maintain since the documentation lives in the same place as its function. Javadocs already have a tag system for input and return parameters. The input and return data types become attached to the method signatures, which guarantees that they stay up to date.

In either case, with Sphinx, we can have it separately produce documentation for the OpenSearch site, for autosuggestion, and for any other consumers, just by changing the output configuration. Sphinx can also handle validating its input, so we can check the format directly in CI.
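
For reference, a rough sketch of how a consumer could pull parameter information out of Sphinx's XML builder output. It assumes a browser-like environment (fetch, DOMParser) and that the restructured RST describes parameters with Docutils field lists; the element names (field, field_name, field_body) are standard Docutils, but the exact structure will depend on how the docs are reorganized, so treat this as illustrative only:

    // Illustrative only: read doctree XML produced by Sphinx's XML builder and
    // pull out parameter name/description pairs from Docutils field lists.
    async function extractParams(url: string) {
      const xml = await (await fetch(url)).text();
      const doc = new DOMParser().parseFromString(xml, 'application/xml');

      // Docutils renders field lists as <field><field_name/><field_body/></field>.
      return Array.from(doc.querySelectorAll('field')).map((field) => ({
        name: field.querySelector('field_name')?.textContent ?? '',
        description: field.querySelector('field_body')?.textContent?.trim() ?? '',
      }));
    }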

@perinbehara

    {
      token: 'substr',
      detail:
        'string functions\n\nsubstr(str: string, startIndex: number): string',
      signatures: [
        {
          documentation: {
            value:
              'Returns a substring from the index specified by the number argument to the end of the string. If the function has a second number argument, it contains the length of\n' +
              ' the \n' +
              ' substring to be retrieved. For example, `substr("xyZfooxyZ",3, 3)` returns `"foo"`.',
            isTrusted: true,
            supportThemeIcons: false,
          },
          label: 'substr(str: string, startIndex: number): string',
          parameters: [
            { label: 'str: string' },
            { label: 'startIndex: number' },
          ],
        },
        {
          documentation: {
            value:
              'Returns a substring from the index specified by the number argument to the end of the string. If the function has a second number argument, it contains the length of\n' +
              ' the \n' +
              ' substring to be retrieved. For example, `substr("xyZfooxyZ",3, 3)` returns `"foo"`.',
            isTrusted: true,
            supportThemeIcons: false,
          },
          label:
            'substr(str: string, startIndex: number, length: number): string',
          parameters: [
            { label: 'str: string' },
            { label: 'startIndex: number' },
            { label: 'length: number' },
          ],
        },
      ],
    },

This would be the expected JSON.

@Swiddis
Contributor

Swiddis commented Feb 25, 2025

That format relies heavily on copying unprocessed strings (e.g. "label": "startIndex: number" instead of "label": { "name": "startIndex", "type": "number" }). It should be possible to do something similar with Doctree. I don't expect we need to invest too heavily in a custom builder for that.

For a more type-safe format that we can validate at build time, a custom builder would probably help more.
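
For illustration, a small sketch of the structured representation suggested above, with an adapter that produces the flat "name: type" labels used in the earlier JSON sample. The type and function names are placeholders:

    // Illustrative only: structured parameter data plus an adapter down to the
    // string labels Monaco's signature help expects.
    interface StructuredParameter {
      name: string;
      type: string;
    }

    interface StructuredSignature {
      description: string;
      returnType: string;
      parameters: StructuredParameter[];
    }

    function toSignatureInfo(fn: string, sig: StructuredSignature) {
      const params = sig.parameters.map((p) => `${p.name}: ${p.type}`);
      return {
        label: `${fn}(${params.join(', ')}): ${sig.returnType}`,
        documentation: sig.description,
        parameters: params.map((label) => ({ label })),
      };
    }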

@perinbehara

So, is the suggestion that we use Doctree to produce this JSON? If we are settled on it, can you please share the expected doctree format so we can mock it up and implement the automation to generate this JSON?

@perinbehara

Is there any update on providing us the sample format, so we can start working on it?

@ykmr1224
Collaborator Author

Why do we need to stick to this JSON format?
Specifically, why do label and parameters need to be separate attributes?
I'm wondering why just a chunk of the RST document or javadoc/scaladoc isn't enough.

@perinbehara

perinbehara commented Feb 27, 2025

The JSON format is needed to provide code intelligence features in the query editor - specifically signature help, which shows function documentation, parameter information, and return types when users are typing queries. This helps users understand how to use each function correctly.

We need that specific JSON format because we are using the SignatureInformation interface of the Monaco Editor API to display the description, label, parameters, and the different signatures that a particular function allows, which has the following shape:

    export interface SignatureInformation {
        label: string;
        documentation?: string | IMarkdownString;
        parameters: ParameterInformation[];
    }

For consistency, we should maintain the same structure as the existing Operation objects in both files, with:

- `token`: Function name
- `detail`: Brief signature info
- `signatures`: Array containing documentation and parameter details
   - `documentation.value`: Human-readable description
   - `label`: Function signature
   - `parameters`: Array of parameter objects

This format provides a good developer experience through signature help tooltips in the editor while maintaining consistency across both PPL and SQL query languages.

As of now we are targeting PPL for this Q1; we will figure out a way for SQL as well.
The ask is not for you to provide the JSON file itself, but we do need some machine-readable file from which we can extract these values.
Let me know if you have any questions on this!
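
For context on how these objects are consumed, a minimal sketch of wiring such entries into Monaco's signature help. The 'ppl' language id, the lookup logic, and how the JSON is loaded are simplified assumptions, not the actual OpenSearch Dashboards implementation:

    // Illustrative only: feeding the proposed JSON into Monaco's signature help.
    import * as monaco from 'monaco-editor';

    interface PplFunctionEntry {
      token: string;
      detail: string;
      signatures: monaco.languages.SignatureInformation[];
    }

    declare const pplFunctionDocs: PplFunctionEntry[]; // loaded from the generated JSON

    monaco.languages.registerSignatureHelpProvider('ppl', {
      signatureHelpTriggerCharacters: ['(', ','],
      provideSignatureHelp(model, position) {
        // Naive lookup: the identifier just before the nearest unclosed "(".
        const prefix = model.getLineContent(position.lineNumber).slice(0, position.column - 1);
        const match = /([A-Za-z_]+)\s*\(([^()]*)$/.exec(prefix);
        if (!match) return null;
        const doc = pplFunctionDocs.find((d) => d.token === match[1].toLowerCase());
        if (!doc) return null;
        return {
          value: {
            signatures: doc.signatures,
            activeSignature: 0,
            activeParameter: (match[2].match(/,/g) ?? []).length,
          },
          dispose() {},
        };
      },
    });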

@ykmr1224
Collaborator Author

I think Scaladoc would be ideal since it includes the documentation as well as the function signatures.
Once a single Scala file with Scaladoc for the available PPL functions is established, @perinbehara can write a script to convert it to JSON using an existing Scaladoc processor. (The same approach would work for SQL functions.)

@LantaoJin
I think the current implementation of PPL functions is distributed across several files. Is it possible to have a single place similar to functions.scala in the Spark repository?

@perinbehara

Thanks, that would be great! @LantaoJin, we need your help on this. Also, we need all the commands in one place as well, which would help us automate syntax highlighting too.

@LantaoJin
Member

LantaoJin commented Feb 27, 2025

@LantaoJin
I think the current implementation of PPL functions is distributed across several files. Is it possible to have a single place similar to functions.scala in the Spark repository?

@ykmr1224 No, we don't maintain all the function definitions in this repo. We just transform PPL functions to Spark functions by:

  1. Directly invoking Spark functions, such as ABS, POWER, etc. (most functions are of this type).
  2. Mapping PPL function names to Spark function names, such as DAY_OF_WEEK in PPL, which maps to dayofweek in Spark.
  3. Transforming the PPL function definition into a Spark built-in function definition when there is an alternative mapping with mismatching arguments, such as DATE_ADD in PPL, which we transform to the DateAddInterval expression in Spark.
  4. Transforming to a Spark UDF when no such function exists in Spark, such as JSON_APPEND and CIDR.

But the full list of functions is included in the class BuiltinFunctionName, see https://github.com/opensearch-project/opensearch-spark/blob/main/ppl-spark-integration/src/main/java/org/opensearch/sql/expression/function/BuiltinFunctionName.java.
And all of the transformations are included in the class BuiltinFunctionTransformer. See

static Expression builtinFunction(org.opensearch.sql.ast.expression.Function function, List<Expression> args) {

@LantaoJin
Member

Unlike the SQL plugin project, spark-ppl only converts PPL syntax into Spark plans. Except for the last type, which is implemented through ScalaUDF in our project, all other function definitions live in the Spark project. Therefore, the most complete function definitions we currently have are the manually compiled markdown doc files.

@ykmr1224
Collaborator Author

We had an offline discussion.
We will go ahead with reorganizing the RST documents to make them more structured and machine-readable. This is because PPL functions don't have actual Java/Scala functions behind them like Spark functions do, so we cannot put javadoc/scaladoc on the implementation. Another idea was to have a dummy Scala interface with Scaladoc, but that would not stay in sync with the actual implementation and would require additional effort to keep it aligned with the documentation.

We will come up with a template and replace the existing docs based on that template.
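
One way the generated output could be sanity-checked in CI, complementing the input validation mentioned earlier in the thread; the file path and field names below are assumptions, not anything agreed here:

    // Illustrative only: fail CI if a generated entry lacks the fields needed
    // for signature help. Path and schema are assumed, not decided.
    import { readFileSync } from 'fs';

    type GeneratedDoc = {
      name: string;
      description: string;
      signatures: { returnType: string; parameters: { name: string; type: string }[] }[];
    };

    const docs: GeneratedDoc[] = JSON.parse(readFileSync('build/ppl-functions.json', 'utf8'));
    const incomplete = docs.filter((d) => !d.description || d.signatures.length === 0);

    if (incomplete.length > 0) {
      console.error('Entries missing description or signatures:', incomplete.map((d) => d.name));
      process.exit(1);
    }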
