
[FEATURE] Machine readable PPL function documentation. #1065

Open
ykmr1224 opened this issue Feb 21, 2025 · 19 comments
Labels
enhancement New feature or request Lang:PPL Pipe Processing Language support

Comments

@ykmr1224
Collaborator

ykmr1224 commented Feb 21, 2025

Is your feature request related to a problem?

To provide contextual help for OpenSearch SQL/PPL while the user is editing their query, I want better-structured documentation from which the description, parameters, and their types can be automatically read and extracted.

What solution would you like?

Ideally the functions are documented in a format such as JSON that can be easily parsed.

What alternatives have you considered?

Extracting that information from the current markdown docs.

Do you have any additional context?
n/a

@ykmr1224 ykmr1224 added enhancement New feature or request untriaged labels Feb 21, 2025
@anasalkouz anasalkouz added the Lang:PPL Pipe Processing Language support label Feb 21, 2025
@anasalkouz
Member

@LantaoJin can you please take a look at this?

@LantaoJin
Member

We could leverage an LLM to create an initial version based on the current markdown docs, but how do we ensure all docs can be auto-converted to JSON format in the future? Asking developers to add a JSON doc manually seems very tricky. How about formatting our existing markdown docs consistently and leveraging a markdown-to-JSON tool (for example https://github.com/njvack/markdown-to-json) as a long-term solution?

@LantaoJin
Member

However, I'm also open to providing just a one-time conversion, because in the short term there probably won't be many new commands and functions added to this repository. But without automated conversion, our developers would need to be particularly careful and manually update the JSON files whenever they change any existing commands or functions.

@LantaoJin
Member

LantaoJin commented Feb 25, 2025

@anasalkouz
Member

Is the Apache Spark documentation machine-readable? If yes, then I am leaning toward using the Apache Spark format so we have consistent documentation for both SQL and PPL.

@ykmr1224
Collaborator Author

The Spark functions doc is generated from the Scaladoc in this file: https://github.com/apache/spark/blob/0184c5bf6670e5bde0f79b2ce64319fce813704f/sql/api/src/main/scala/org/apache/spark/sql/functions.scala

Should we follow the same approach?
I see some functions don't have docs for their arguments, but I think those are the self-explanatory ones.

@perinbehara

perinbehara commented Feb 25, 2025

I do not think the Apache docs have the accurate information that is needed!

We are currently working on autosuggest for all PPL functions. We just need some format from which we can extract these details:

What's required to create the expected JSON format:
1. General description of the function
2. How many syntaxes or usages there are
3. Description of each of the syntaxes
4. Most importantly, the parameters, their types, and the return types

These are expected to be provided in order to help us create that JSON (see the sketch below).

Otherwise, please suggest a better way to do it!
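
A minimal TypeScript sketch of the fields listed above, expressed as one entry per function. The type and field names here are illustrative placeholders, not an agreed schema:

    // Illustrative only: the required fields from the list above as a TS shape.
    interface PplFunctionDocEntry {
      name: string;                  // e.g. "substr"
      description: string;          // general description of the function
      syntaxes: {
        description: string;        // description of this particular usage
        parameters: { name: string; type: string }[];
        returnType: string;
      }[];
    }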

@Swiddis
Contributor

Swiddis commented Feb 25, 2025

Since our existing documentation is rST, I'm inclined toward a system that can already handle rST well. Sphinx is widely used and I've worked with it before. (Spark's docs linked above also use Sphinx, for the record.)

It most commonly outputs HTML, but we can configure it to output doctree XML instead, which should be machine-readable. If we have a specific output format we need, we can also implement a custom Builder. This should limit the migration work from our existing system.

For a more long-term solution, we should move our function documentation to Javadoc comments.

This is easier to maintain since the documentation lives in the same place as its function. Javadocs already have a tag system for input and return parameters. The input and return data types become attached to the method signatures, which guarantees that they stay up to date.

In either case, with Sphinx, we can have it separately produce documentation for the OpenSearch site, for autosuggestion, and for any other consumers, just by changing the output configuration. Sphinx can also handle validating its input, so we can check the format directly in CI.
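
For reference, a rough sketch of how a consumer could pull parameter information out of Sphinx's XML builder output. It assumes a browser-like environment (fetch, DOMParser) and that the restructured RST describes parameters with Docutils field lists; the element names (field, field_name, field_body) are standard Docutils, but the exact structure will depend on how the docs are reorganized, so treat this as illustrative only:

    // Illustrative only: read doctree XML produced by Sphinx's XML builder and
    // pull out parameter name/description pairs from Docutils field lists.
    async function extractParams(url: string) {
      const xml = await (await fetch(url)).text();
      const doc = new DOMParser().parseFromString(xml, 'application/xml');

      // Docutils renders field lists as <field><field_name/><field_body/></field>.
      return Array.from(doc.querySelectorAll('field')).map((field) => ({
        name: field.querySelector('field_name')?.textContent ?? '',
        description: field.querySelector('field_body')?.textContent?.trim() ?? '',
      }));
    }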

@perinbehara

    {
      token: 'substr',
      detail:
        'string functions\n\nsubstr(str: string, startIndex: number): string',
      signatures: [
        {
          documentation: {
            value:
              'Returns a substring from the index specified by the number argument to the end of the string. If the function has a second number argument, it contains the length of\n' +
              ' the \n' +
              ' substring to be retrieved. For example, `substr("xyZfooxyZ",3, 3)` returns `"foo"`.',
            isTrusted: true,
            supportThemeIcons: false,
          },
          label: 'substr(str: string, startIndex: number): string',
          parameters: [
            { label: 'str: string' },
            { label: 'startIndex: number' },
          ],
        },
        {
          documentation: {
            value:
              'Returns a substring from the index specified by the number argument to the end of the string. If the function has a second number argument, it contains the length of\n' +
              ' the \n' +
              ' substring to be retrieved. For example, `substr("xyZfooxyZ",3, 3)` returns `"foo"`.',
            isTrusted: true,
            supportThemeIcons: false,
          },
          label:
            'substr(str: string, startIndex: number, length: number): string',
          parameters: [
            { label: 'str: string' },
            { label: 'startIndex: number' },
            { label: 'length: number' },
          ],
        },
      ],
    },

This would be the expected JSON.

@Swiddis
Contributor

Swiddis commented Feb 25, 2025

That format relies heavily on copying unprocessed strings (e.g. "label": "startIndex: number" instead of "label": { "name": "startIndex", "type": "number" }). It should be possible to do something similar with Doctree. I don't expect we need to invest too heavily in a custom builder for that.

For a more type-safe format that we can validate at build time, a custom builder would probably help more.
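
For illustration, a small sketch of the structured representation suggested above, with an adapter that produces the flat "name: type" labels used in the earlier JSON sample. The type and function names are placeholders:

    // Illustrative only: structured parameter data plus an adapter down to the
    // string labels Monaco's signature help expects.
    interface StructuredParameter {
      name: string;
      type: string;
    }

    interface StructuredSignature {
      description: string;
      returnType: string;
      parameters: StructuredParameter[];
    }

    function toSignatureInfo(fn: string, sig: StructuredSignature) {
      const params = sig.parameters.map((p) => `${p.name}: ${p.type}`);
      return {
        label: `${fn}(${params.join(', ')}): ${sig.returnType}`,
        documentation: sig.description,
        parameters: params.map((label) => ({ label })),
      };
    }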

@perinbehara

So, is the suggestion that we use Doctree to produce this JSON? If we are settled on it, can you please share the expected doctree format so we can mock it up and implement the automation to generate this JSON?

@perinbehara

Is there any update on providing us the sample format, so we can start working on it?

@ykmr1224
Collaborator Author

Why do we need to stick to this JSON format?
Specifically, why do label and parameters need to be separate attributes?
I'm wondering why just a chunk of the RST document or javadoc/scaladoc isn't enough.

@perinbehara

perinbehara commented Feb 27, 2025

The JSON format is needed to provide code intelligence features in the query editor - specifically signature help, which shows function documentation, parameter information, and return types when users are typing queries. This helps users understand how to use each function correctly.

We need that specific JSON format because we are using the SignatureInformation interface of the Monaco Editor API to display the description, label, parameters, and the different signatures that a particular function allows, which has the following shape:

    export interface SignatureInformation {
        label: string;
        documentation?: string | IMarkdownString;
        parameters: ParameterInformation[];
    }

For consistency, we should maintain the same structure as the existing Operation objects in both files, with:

- `token`: Function name
- `detail`: Brief signature info
- `signatures`: Array containing documentation and parameter details
   - `documentation.value`: Human-readable description
   - `label`: Function signature
   - `parameters`: Array of parameter objects

This format provides a good developer experience through signature help tooltips in the editor while maintaining consistency across both PPL and SQL query languages.

As of now we are targeting PPL for this Q1; we will figure out a way for SQL as well.
The ask is not for you to provide the JSON file itself, but we do need some machine-readable file from which we can extract these values.
Let me know if you have any questions on this!
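
For context on how these objects are consumed, a minimal sketch of wiring such entries into Monaco's signature help. The 'ppl' language id, the lookup logic, and how the JSON is loaded are simplified assumptions, not the actual OpenSearch Dashboards implementation:

    // Illustrative only: feeding the proposed JSON into Monaco's signature help.
    import * as monaco from 'monaco-editor';

    interface PplFunctionEntry {
      token: string;
      detail: string;
      signatures: monaco.languages.SignatureInformation[];
    }

    declare const pplFunctionDocs: PplFunctionEntry[]; // loaded from the generated JSON

    monaco.languages.registerSignatureHelpProvider('ppl', {
      signatureHelpTriggerCharacters: ['(', ','],
      provideSignatureHelp(model, position) {
        // Naive lookup: the identifier just before the nearest unclosed "(".
        const prefix = model.getLineContent(position.lineNumber).slice(0, position.column - 1);
        const match = /([A-Za-z_]+)\s*\(([^()]*)$/.exec(prefix);
        if (!match) return null;
        const doc = pplFunctionDocs.find((d) => d.token === match[1].toLowerCase());
        if (!doc) return null;
        return {
          value: {
            signatures: doc.signatures,
            activeSignature: 0,
            activeParameter: (match[2].match(/,/g) ?? []).length,
          },
          dispose() {},
        };
      },
    });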

@ykmr1224
Collaborator Author

I think Scaladoc would be ideal since it includes the documentation as well as the function signatures.
Once a single Scala file with Scaladoc for the available PPL functions is established, @perinbehara can write a script to convert it to JSON using an existing Scaladoc processor. (The same approach would work for SQL functions.)

@LantaoJin
I think the current implementation of PPL functions is distributed across several files. Is it possible to have a single place similar to functions.scala in the Spark repository?

@perinbehara

Thanks, that would be great! @LantaoJin, we need your help on this. Also, we need all the commands in one place as well, which would help us automate syntax highlighting too.

@LantaoJin
Member

LantaoJin commented Feb 27, 2025

@LantaoJin
I think the current implementation of PPL functions is distributed across several files. Is it possible to have a single place similar to functions.scala in the Spark repository?

@ykmr1224 No, we don't maintain all the function definitions in this repo. We just transform PPL functions to Spark functions by:

  1. Directly invoking Spark functions, such as ABS, POWER, etc. (most functions are of this type).
  2. Mapping PPL function names to Spark function names, such as DAY_OF_WEEK in PPL, which maps to dayofweek in Spark.
  3. Transforming the PPL function definition into a Spark built-in function definition when there is an alternative mapping with mismatching arguments, such as DATE_ADD in PPL, which we transform to the DateAddInterval expression in Spark.
  4. Transforming to a Spark UDF when no such function exists in Spark, such as JSON_APPEND and CIDR.

But the full list of functions is included in the class BuiltinFunctionName, see https://github.com/opensearch-project/opensearch-spark/blob/main/ppl-spark-integration/src/main/java/org/opensearch/sql/expression/function/BuiltinFunctionName.java.
And all of the transformations are included in the class BuiltinFunctionTransformer. See

static Expression builtinFunction(org.opensearch.sql.ast.expression.Function function, List<Expression> args) {

@LantaoJin
Member

Unlike the SQL plugin project, spark-ppl only converts PPL syntax into Spark plans. Except for the last type, which is implemented through ScalaUDF in our project, all other function definitions live in the Spark project. Therefore, the most complete function definitions we currently have are the manually compiled markdown doc files.

@ykmr1224
Collaborator Author

We had an offline discussion.
We will go ahead with reorganizing the RST documents to make them more structured and machine-readable. This is because PPL functions don't have actual Java/Scala functions behind them like Spark functions do, so we cannot put javadoc/scaladoc on the implementation. Another idea was to have a dummy Scala interface with Scaladoc, but that would not stay in sync with the actual implementation and would require additional effort to keep it aligned with the documentation.

We will come up with a template and replace the existing docs based on that template.
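
One way the generated output could be sanity-checked in CI, complementing the input validation mentioned earlier in the thread; the file path and field names below are assumptions, not anything agreed here:

    // Illustrative only: fail CI if a generated entry lacks the fields needed
    // for signature help. Path and schema are assumed, not decided.
    import { readFileSync } from 'fs';

    type GeneratedDoc = {
      name: string;
      description: string;
      signatures: { returnType: string; parameters: { name: string; type: string }[] }[];
    };

    const docs: GeneratedDoc[] = JSON.parse(readFileSync('build/ppl-functions.json', 'utf8'));
    const incomplete = docs.filter((d) => !d.description || d.signatures.length === 0);

    if (incomplete.length > 0) {
      console.error('Entries missing description or signatures:', incomplete.map((d) => d.name));
      process.exit(1);
    }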
