Commit 0d96879

[DOC] Documentation for PPL new engine (V3) and limitations of 3.0.0 Beta (#3488)
1 parent d22680b commit 0d96879

16 files changed

+1055
-17
lines changed

docs/category.json

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,9 @@
2828
"user/ppl/cmd/trendline.rst",
2929
"user/ppl/cmd/top.rst",
3030
"user/ppl/cmd/where.rst",
31+
"user/ppl/cmd/join.rst",
32+
"user/ppl/cmd/lookup.rst",
33+
"user/ppl/cmd/subquery.rst",
3134
"user/ppl/general/identifiers.rst",
3235
"user/ppl/general/datatypes.rst",
3336
"user/ppl/functions/condition.rst",

docs/dev/img/alternative.png

544 KB

docs/dev/img/relbuilder.png

120 KB

docs/dev/intro-v3-architecture.md

Lines changed: 110 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,110 @@
1+
# OpenSearch PPL V3 Engine Architecture
2+
3+
---
4+
## 1. Overview
5+
6+
Apache Calcite (https://calcite.apache.org/) is an open source framework that provides an industry-standard SQL parser and a highly extensible query optimization framework, which allows developers to customize and plug in their own rules for improving query performance. It is widely adopted by a variety of organizations and has gained industry support from major players in the data management and analytics space. It is often used as a foundational technology for building custom data management solutions and analytical applications.
7+
8+
Apache Calcite provides:
9+
10+
- Parser: parses a SQL statement into an abstract syntax tree (AST).
11+
- Validator: converts the AST to a logical plan using a pluggable catalog.
12+
- Optimizer: optimizes the logical plan with a rule-based optimizer (RBO) or a cost-based optimizer (CBO).
13+
- Executor: converts the optimized plan to a Linq expression and executes it via the Linq4j connector.
14+
- Translator: translates SQL from one dialect to another.
15+
16+
OpenSearch currently supports three query languages: DSL, SQL, and PPL. DSL (Domain-Specific Language) is JSON-based and designed to define queries at a lower level (think of it as an executable physical plan). SQL and PPL are implemented in the SQL plugin (as the V2 engine), and both support relational operations and concepts such as databases, tables, schemas, and columns. OpenSearch PPL has gradually become the native query language of OpenSearch.
17+
18+
PPL is specifically designed to simplify tasks in observability and security analytics, making it easier for users to analyze and understand their data. Its syntax and concepts align with familiar languages like Splunk SPL and SQL. Although PPL is a native part of OpenSearch, it’s designed independently of any specific query engine, allowing other query engines to adopt it and making it flexible enough to handle structured, semi-structured, and nested data across various platforms.
19+
20+
---
21+
## 2. Pain points
22+
23+
### 2.1 PPL lacks the ability to handle complex queries
24+
The current PPL engine (shared with the SQL V2 engine) is built with custom components, including a parser, analyzer, and optimizer, and relies heavily on OpenSearch DSL capabilities to execute query plans. By aligning PPL's syntax and concepts with familiar languages like Splunk SPL and SQL, we aim to streamline migration for users from these backgrounds, allowing them to adopt PPL with minimal effort. The lack of comprehensive query capabilities is a critical blocker for Splunk-to-OpenSearch migrations. We added ~20 new commands in PPL-on-Spark, but dozens of command gaps remain to be filled, along with a large number of functions still to be implemented.
25+
26+
### 2.2 Lack of Unified PPL Experience
27+
The PPL language is currently inconsistent across [PPL-on-OpenSearch](https://github.com/opensearch-project/sql/blob/main/ppl/src/main/antlr/OpenSearchPPLParser.g4) and [PPL-on-Spark](https://github.com/opensearch-project/opensearch-spark/blob/main/ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4). Many new commands added in PPL-on-Spark, such as `join`, `lookup`, and `subsearch`, are not yet supported in PPL-on-OpenSearch. As more commands and functions are implemented in PPL-on-Spark, this gap will continue to widen.
28+
29+
### 2.3 Lack of mature query optimizer
30+
Although the V2 engine framework comes with an optimizer class, it has only a few pushdown optimization rules and lacks the mature optimization rules and cost-based optimizer found in traditional databases. Query performance and scalability are core to PPL's design, which aims to handle high-performance queries efficiently and scale to large datasets and complex queries.
31+
32+
### 2.4 Lack of robustness
33+
While PPL has 5,000+ integration tests, it lacks the robustness of more mature systems like SQLite (with over 1 million tests) or PostgreSQL (with 60,000+ tests), impacting both performance and development effort.
34+
35+
---
36+
## 3. High Level View
37+
38+
### 3.1 Proposal
39+
To address the pain points above, the V3 engine integrates Apache Calcite as a core component.
40+
41+
- Apache Calcite is a straightforward, widely used library that helps us solve the **first pain point**. It fully supports ANSI SQL, including very complex queries, and contains many built-in functions of commercial databases.
42+
- The **second pain point** is mainly due to the significant development effort required to implement a large number of PPL commands for different engines (Spark and OpenSearch). Apache Calcite cannot solve this pain point in the short term. In the long term, however, PPL statements can be translated with Apache Calcite into various SQL dialects, including SparkSQL, so the current implementation of PPL-on-Spark could be deprecated and the implementations of PPL-on-OpenSearch and PPL-on-Spark unified.
43+
- The built-in RBO and CBO optimizers in Apache Calcite, as well as features such as materialized views, can effectively address the **third pain point**.
44+
- Apache Calcite is widely adopted by a variety of organizations and has gained industry support from major players in the data management and analytics space. Although it does not have the same level of robustness as commercial databases, the Apache Calcite open source community is active. Using existing Apache Calcite code for logical plan optimization and physical operator execution is more robust and less costly than writing code from scratch, which addresses the **fourth pain point**.
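
To make the rule-based optimization idea concrete, here is a minimal Python sketch of a single rewrite rule that moves a filter below a projection, so that the filter ends up next to the scan, where it could be pushed down to the data source. The `Scan`, `Project`, and `Filter` classes and the `push_filter_below_project` function are hypothetical names used only for illustration; they are not Calcite's or the SQL plugin's API.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    input: "Plan"
    fields: Tuple[str, ...]


@dataclass(frozen=True)
class Filter:
    input: "Plan"
    field: str      # field referenced by the predicate
    op: str
    value: object


Plan = Union[Scan, Project, Filter]


def push_filter_below_project(plan: Plan) -> Plan:
    """Rewrite Filter(Project(x)) into Project(Filter(x)) when it is safe."""
    if isinstance(plan, Filter) and isinstance(plan.input, Project):
        project = plan.input
        # Safe only if the predicate's field is passed through by the projection.
        if plan.field in project.fields:
            pushed = Filter(project.input, plan.field, plan.op, plan.value)
            return Project(push_filter_below_project(pushed), project.fields)
    if isinstance(plan, (Project, Filter)):
        # Recurse into the child and rebuild this node around the result.
        child = push_filter_below_project(plan.input)
        if isinstance(plan, Project):
            return Project(child, plan.fields)
        return Filter(child, plan.field, plan.op, plan.value)
    return plan


before = Filter(Project(Scan("accounts"), ("age", "state")), "age", ">", 30)
after = push_filter_below_project(before)
```

A real rule-based optimizer applies many such rules repeatedly until no rule matches, and a cost-based optimizer additionally compares the estimated cost of alternative plans; this sketch shows only the shape of one rule.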
45+
46+
### 3.2 Architecture
47+
48+
The current PPL grammar and existing query AST are widely used (both in PPL-on-OpenSearch and PPL-on-Spark). In addition, the ANTLR4 grammar is used in polyglot validators in the frontend and backend (Java validator, J/S validator, Python validator). To keep the ANTLR4 grammar and reuse the existing AST, the integration architecture looks as follows:
49+
![Architecture Overview](img/architecture-of-calcite-integration.png)
50+
51+
---
52+
## 4. Implementation
53+
54+
### 4.1 PPL-on-OpenSearch
55+
```
56+
PPL -> ANTLR -> AST -> RelNode(Calcite) -> EnumerableRel(Calcite) -> OpenSearchEnumerableRel -> OpenSearch API
57+
SQL -> ANTLR -> AST -> RelNode(Calcite) -> EnumerableRel(Calcite) -> OpenSearchEnumerableRel -> OpenSearch API
58+
```
59+
60+
### 4.2 PPL-on-Spark
61+
```
62+
Short-term: PPL -> ANTLR -> AST -> LogicalPlan(Spark) -> PhysicalPlan(Spark) -> tasks (Spark runtime)
63+
Long-term: PPL -> ANTLR -> AST -> RelNode(Calcite) -> SparkSQL API -> tasks (Spark runtime)
64+
```
65+
The main tasks in this implementation include:
66+
67+
1. Add a PPLParser (ANTLR4 based) to parse PPL statement into existing AST nodes
68+
2. Convert OpenSearch schemas into OpenSearchRelDatatype (partly reuse existing code)
69+
3. Traverse the AST and convert its nodes into RexNodes and RelNodes
70+
4. Map some basic PPL UDFs to Calcite SQL operators
71+
5. Build Calcite UDFs for any other PPL UDFs
72+
6. Add optimizer rules for OpenSearch-specific cases and commands
73+
7. Have all pushdown features working
74+
8. Implement other RelNodes so that PPL can be translated into SparkSQL
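
Tasks 1 and 3 can be pictured with a small Python sketch: a PPL-like pipeline is split into command nodes (standing in for the ANTLR-built AST), and each node is then converted, in order, into a nested relational operator. The mini-grammar, the tuple-based node shapes, and the function names `parse` and `to_rel` are assumptions made for illustration only; the real engine uses the ANTLR4 grammar and Calcite's `RelNode` and `RexNode` classes.

```python
from typing import List, Tuple

# An AST here is just an ordered list of (command, argument) pairs.
Ast = List[Tuple[str, str]]


def parse(ppl: str) -> Ast:
    """Split 'source=t | where ... | fields ...' into command nodes."""
    ast: Ast = []
    for segment in (s.strip() for s in ppl.split("|")):
        if segment.startswith("source="):
            ast.append(("source", segment[len("source="):].strip()))
        else:
            command, _, argument = segment.partition(" ")
            ast.append((command, argument.strip()))
    return ast


def to_rel(ast: Ast) -> tuple:
    """Fold the command list into a nested relational-operator tree."""
    rel: tuple = ()
    for command, argument in ast:
        if command == "source":
            rel = ("Scan", argument)
        elif command == "where":
            rel = ("Filter", argument, rel)
        elif command == "fields":
            names = tuple(n.strip() for n in argument.split(","))
            rel = ("Project", names, rel)
        else:
            raise ValueError(f"unsupported command: {command}")
    return rel


plan = to_rel(parse("source=accounts | where age > 30 | fields firstname, age"))
```

Because each PPL command consumes the output of the previous one, the conversion is a left-to-right fold: every command wraps the plan built so far in one more operator.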
75+
76+
### 4.3 RelBuilder
77+
78+
Because PPL is a non-SQL front-end language, we add a PPL parser, convert its AST to relational algebra using `RelBuilder`, and execute the query against the OpenSearch API.
79+
80+
`RelBuilder` was created to support third-party front-end languages other than SQL. It builds relational algebra, so writing a new relational language is fairly straightforward: you write a parser and generate relational algebra using `RelBuilder`. The Calcite backend then knows how to execute the query against the execution system via an adapter.
81+
![RelBuilder](img/relbuilder.png)
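
Calcite's `RelBuilder` works on a stack of relational expressions: a scan pushes a new expression, single-input operators replace the top of the stack, and a join consumes the top two entries. The Python class below is a toy stand-in that mimics this calling style; `MiniRelBuilder` and its tuple-based plan nodes are hypothetical and do not reproduce Calcite's actual method signatures.

```python
class MiniRelBuilder:
    """Toy stack-based plan builder, loosely modeled on how RelBuilder is used."""

    def __init__(self):
        self._stack = []

    def scan(self, table):
        # Push a new leaf onto the stack.
        self._stack.append(("Scan", table))
        return self

    def filter(self, condition):
        # Replace the top of the stack with a Filter over it.
        child = self._stack.pop()
        self._stack.append(("Filter", condition, child))
        return self

    def project(self, *fields):
        child = self._stack.pop()
        self._stack.append(("Project", fields, child))
        return self

    def join(self, join_type, condition):
        # Consume the two most recent expressions: left was pushed first.
        right = self._stack.pop()
        left = self._stack.pop()
        self._stack.append(("Join", join_type, condition, left, right))
        return self

    def build(self):
        # Pop and return the finished plan.
        return self._stack.pop()


plan = (
    MiniRelBuilder()
    .scan("accounts")
    .filter("age > 30")
    .scan("states")
    .join("inner", "accounts.state = states.code")
    .project("firstname", "name")
    .build()
)
```

The stack discipline is what lets a front-end walk its own AST and emit builder calls one node at a time, without having to assemble the operator tree by hand.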
82+
83+
---
84+
## 5. Appendix
85+
86+
### 5.1 Industry Usage - [Hadoop Pig Latin](https://pig.apache.org/docs/r0.17.0/basic.html)
87+
88+
Pig Latin statements are the basic constructs used to process data using Apache Pig. A Pig Latin statement is an operator that takes a [relation](https://pig.apache.org/docs/r0.17.0/basic.html#relations) as input and produces another relation as output. Pig Latin statements may include [expressions](https://pig.apache.org/docs/r0.17.0/basic.html#expressions) and [schemas](https://pig.apache.org/docs/r0.17.0/basic.html#schemas).
89+
90+
LinkedIn created an internal Calcite repo to convert Pig Latin scripts into Calcite logical plans (RelNode). The work included:
91+
92+
- Use Pig parser to parse Pig Latin scripts into Pig logical plan (AST)
93+
- Convert Pig schemas into RelDataType
94+
- Traverse through Pig expressions and convert Pig expressions into RexNodes
95+
- Traverse through Pig logical plans to convert each Pig logical node into a RelNode
96+
- Map some basic Pig UDFs to Calcite SQL operators
97+
- Build Calcite UDFs for any other Pig UDFs, including UDFs written in both Java and Python
98+
- Have an optimizer rule to optimize Pig group/cogroup into Aggregate operators
99+
- Implement other RelNode in Rel2Sql so that Pig Latin can be translated into SQL
100+
101+
This [work](https://issues.apache.org/jira/browse/CALCITE-3122) was contributed to Apache Calcite and named [Piglet](https://calcite.apache.org/javadocAggregate/org/apache/calcite/piglet/package-summary.html). It allows users to write queries in Pig Latin and execute them using any applicable Calcite adapter.
102+
103+
Pig Latin is implemented as a third-party front-end language (dialect) by leveraging `RelBuilder`.
104+
105+
### 5.2 Other Alternative
106+
107+
The current proposal reuses existing AST nodes and leverages `RelBuilder` to implement PPL as a third-party front-end language.
108+
An alternative would be to add a `PPLNode`, similar to `SqlNode`, as the new AST node type, add a new `PPLValidator` to resolve this AST against catalog metadata, and then create a `PPLToRelConverter` (similar to `SqlToRelConverter`) to convert `PPLNode` (AST) to `RelNode` (logical plan). See the following picture.
109+
![alternative](img/alternative.png)
110+

docs/dev/intro-v3-engine.md

Lines changed: 104 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,104 @@
1+
# PPL Engine V3 (for 3.0.0-beta)
2+
3+
---
4+
## 1. Motivations
5+
6+
Previously, we developed [SQL engine V2](../../docs/dev/intro-v2-engine.md) to support both SQL and PPL queries. However, as the complexity of supported SQL and PPL increased, the engine's limitations became increasingly apparent. Two major issues emerged:
7+
8+
1. Insufficient support for complex SQL/PPL queries: The development cycle for new commands such as `Join` and `Subquery` was lengthy, and it was difficult to achieve high robustness.
9+
10+
2. Lack of advanced query plan optimization: The V2 engine supports only a few pushdown optimizations for certain operators and lacks the mature optimization rules and cost-based optimizer found in traditional databases. Query performance and scalability are core to the design of PPL, which aims to handle high-performance queries efficiently and scale to large datasets and complex queries.
11+
12+
### Why Apache Calcite?
13+
14+
Introducing Apache Calcite brings several significant advantages:
15+
16+
1. Enhanced query plan optimization capabilities: Calcite's optimizer can effectively optimize execution plans for both complex SQL and PPL queries.
17+
18+
2. Simplified development of new commands and functions: Expanding the set of PPL commands is one of the key goals in enhancing the PPL language. Calcite helps streamline the development cycle for new commands and functions.
19+
20+
3. Decoupled execution layer: Calcite can be used for both query optimization and execution, or solely for query optimization while delegating execution to other backends such as DataFusion or Velox.
21+
22+
Find more details in [V3 Architecture](./intro-v3-architecture.md).
23+
24+
---
25+
## 2. What's New
26+
27+
In the initial release of the V3 engine (3.0.0-beta), the main new features focus on enhancing the PPL language while maintaining maximum compatibility with V2 behavior.
28+
29+
* **[Join](../user/ppl/cmd/join.rst) Command**
30+
* **[Lookup](../user/ppl/cmd/lookup.rst) Command**
31+
* **[Subquery](../user/ppl/cmd/subquery.rst) Command**
32+
33+
---
34+
## 3. What's Changed
35+
36+
### 3.1 Breaking Changes
37+
38+
Because the internal implementation changed, the following behaviors differ starting from 3.0.0-beta. (The V3 behavior is the correct one.)
39+
40+
| Item | V2 | V3 |
41+
|:------------------------------------------------:|:---------:|:--------------------:|
42+
| Return type of `timestampdiff` | timestamp | int |
43+
| Return type of `regexp` | int | boolean |
44+
| Return type of `count`,`dc`,`distinct_count` | int | bigint |
45+
| Return type of `ceiling`,`floor`,`sign`          |    int    | same type as input   |
46+
| like(firstname, 'Ambe_') on value "Amber JOHnny" | true | false |
47+
| like(firstname, 'Ambe*') on value "Amber JOHnny" | true | false |
48+
| cast(firstname as boolean) | false | null |
49+
| Sum multiple `null` values when pushdown enabled | 0 | null |
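
The two `like` rows are consistent with standard SQL `LIKE` semantics, where `%` matches any sequence of characters, `_` matches exactly one character, `*` has no special meaning, and the pattern must match the whole value. The Python sketch below illustrates those semantics. `sql_like` is a hypothetical helper, and the claim that V3 follows exactly these rules is an assumption drawn from the table, not a description of the engine's implementation.

```python
import re


def sql_like(value: str, pattern: str) -> bool:
    """Evaluate a SQL-style LIKE: % = any run of chars, _ = exactly one char."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            # Everything else, including *, is matched literally.
            parts.append(re.escape(ch))
    # fullmatch: the pattern has to cover the entire value.
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None


value = "Amber JOHnny"
results = {
    "Ambe_": sql_like(value, "Ambe_"),
    "Ambe*": sql_like(value, "Ambe*"),
    "Ambe%": sql_like(value, "Ambe%"),
}
```

Under these rules, `'Ambe_'` fails because the value is longer than five characters, and `'Ambe*'` fails because the value contains no literal `*`. A pattern such as `'Ambe%'` is the one that matches any value starting with `Ambe`.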
50+
51+
52+
### 3.2 Fallback Mechanism
53+
54+
Because the V3 engine is experimental in 3.0.0-beta, not all PPL commands work under the new engine. Unsupported queries are forwarded to the V2 engine by a fallback mechanism, so normally you won't see any difference in the query response. To check whether and why a query fell back to the V2 engine, search the OpenSearch log for "Fallback to V2 query engine since ...".
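
The fallback flow can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: `run_v3`, `run_v2`, `execute`, and the `UnsupportedInV3` exception are invented for illustration, and only the log message prefix is taken from the text above.

```python
import logging

logger = logging.getLogger("ppl")


class UnsupportedInV3(Exception):
    """Raised by the (hypothetical) V3 path for features it cannot handle."""


def run_v3(query: str) -> str:
    # Stand-in for the Calcite-based path; rejects one unsupported command.
    if "trendline" in query:
        raise UnsupportedInV3("trendline is not supported")
    return f"v3:{query}"


def run_v2(query: str) -> str:
    # Stand-in for the legacy engine.
    return f"v2:{query}"


def execute(query: str) -> str:
    """Try V3 first and fall back to V2 when V3 rejects the query."""
    try:
        return run_v3(query)
    except UnsupportedInV3 as reason:
        logger.warning("Fallback to V2 query engine since %s", reason)
        return run_v2(query)
```

The caller always gets a response from one of the two engines, which is why the fallback is only visible in the log.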
55+
56+
### 3.3 Limitations
57+
58+
Queries that use any of the following functionalities are forwarded to the V2 query engine, so they cannot use the new features listed in [2. What's New](#2-whats-new).
59+
60+
#### Unsupported functionalities
61+
- All SQL queries
62+
- `trendline`
63+
- `show datasource`
64+
- `explain`
65+
- `describe`
66+
- `top` and `rare`
67+
- `fillnull`
68+
- `patterns`
69+
- `dedup` with `consecutive=true`
70+
- Search-related commands
71+
  - AD
72+
  - ML
73+
  - Kmeans
74+
- Commands with `fetch_size` parameter
75+
- Queries with metadata fields such as `_id`, `_doc`, etc.
76+
- JSON-related functions
77+
  - cast to json
78+
  - json
79+
  - json_valid
80+
- Search-related functions
81+
  - match
82+
  - match_phrase
83+
  - match_bool_prefix
84+
  - match_phrase_prefix
85+
  - simple_query_string
86+
  - query_string
87+
  - multi_match
88+
- [Existing limitations of V2](intro-v2-engine.md#33-limitations)
89+
90+
---
91+
## 4. How It's Implemented
92+
93+
If you're interested in the new query engine, please find more details in [V3 Architecture](./intro-v3-architecture.md).
94+
95+
---
96+
## 5. What's Next
97+
98+
The following items are on our roadmap with high priority:
99+
- Resolve the [V3 limitations](#33-limitations)
100+
- Advance pushdown optimization and benchmarking
101+
- Backport to 2.19.x
102+
- Unify the PPL syntax between [PPL-on-OpenSearch](https://github.com/opensearch-project/sql/blob/main/ppl/src/main/antlr/OpenSearchPPLParser.g4) and [PPL-on-Spark](https://github.com/opensearch-project/opensearch-spark/blob/main/ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4)
103+
- Support more DSL aggregations
104+
