# Query Execution with the Integration Test Docker Cluster

The integration test docker cluster can be used for the following tests:
* SQL/PPL queries on Spark using local tables
* SQL/PPL queries on Spark using external tables with data stored in MinIO(S3)
* SQL/PPL queries on OpenSearch of OpenSearch indices
* SQL/PPL async queries on OpenSearch of data stored in S3

In all cases, SQL or PPL queries can be used and the processing is very similar. At most there may be a minor
difference in the query request.

## SQL/PPL Queries on Spark Using Local Tables

Connect directly to the Spark master node and execute a query. You can connect using Spark Connect, by
submitting a job, or by running `spark-shell` in the Docker container. Execute `sql()` calls on the
`SparkSession` object.

Local tables are tables created in Spark that are not external tables. Their metadata and data are stored
in the Spark master container.
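
For example, a local table could be created and queried with statements like these (the table name and
schema are illustrative):

```sql
-- Illustrative: a managed (non-external) table; Spark stores both the
-- metadata and the data inside the Spark master container.
CREATE TABLE my_local_table (id INT, name STRING);
INSERT INTO my_local_table VALUES (1, 'a'), (2, 'b');
SELECT id, name FROM my_local_table;
```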

Spark will begin query processing by assuming that the query is a PPL query. If it fails to parse in PPL, then
it will fall back to parsing it as a SQL query.

After parsing the query, Spark will look up the metadata for the table(s) and perform the query. The only other
container that may be involved in processing the request is the Spark worker container.
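
The try-PPL-then-SQL fallback can be sketched as a simple dispatcher. This is only an illustration with
hypothetical helper names; the real implementation runs the full PPL grammar parser rather than checking
prefixes:

```python
def looks_like_ppl(query: str) -> bool:
    # Rough stand-in for the PPL parser: PPL queries begin with a search
    # command such as `source=<table>` or `search source=<table>`.
    return query.lstrip().lower().startswith(("source", "search"))

def dispatch(query: str) -> str:
    """Return which parser would handle the query in this sketch."""
    if looks_like_ppl(query):
        return "ppl"
    # Parsing as PPL failed, so fall back to Spark's SQL parser.
    return "sql"
```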

## SQL/PPL Queries on Spark Using External Tables with Data Stored in MinIO(S3)

Connect directly to the Spark master node and execute a query. You can connect using Spark Connect, by
submitting a job, or by running `spark-shell` in the Docker container. Execute `sql()` calls on the
`SparkSession` object.

External tables are tables created in Spark that have an `s3a://` location. The metadata is stored in
Hive and the data is stored in MinIO(S3).

Spark will begin query processing by assuming that the query is a PPL query. If it fails to parse in PPL, then
it will fall back to parsing it as a SQL query.

After parsing the query, Spark will look up the metadata for the table(s) from Hive and perform the query. It
will retrieve the table data from MinIO(S3).
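
Such an external table might be created with a statement like the following (the table name, schema, and
bucket path are placeholders, not names used by the cluster):

```sql
-- Illustrative: the metadata goes to Hive, the data lives in MinIO(S3)
-- at the `s3a://` location.
CREATE EXTERNAL TABLE my_external_table (
    id INT,
    name STRING
)
LOCATION 's3a://my-bucket/my_external_table/';
```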

## SQL/PPL Queries on OpenSearch of OpenSearch Indices

Connect directly to the OpenSearch container to submit queries. Use the
[SQL and PPL API](https://opensearch.org/docs/latest/search-plugins/sql/sql-ppl-api/).

The indices are stored in the OpenSearch container.
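
A request to the PPL endpoint can be built with the standard library. The cluster address and index name
are assumptions for a local Docker setup; adjust them as needed:

```python
import json
import urllib.request

# Assumed address of the OpenSearch container's mapped port.
OPENSEARCH_URL = "http://localhost:9200"

def build_ppl_request(query: str) -> urllib.request.Request:
    # The SQL/PPL API accepts a JSON body with a single `query` field.
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{OPENSEARCH_URL}/_plugins/_ppl",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_ppl_request("source=my_index | fields field1, field2")
# Actually sending the request requires the cluster to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

For SQL queries, the same body is posted to `_plugins/_sql` instead.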

## SQL/PPL Async Queries on OpenSearch of Data Stored in S3

Connect directly to the OpenSearch container to submit queries. Use the
[Async Query Interface](https://github.com/opensearch-project/sql/blob/main/docs/user/interfaces/asyncqueryinterface.rst).
This type of query simulates querying an S3/Glue datasource in OpenSearch.

The table metadata is stored in Hive and the table data is stored in MinIO(S3).
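
An async query submission can be sketched the same way. The cluster address and table name are
assumptions; the endpoint and body fields follow the Async Query Interface documentation linked above:

```python
import json
import urllib.request

# Assumed address of the OpenSearch container's mapped port.
OPENSEARCH_URL = "http://localhost:9200"

def build_async_query_request(query: str, datasource: str = "mys3") -> urllib.request.Request:
    # The async query API takes the datasource name, the query language,
    # and the query text.
    body = json.dumps({
        "datasource": datasource,
        "lang": "sql",
        "query": query,
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{OPENSEARCH_URL}/_plugins/_async_query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_async_query_request("SELECT * FROM mys3.default.my_table LIMIT 5")
# With a live cluster, the response contains the query ID and session ID.
```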

There are three phases to query processing:
1. Setup
2. Processing
3. Results Retrieval

OpenSearch will use two special indices.
1. `.query_execution_request_[DATASOURCE_NAME]` - In the integration test Docker cluster, the datasource is
   named `mys3`. When an Async Query request is received, an entry is added to this index. The entry contains
   the query as well as its state. The state is updated as the request is processed.
2. `query_execution_result_[DATASOURCE_NAME]` - In the integration test Docker cluster, the datasource is
   named `mys3`. An entry is added to this index when the results are ready. The entry contains the results of
   the query.

Temporary Docker containers are used. They are Apache Spark containers and run jobs locally.

### Setup

The setup phase starts when OpenSearch receives an Async Query API request and continues until the query
ID and session ID are returned to the client.

1. Check if the index `.query_execution_request_[DATASOURCE_NAME]` exists.
2. If `.query_execution_request_[DATASOURCE_NAME]` does not exist, then create it.
3. Insert the request into `.query_execution_request_[DATASOURCE_NAME]`.
4. Return the query ID and session ID.
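
The setup steps above can be sketched with a stubbed in-memory index. The class and method names are
hypothetical, not the plugin's actual code:

```python
import uuid

# Hypothetical stand-in for the OpenSearch request index; the real plugin
# stores these documents in `.query_execution_request_[DATASOURCE_NAME]`.
class RequestIndex:
    def __init__(self):
        self.created = False
        self.docs = []

    def ensure_exists(self):
        # Steps 1-2: create the index only if it does not already exist.
        self.created = True

    def insert(self, doc):
        # Step 3: record the request along with its initial state.
        self.docs.append(doc)

def handle_async_query(index, query):
    index.ensure_exists()
    query_id = uuid.uuid4().hex
    session_id = uuid.uuid4().hex
    index.insert({"queryId": query_id, "sessionId": session_id,
                  "query": query, "state": "waiting"})
    # Step 4: return both IDs to the client.
    return query_id, session_id

index = RequestIndex()
query_id, session_id = handle_async_query(index, "SELECT * FROM mys3.default.my_table")
```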

### Processing

The processing phase starts when checking whether there is a container running for the request's session and
continues until the results are added to the `query_execution_result_[DATASOURCE_NAME]` index.

1. Check if there is a Spark container already running for the request's session.
2. If a Spark container is not running for the request's session, then use Docker to start one.
   1. Docker initializes and starts the Spark container for the session.
3. Spark container checks if the index `query_execution_result_[DATASOURCE_NAME]` exists.
4. If the index `query_execution_result_[DATASOURCE_NAME]` does not exist, then create it.
5. Spark container searches the `.query_execution_request_[DATASOURCE_NAME]` index for the next request
   in the session to process.
6. Spark container identifies the tables in the query and gets their metadata from the Hive container.
7. Spark container retrieves the table data from the MinIO(S3) container.
8. Spark container writes the results to the index `query_execution_result_[DATASOURCE_NAME]`.

The Spark container will keep looping through steps 5-8 until it reaches its timeout (currently set to 180 seconds).
Once the timeout is reached, the Spark container will shut down.
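
The per-session loop over steps 5-8 can be sketched as follows. The helper names are illustrative, and the
clock is injectable so the loop can be simulated without waiting:

```python
import time

SESSION_TIMEOUT_SECONDS = 180  # matches the timeout described above

def run_session(fetch_next_request, execute, write_result,
                timeout=SESSION_TIMEOUT_SECONDS, clock=time.monotonic):
    # Keep serving requests for this session until the timeout elapses.
    deadline = clock() + timeout
    served = 0
    while clock() < deadline:
        request = fetch_next_request()   # step 5: poll the request index
        if request is None:
            continue                     # nothing queued yet; poll again
        result = execute(request)        # steps 6-7: Hive metadata + S3 data
        write_result(result)             # step 8: write to the result index
        served += 1
    # Timeout reached: the container shuts down.
    return served
```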

### Results Retrieval

The results retrieval phase can happen any time after the results for the query have been added to the index
`query_execution_result_[DATASOURCE_NAME]`.

1. The client requests the results of a previously submitted query from the OpenSearch container using the query ID
   received earlier.
2. The OpenSearch container searches the index `query_execution_result_[DATASOURCE_NAME]` for the results of the
   query.
3. The OpenSearch container returns the query results to the client.
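
The retrieval request itself is a simple GET keyed by the query ID. The cluster address and query ID below
are placeholders:

```python
import urllib.request

# Assumed address of the OpenSearch container's mapped port.
OPENSEARCH_URL = "http://localhost:9200"

def build_results_request(query_id: str) -> urllib.request.Request:
    # Fetch the status (and, once complete, the results) of a previously
    # submitted async query by its ID.
    return urllib.request.Request(
        url=f"{OPENSEARCH_URL}/_plugins/_async_query/{query_id}",
        method="GET",
    )

req = build_results_request("abc123")
# With a live cluster:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```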