
Commit 98579e1

Added a framework for end-to-end tests (#1022)
* Added a framework for end-to-end tests
* Only contains sample queries, not a full suite. All tests make use of the integration test docker cluster.
* Can run the tests with "sbt e2etest/test"

Signed-off-by: Norman Jordan <[email protected]>

* Added documentation for integ-test cluster
* Documented how queries are processed in the integ-test cluster
* Documented how to use the Query Workbench with the integ-test cluster
* Removed the shading of Jackson libraries (fixes #973)

Signed-off-by: Norman Jordan <[email protected]>

---------

Signed-off-by: Norman Jordan <[email protected]>
1 parent 3832906 commit 98579e1

42 files changed (+1229 -32 lines)

build.sbt (+22 -3)
@@ -55,9 +55,6 @@ lazy val testScalastyle = taskKey[Unit]("testScalastyle")
   // - .inAll applies the rule to all dependencies, not just direct dependencies
   val packagesToShade = Seq(
     "com.amazonaws.cloudwatch.**",
-    "com.fasterxml.jackson.core.**",
-    "com.fasterxml.jackson.dataformat.**",
-    "com.fasterxml.jackson.databind.**",
     "com.google.**",
     "com.sun.jna.**",
     "com.thoughtworks.paranamer.**",
@@ -325,6 +322,28 @@ lazy val integtest = (project in file("integ-test"))
 lazy val integration = taskKey[Unit]("Run integration tests")
 lazy val awsIntegration = taskKey[Unit]("Run AWS integration tests")
 
+lazy val e2etest = (project in file("e2e-test"))
+  .dependsOn(flintCommons % "test->package", flintSparkIntegration % "test->package", pplSparkIntegration % "test->package", sparkSqlApplication % "test->package")
+  .settings(
+    commonSettings,
+    name := "e2e-test",
+    scalaVersion := scala212,
+    libraryDependencies ++= Seq(
+      "org.scalatest" %% "scalatest" % "3.2.15" % "test",
+      "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.3" % "test",
+      "com.amazonaws" % "aws-java-sdk-s3" % "1.12.568" % "test",
+      "com.softwaremill.sttp.client3" %% "core" % "3.10.2" % "test",
+      "com.softwaremill.sttp.client3" %% "play2-json" % "3.10.2",
+      "com.typesafe.play" %% "play-json" % "2.9.2" % "test",
+    ),
+    libraryDependencies ++= deps(sparkVersion),
+    javaOptions ++= Seq(
+      s"-DappJar=${(sparkSqlApplication / assembly).value.getAbsolutePath}",
+      s"-DextensionJar=${(flintSparkIntegration / assembly).value.getAbsolutePath}",
+      s"-DpplJar=${(pplSparkIntegration / assembly).value.getAbsolutePath}",
+    )
+  )
+
 lazy val standaloneCosmetic = project
   .settings(
     name := "opensearch-spark-standalone",

docker/integ-test/configuration-updater/apply-configuration.sh (+46 -26)
@@ -20,13 +20,26 @@ curl -q \
   -H 'Content-Type: application/json' \
   -d '{"name": "integ-test", "versioning": {"enabled": true, "excludePrefixes": [], "excludeFolders": false}, "locking": true}' \
   http://minio-S3:9001/api/v1/buckets
-# Create the access key
+# Create the test-resources bucket
 curl -q \
   -b /tmp/minio-cookies.txt \
   -X POST \
   -H 'Content-Type: application/json' \
-  -d "{\"policy\": \"\", \"accessKey\": \"${S3_ACCESS_KEY}\", \"secretKey\": \"${S3_SECRET_KEY}\", \"description\": \"\", \"comment\": \"\", \"name\": \"\", \"expiry\": null}" \
-  http://minio-S3:9001/api/v1/service-account-credentials
+  -d '{"name": "test-resources", "versioning": {"enabled": false, "excludePrefixes": [], "excludeFolders": false}, "locking": true}' \
+  http://minio-S3:9001/api/v1/buckets
+# Create the access key
+curl -q \
+  -b /tmp/minio-cookies.txt \
+  -X GET
+  "http://minio-S3:9001/api/v1/service-accounts/${S3_ACCESS_KEY}"
+if [ "$?" -ne "0" ]; then
+  curl -q \
+    -b /tmp/minio-cookies.txt \
+    -X POST \
+    -H 'Content-Type: application/json' \
+    -d "{\"policy\": \"\", \"accessKey\": \"${S3_ACCESS_KEY}\", \"secretKey\": \"${S3_SECRET_KEY}\", \"description\": \"\", \"comment\": \"\", \"name\": \"\", \"expiry\": null}" \
+    http://minio-S3:9001/api/v1/service-account-credentials
+fi
 
 # Login to OpenSearch Dashboards
 echo ">>> Login to OpenSearch dashboards"
@@ -43,31 +56,38 @@ if [ "$?" -eq "0" ]; then
 else
   echo " >>> Login failed"
 fi
+
 # Create the S3/Glue datasource
-echo ">>> Creating datasource"
 curl -q \
   -b /tmp/opensearch-cookies.txt \
-  -X POST \
-  -H 'Content-Type: application/json' \
-  -H 'Osd-Version: 2.18.0' \
-  -H 'Osd-Xsrf: fetch' \
-  -d "{\"name\": \"mys3\", \"allowedRoles\": [], \"connector\": \"s3glue\", \"properties\": {\"glue.auth.type\": \"iam_role\", \"glue.auth.role_arn\": \"arn:aws:iam::123456789012:role/S3Access\", \"glue.indexstore.opensearch.uri\": \"http://opensearch:9200\", \"glue.indexstore.opensearch.auth\": \"basicauth\", \"glue.indexstore.opensearch.auth.username\": \"admin\", \"glue.indexstore.opensearch.auth.password\": \"${OPENSEARCH_ADMIN_PASSWORD}\"}}" \
-  http://opensearch-dashboards:5601/api/directquery/dataconnections
-if [ "$?" -eq "0" ]; then
-  echo " >>> S3 datasource created"
-else
-  echo " >>> Failed to create S3 datasource"
-fi
+  -X GET \
+  http://localhost:5601/api/directquery/dataconnections/mys3
+if [ "$?" -ne "0" ]; then
+  echo ">>> Creating datasource"
+  curl -q \
+    -b /tmp/opensearch-cookies.txt \
+    -X POST \
+    -H 'Content-Type: application/json' \
+    -H 'Osd-Version: 2.18.0' \
+    -H 'Osd-Xsrf: fetch' \
+    -d "{\"name\": \"mys3\", \"allowedRoles\": [], \"connector\": \"s3glue\", \"properties\": {\"glue.auth.type\": \"iam_role\", \"glue.auth.role_arn\": \"arn:aws:iam::123456789012:role/S3Access\", \"glue.indexstore.opensearch.uri\": \"http://opensearch:9200\", \"glue.indexstore.opensearch.auth\": \"basicauth\", \"glue.indexstore.opensearch.auth.username\": \"admin\", \"glue.indexstore.opensearch.auth.password\": \"${OPENSEARCH_ADMIN_PASSWORD}\"}}" \
+    http://opensearch-dashboards:5601/api/directquery/dataconnections
+  if [ "$?" -eq "0" ]; then
+    echo " >>> S3 datasource created"
+  else
+    echo " >>> Failed to create S3 datasource"
+  fi
 
-echo ">>> Setting cluster settings"
-curl -v \
-  -u "admin:${OPENSEARCH_ADMIN_PASSWORD}" \
-  -X PUT \
-  -H 'Content-Type: application/json' \
-  -d '{"persistent": {"plugins.query.executionengine.spark.config": "{\"applicationId\":\"integ-test\",\"executionRoleARN\":\"arn:aws:iam::xxxxx:role/emr-job-execution-role\",\"region\":\"us-west-2\", \"sparkSubmitParameters\": \"--conf spark.dynamicAllocation.enabled=false\"}"}}' \
-  http://opensearch:9200/_cluster/settings
-if [ "$?" -eq "0" ]; then
-  echo " >>> Successfully set cluster settings"
-else
-  echo " >>> Failed to set cluster settings"
+  echo ">>> Setting cluster settings"
+  curl -v \
+    -u "admin:${OPENSEARCH_ADMIN_PASSWORD}" \
+    -X PUT \
+    -H 'Content-Type: application/json' \
+    -d '{"persistent": {"plugins.query.executionengine.spark.config": "{\"applicationId\":\"integ-test\",\"executionRoleARN\":\"arn:aws:iam::xxxxx:role/emr-job-execution-role\",\"region\":\"us-west-2\", \"sparkSubmitParameters\": \"--conf spark.dynamicAllocation.enabled=false\"}"}}' \
+    http://opensearch:9200/_cluster/settings
+  if [ "$?" -eq "0" ]; then
+    echo " >>> Successfully set cluster settings"
+  else
+    echo " >>> Failed to set cluster settings"
+  fi
 fi

docker/integ-test/docker-compose.yml (+2 -3)
@@ -103,9 +103,8 @@ services:
         FLINT_JAR: ${FLINT_JAR}
         PPL_JAR: ${PPL_JAR}
         SQL_APP_JAR: ${SQL_APP_JAR}
-    depends_on:
-      metastore:
-        condition: service_completed_successfully
+    entrypoint: /bin/bash
+    command: exit
 
   opensearch:
     build: ./opensearch
New documentation file (+112 lines):

# Query Execution with the Integration Test Docker Cluster

The integration test docker cluster can be used for the following tests:
* SQL/PPL queries on Spark using local tables
* SQL/PPL queries on Spark using external tables with data stored in MinIO(S3)
* SQL/PPL queries on OpenSearch of OpenSearch indices
* SQL/PPL async queries on OpenSearch of data stored in S3

In all cases, SQL or PPL queries can be used and the processing is very similar. At most there may be a minor difference in the query request.
## SQL/PPL Queries on Spark Using Local Tables

Connect directly to the Spark master node and execute a query. You can connect using Spark Connect, by submitting a job, or by running `spark-shell` on the Docker container. Execute `sql()` calls on the SparkSession object.

Local tables are tables that were created in Spark and are not external tables. The metadata and data are stored in the Spark master container.

Spark will begin query processing by assuming that the query is a PPL query. If it fails to parse in PPL, then it will fall back to parsing it as a SQL query.

After parsing the query, Spark will look up the metadata for the table(s) and perform the query. The only other container that may be involved in processing the request is the Spark worker container.
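The sketch below shows one way to run such a query from outside the cluster using the Spark Connect Scala client (the same client the new e2e tests depend on). It is a minimal sketch, not part of the test suite: the endpoint `sc://localhost:15002` and the table name `demo` are assumptions about how the docker-compose setup exposes the Spark master, so adjust them to your environment.

```scala
import org.apache.spark.sql.SparkSession

object LocalTableQueryExample {
  def main(args: Array[String]): Unit = {
    // Assumed Spark Connect endpoint exposed by the Spark master container.
    val spark = SparkSession.builder()
      .remote("sc://localhost:15002")
      .getOrCreate()

    // A local (non-external) table: metadata and data live on the Spark master.
    spark.sql("CREATE TABLE IF NOT EXISTS demo (id INT, name STRING)")
    spark.sql("INSERT INTO demo VALUES (1, 'a'), (2, 'b')")
    spark.sql("SELECT name FROM demo WHERE id = 1").show()

    spark.close()
  }
}
```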
## SQL/PPL Queries on Spark Using External Tables with Data Stored in MinIO(S3)

Connect directly to the Spark master node and execute a query. You can connect using Spark Connect, by submitting a job, or by running `spark-shell` on the Docker container. Execute `sql()` calls on the SparkSession object.

External tables are tables that were created in Spark with an `s3a://` location. The metadata is stored in Hive and the data is stored in MinIO(S3).

Spark will begin query processing by assuming that the query is a PPL query. If it fails to parse in PPL, then it will fall back to parsing it as a SQL query.

After parsing the query, Spark will look up the metadata for the table(s) from Hive and perform the query. It will retrieve table data from MinIO(S3).

![Queries for Spark Master](images/queries-for-spark-master.png "Queries for Spark Master")
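As a rough illustration, an external table can be created and queried the same way as in the previous sketch (or from `spark-shell` on the master container). The table definition and the bucket path `s3a://integ-test/demo_external/` are assumptions; the cluster's Spark configuration is expected to already point the `s3a://` filesystem at MinIO.

```scala
import org.apache.spark.sql.SparkSession

object ExternalTableQueryExample {
  def main(args: Array[String]): Unit = {
    // Assumed Spark Connect endpoint, as in the previous sketch.
    val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()

    // A LOCATION outside the warehouse makes this an unmanaged (external) table:
    // Hive keeps the metadata, MinIO(S3) keeps the data files. The path is illustrative.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS demo_external (id INT, name STRING)
        |USING CSV
        |LOCATION 's3a://integ-test/demo_external/'
        |""".stripMargin)
    spark.sql("INSERT INTO demo_external VALUES (1, 'a'), (2, 'b')")
    spark.sql("SELECT count(*) AS row_count FROM demo_external").show()

    spark.close()
  }
}
```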
## SQL/PPL Queries on OpenSearch of OpenSearch Indices

Connect directly to the OpenSearch container to submit queries. Use the [SQL and PPL API](https://opensearch.org/docs/latest/search-plugins/sql/sql-ppl-api/).

The indices are stored in the OpenSearch container.
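Any HTTP client can call these APIs; the hedged sketch below uses the sttp client that the new e2e-test project already depends on. The host `localhost:9200`, the admin credentials taken from the environment, and the index name `my_index` are assumptions about how the cluster is exposed.

```scala
import sttp.client3._

object OpenSearchSqlExample {
  def main(args: Array[String]): Unit = {
    val backend = HttpURLConnectionBackend()
    val password = sys.env("OPENSEARCH_ADMIN_PASSWORD") // from docker/integ-test/.env

    // SQL query against an OpenSearch index (assumed name: my_index)
    val sqlResponse = basicRequest
      .auth.basic("admin", password)
      .contentType("application/json")
      .body("""{"query": "SELECT * FROM my_index LIMIT 5"}""")
      .post(uri"http://localhost:9200/_plugins/_sql")
      .send(backend)
    println(sqlResponse.body)

    // The equivalent PPL request goes to /_plugins/_ppl
    val pplResponse = basicRequest
      .auth.basic("admin", password)
      .contentType("application/json")
      .body("""{"query": "source=my_index | head 5"}""")
      .post(uri"http://localhost:9200/_plugins/_ppl")
      .send(backend)
    println(pplResponse.body)
  }
}
```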
## SQL/PPL Async Queries on OpenSearch of Data Stored in S3

Connect directly to the OpenSearch container to submit queries. Use the [Async Query Interface](https://github.com/opensearch-project/sql/blob/main/docs/user/interfaces/asyncqueryinterface.rst). This type of query simulates querying an S3/Glue datasource in OpenSearch.

The table metadata is stored in Hive and the table data is stored in MinIO(S3).

There are three phases to query processing:
1. Setup
2. Processing
3. Results Retrieval

OpenSearch will use two special indices.
1. `.query_execution_request_[DATASOURCE_NAME]` - In the integration test Docker cluster, the datasource is named `mys3`. When an Async Query request is received, an entry is added to this index. The entry contains the query as well as its state. The state is updated as the request is processed.
2. `query_execution_result_[DATASOURCE_NAME]` - In the integration test Docker cluster, the datasource is named `mys3`. An entry is added to this index when the results are ready. The entry contains the results of the query.

Temporary Docker containers are used. They are Apache Spark containers and run jobs locally.

![Queries for Async Query API](images/queries-for-async-api.png "Queries for Async Query API")
### Setup

The setup phase is started when OpenSearch receives an Async Query API request and continues until the query ID and session ID are returned to the client (a request sketch follows the list below).

1. Check if the index `.query_execution_request_[DATASOURCE_NAME]` exists.
2. If `.query_execution_request_[DATASOURCE_NAME]` does not exist, then create it.
3. Insert the request into `.query_execution_request_[DATASOURCE_NAME]`
4. Return the query ID and session ID
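The following is a minimal sketch of submitting such a request with sttp, assuming the cluster is reachable at `localhost:9200` and that a table such as `mys3.default.http_logs` exists; the `_plugins/_async_query` endpoint and request fields are those described in the Async Query Interface documentation linked above.

```scala
import sttp.client3._

object AsyncQuerySubmitExample {
  def main(args: Array[String]): Unit = {
    val backend = HttpURLConnectionBackend()
    val password = sys.env("OPENSEARCH_ADMIN_PASSWORD")

    // Submit an async query against the S3/Glue datasource named "mys3".
    // The table name below is an assumption; use one that exists in your Glue/Hive metadata.
    val response = basicRequest
      .auth.basic("admin", password)
      .contentType("application/json")
      .body("""{"datasource": "mys3", "lang": "sql", "query": "SELECT * FROM mys3.default.http_logs LIMIT 5"}""")
      .post(uri"http://localhost:9200/_plugins/_async_query")
      .send(backend)

    // The response body contains the queryId and sessionId used later.
    println(response.body)
  }
}
```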
### Processing

The processing phase starts when checking if there is a container running for the request's session and continues until the results are added to the `query_execution_result_[DATASOURCE_NAME]` index.

1. Check if there is a Spark container already running for the request's session
2. If a Spark container is not running for the request's session, then use Docker to start one.
   1. Docker initializes and starts the Spark container for the session
3. Spark container checks if the index `query_execution_result_[DATASOURCE_NAME]` exists.
4. If the index `query_execution_result_[DATASOURCE_NAME]` does not exist, then create it.
5. Spark container searches the `.query_execution_request_[DATASOURCE_NAME]` index for the next request in the session to process.
6. Spark container identifies the tables in the query and gets their metadata from the Hive container
7. Spark container retrieves the table data from the MinIO(S3) container
8. Spark container writes the results to the index `query_execution_result_[DATASOURCE_NAME]`

The Spark container will keep looping through steps 5-8 until it reaches its timeout (currently set to 180 seconds). Once the timeout is reached, the Spark container will shut down.
### Results Retrieval

The results retrieval phase can happen any time after the results for the query have been added to the index `query_execution_result_[DATASOURCE_NAME]` (a retrieval sketch follows the list below).

1. The client requests the results of a previously submitted query from the OpenSearch container using the query ID received earlier.
2. The OpenSearch container searches the index `query_execution_result_[DATASOURCE_NAME]` for the results of the query.
3. The OpenSearch container returns the query results to the client.
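Continuing the sttp sketch from the Setup phase, the results can be fetched with the query ID returned earlier; the host, credentials, and the `QUERY_ID` environment variable are assumptions for illustration.

```scala
import sttp.client3._

object AsyncQueryResultsExample {
  def main(args: Array[String]): Unit = {
    val backend = HttpURLConnectionBackend()
    val password = sys.env("OPENSEARCH_ADMIN_PASSWORD")
    val queryId = sys.env("QUERY_ID") // queryId returned by the submit request

    // Fetch the results (or current status) of the async query.
    val response = basicRequest
      .auth.basic("admin", password)
      .get(uri"http://localhost:9200/_plugins/_async_query/$queryId")
      .send(backend)

    // While the query is still running the body reports its state;
    // once finished it contains the schema and datarows of the result.
    println(response.body)
  }
}
```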
New documentation file (+33 lines):

# Using the Query Workbench in OpenSearch Dashboards

The integration test Docker cluster contains an OpenSearch Dashboards container. This container can be used as a web interface for querying data in the cluster.

[Query Workbench Documentation](https://opensearch.org/docs/latest/dashboards/query-workbench/)

## Logging in to OpenSearch Dashboards

* URL - `http://localhost:5601`
* Username: `admin`
* Password: The password is in the file `docker/integ-test/.env`. It is the value of `OPENSEARCH_ADMIN_PASSWORD`.

## Querying the S3/Glue Datasource

1. Navigate to the Query Workbench
2. Choose `Data source Connections` in the top left

   ![Data source Connections](images/datasource-selector.png "Data source Connections")
3. In the drop-down below `Data source Connections`, select the S3/Glue datasource. It is named `mys3`.

   ![Data source Drop-down](images/datasource-drop-down.png "Data source Drop-down")
4. It may take some time to load the namespaces in the datasource. `mys3` only contains the namespace `default`.
5. If you like, you can browse the tables in the `default` namespace by clicking on `default`.

   ![Data source Browser](images/datasource-browser.png "Data source Browser")
6. Execute a Query

   ![Query Interface](images/query-workbench-query.png "Query Interface")
   1. Choose the query language by clicking on `SQL` or `PPL`
   2. Enter a query in the text box
   3. Click `Run` to execute the query
   4. The results are displayed in the bottom right part of the page
