# Spark SQL Application

We have two applications: SQLJob and FlintJob.

SQLJob is designed for EMR Spark, executing SQL queries and storing the results in the OpenSearch index in the following format:
```
"stepId":"<emr-step-id>",
"applicationId":"<spark-application-id>",
"schema": "json blob",
"result": "json blob"
```

FlintJob is designed for EMR Serverless Spark, executing SQL queries and storing the results in the OpenSearch index in the following format:

```
"jobRunId":"<emrs-job-id>",
"applicationId":"<spark-application-id>",
"schema": "json blob",
"result": "json blob",
"dataSourceName":"<opensearch-data-source-name>"
```
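For illustration, here is a minimal Scala sketch of how a query result can be flattened into the `schema` and `result` JSON blobs above. This is not the actual job source: the object name and the local session are assumptions made to keep the example runnable, and the real jobs also attach the job and application identifiers before writing the document.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ResultFormatSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the real jobs run on EMR / EMR Serverless.
    val spark = SparkSession.builder()
      .appName("result-format-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for spark.sql(<sql-query>).
    val df: DataFrame = Seq(("A", 1), ("B", 2), ("C", 3)).toDF("Letter", "Number")

    // "result": one JSON string per row.
    val result: Array[String] = df.toJSON.collect()

    // "schema": one JSON string per column, naming the column and its type.
    val schema: Array[String] = df.schema.fields.map { f =>
      s"""{"column_name":"${f.name}","data_type":"${f.dataType.typeName}"}"""
    }

    result.foreach(println) // {"Letter":"A","Number":1} ...
    schema.foreach(println) // {"column_name":"Letter","data_type":"string"} ...

    spark.stop()
  }
}
```

These arrays correspond to the `result` and `schema` fields of the documents shown under Result Specifications below.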
## Prerequisites

+ Spark 3.3.1
+ Scala 2.12.15
+ flint-spark-integration

## Usage

To use these applications, you can run Spark with Flint extension:

SQLJob
```
./bin/spark-submit \
    --class org.opensearch.sql.SQLJob \
    --jars <flint-spark-connector-jar> \
    sql-job.jar \
    <spark-sql-query> \
    <opensearch-index> \
    <opensearch-host> \
    <opensearch-port> \
    <opensearch-scheme> \
    <opensearch-auth> \
    <opensearch-region> \
```

FlintJob
```
aws emr-serverless start-job-run \
    --region <region-name> \
    --application-id <application-id> \
    --execution-role-arn <execution-role> \
    --job-driver '{"sparkSubmit": {"entryPoint": "<flint-job-s3-path>", \
    "entryPointArguments":["'<sql-query>'", "<result-index>", "<data-source-name>"], \
    "sparkSubmitParameters":"--class org.opensearch.sql.FlintJob \
        --conf spark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.emr.AssumeRoleAWSCredentialsProvider \
        --conf spark.emr-serverless.driverEnv.ASSUME_ROLE_CREDENTIALS_ROLE_ARN=<role-to-access-s3-and-opensearch> \
        --conf spark.executorEnv.ASSUME_ROLE_CREDENTIALS_ROLE_ARN=<role-to-access-s3-and-opensearch> \
        --conf spark.hadoop.aws.catalog.credentials.provider.factory.class=com.amazonaws.glue.catalog.metastore.STSAssumeRoleSessionCredentialsProviderFactory \
        --conf spark.hive.metastore.glue.role.arn=<role-to-access-s3-and-opensearch> \
        --conf spark.jars=<path-to-AWSGlueDataCatalogHiveMetaStoreAuth-jar> \
        --conf spark.jars.packages=<flint-spark-integration-jar-name> \
        --conf spark.jars.repositories=<path-to-download_spark-integration-jar> \
        --conf spark.emr-serverless.driverEnv.JAVA_HOME=<java-home-in-emr-serverless-host> \
        --conf spark.executorEnv.JAVA_HOME=<java-home-in-emr-serverless-host> \
        --conf spark.datasource.flint.host=<opensearch-url> \
        --conf spark.datasource.flint.port=<opensearch-port> \
        --conf spark.datasource.flint.scheme=<http-or-https> \
        --conf spark.datasource.flint.auth=<auth-type> \
        --conf spark.datasource.flint.region=<region-name> \
        --conf spark.datasource.flint.customAWSCredentialsProvider=com.amazonaws.emr.AssumeRoleAWSCredentialsProvider \
        --conf spark.sql.extensions=org.opensearch.flint.spark.FlintSparkExtensions \
        --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}'
```
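For example, with hypothetical values (the table name is made up; the result index and data source name match the example under Result Specifications), the three positional arguments could look like:

```
"entryPointArguments":["'SELECT Letter, Number FROM myS3Glue.default.demo'", ".query_execution_result", "myS3Glue"], \
```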
## Result Specifications

The following example shows how the result is written to an OpenSearch index after query execution.

Let's assume the SQL query result is
```
+------+------+
|Letter|Number|
+------+------+
|A     |1     |
|B     |2     |
|C     |3     |
+------+------+
```
For SQLJob, the OpenSearch index document will look like
```json
{
  "_index" : ".query_execution_result",
  "_id" : "A2WOsYgBMUoqCqlDJHrn",
  "_score" : 1.0,
  "_source" : {
    "result" : [
      "{'Letter':'A','Number':1}",
      "{'Letter':'B','Number':2}",
      "{'Letter':'C','Number':3}"
    ],
    "schema" : [
      "{'column_name':'Letter','data_type':'string'}",
      "{'column_name':'Number','data_type':'integer'}"
    ],
    "stepId" : "s-JZSB1139WIVU",
    "applicationId" : "application_1687726870985_0003"
  }
}
```

For FlintJob, the OpenSearch index document will look like
```json
{
  "_index" : ".query_execution_result",
  "_id" : "A2WOsYgBMUoqCqlDJHrn",
  "_score" : 1.0,
  "_source" : {
    "result" : [
      "{'Letter':'A','Number':1}",
      "{'Letter':'B','Number':2}",
      "{'Letter':'C','Number':3}"
    ],
    "schema" : [
      "{'column_name':'Letter','data_type':'string'}",
      "{'column_name':'Number','data_type':'integer'}"
    ],
    "jobRunId" : "s-JZSB1139WIVU",
    "applicationId" : "application_1687726870985_0003",
    "dataSourceName" : "myS3Glue",
    "status" : "SUCCESS",
    "error" : ""
  }
}
```

## Build

To build and run these applications with Spark, you can run:

```
sbt clean sparkSqlApplicationCosmetic/publishM2
```

The jar file is located in the `spark-sql-application/target/scala-2.12` folder.

## Test

To run tests, you can use:

```
sbt test
```

To check code with scalastyle, you can run:

```
sbt scalastyle
```

To check test code with scalastyle, you can run:

```
sbt testScalastyle
```

## Code of Conduct

This project has adopted an [Open Source Code of Conduct](../CODE_OF_CONDUCT.md).