#!/usr/bin/env python3
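
"""
Create a table and index in Oracle NoSQL Database Cloud Service from a Spark
job that runs either locally or in OCI Data Flow, then print the table's
metadata and usage information.

Usage (all arguments are optional; defaults are hard coded in main()):
    spark-submit <this script> [compartment_ocid] [endpoint] [table_name] [index_name]
"""
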
import os
import sys

from borneo import (GetIndexesRequest, GetTableRequest, ListTablesRequest,
                    NoSQLHandle, NoSQLHandleConfig, TableLimits, TableRequest,
                    TableUsageRequest)
from borneo.iam import SignatureProvider
from pyspark import SparkConf
from pyspark.sql import SparkSession


def main():
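    """Create a table and index, then print the table's metadata and usage."""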
    # You can hard code your own values here if you want.
    COMPARTMENT_ID = "ocid1.compartment.oc1..aaaaaaaati55ggp45kgnterqwveayuyioyhz7hw7f46umo277mn5vecpny6q"
    ENDPOINT = "us-ashburn-1"
    TABLE_NAME = "pysparktable"
    INDEX_NAME = "pythonindex"
    try:
        COMPARTMENT_ID = sys.argv[1]
        ENDPOINT = sys.argv[2]
        TABLE_NAME = sys.argv[3]
        INDEX_NAME = sys.argv[4]
    except IndexError:
        # Keep the hard-coded default for any argument that was not supplied.
        pass

    # Set up Spark.
    spark_session = get_dataflow_spark_session()

    # Get our IAM signer.
    token_path = get_delegation_token_path(spark_session)
    signer = get_signer(token_path)

    # The handle to our table.
    provider = SignatureProvider(provider=signer)

    # The provider wants the region in identifier form (e.g. US_ASHBURN_1),
    # so derive it from the endpoint (e.g. us-ashburn-1).
    provider.region = ENDPOINT.upper().replace("-", "_")

    config = NoSQLHandleConfig(ENDPOINT, provider).set_default_compartment(
        COMPARTMENT_ID
    )

    handle = NoSQLHandle(config)
    try:
        # List any existing tables for this tenant.
        print("Listing tables")
        ltr = ListTablesRequest()
        lr_result = handle.list_tables(ltr)
        print("Existing tables: " + str(lr_result))

        # Create a table.
        statement = (
            "Create table if not exists "
            + TABLE_NAME
            + "(id integer, sid integer, name string, "
            + "primary key(shard(sid), id))"
        )
        print("Creating table: " + statement)
        request = (
            TableRequest()
            .set_statement(statement)
            # 30 read units, 10 write units, 1 GB of storage.
            .set_table_limits(TableLimits(30, 10, 1))
        )
        # Wait up to 40 seconds for the DDL to complete, polling every 3 seconds.
        handle.do_table_request(request, 40000, 3000)
        print("After create table")

        # Create an index.
        statement = (
            "Create index if not exists " + INDEX_NAME + " on " + TABLE_NAME + "(name)"
        )
        print("Creating index: " + statement)
        request = TableRequest().set_statement(statement)
        handle.do_table_request(request, 40000, 3000)
        print("After create index")

        # Get the table.
        request = GetTableRequest().set_table_name(TABLE_NAME)
        result = handle.get_table(request)
        print("After get table: " + str(result))

        # Get the indexes.
        request = GetIndexesRequest().set_table_name(TABLE_NAME)
        result = handle.get_indexes(request)
        print("The indexes for: " + TABLE_NAME)
        for idx in result.get_indexes():
            print("\t" + str(idx))

        # Get the table usage information.
        request = TableUsageRequest().set_table_name(TABLE_NAME)
        result = handle.get_table_usage(request)
        print("The table usage information for: " + TABLE_NAME)
        for record in result.get_usage_records():
            print("\t" + str(record))
    finally:
        # Release the handle's resources.
        handle.close()


def get_dataflow_spark_session(
    app_name="DataFlow", file_location=None, profile_name=None, spark_config=None
):
    """
    Get a Spark session in a way that supports running locally or in Data Flow.
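
    Example: spark = get_dataflow_spark_session()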
    """
    if in_dataflow():
        spark_builder = SparkSession.builder.appName(app_name)
    else:
        # Import OCI.
        try:
            import oci
        except ImportError:
            raise Exception(
                "You need to install the OCI python library to test locally"
            )

        # Use defaults for anything unset.
        if file_location is None:
            file_location = oci.config.DEFAULT_LOCATION
        if profile_name is None:
            profile_name = oci.config.DEFAULT_PROFILE

        # Load the config file.
        try:
            oci_config = oci.config.from_file(
                file_location=file_location, profile_name=profile_name
            )
        except Exception as e:
            print("You need to set up your OCI config properly to run locally")
            raise e

        # Configure the Object Storage connector with our API key.
        conf = SparkConf()
        conf.set("fs.oci.client.auth.tenantId", oci_config["tenancy"])
        conf.set("fs.oci.client.auth.userId", oci_config["user"])
        conf.set("fs.oci.client.auth.fingerprint", oci_config["fingerprint"])
        conf.set("fs.oci.client.auth.pemfilepath", oci_config["key_file"])
        conf.set(
            "fs.oci.client.hostname",
            "https://objectstorage.{0}.oraclecloud.com".format(oci_config["region"]),
        )
        spark_builder = SparkSession.builder.appName(app_name).config(conf=conf)

    # Add in any extra configuration.
    for key, val in (spark_config or {}).items():
        spark_builder.config(key, val)

    # Create the Spark session.
    return spark_builder.getOrCreate()


def get_signer(token_path, file_location=None, profile_name=None):
    """
    Automatically get a local or delegation token signer.

    Example: get_signer(token_path)
    """
    import oci

    if not in_dataflow():
        # We are running locally, use our API key.
        if file_location is None:
            file_location = oci.config.DEFAULT_LOCATION
        if profile_name is None:
            profile_name = oci.config.DEFAULT_PROFILE
        config = oci.config.from_file(
            file_location=file_location, profile_name=profile_name
        )
        signer = oci.signer.Signer(
            tenancy=config["tenancy"],
            user=config["user"],
            fingerprint=config["fingerprint"],
            private_key_file_location=config["key_file"],
            # The key may not be encrypted, so the pass phrase can be absent.
            pass_phrase=config.get("pass_phrase"),
        )
    else:
        # We are running in Data Flow, use our delegation token.
        with open(token_path) as fd:
            delegation_token = fd.read()
        signer = oci.auth.signers.InstancePrincipalsDelegationTokenSigner(
            delegation_token=delegation_token
        )
    return signer


def in_dataflow():
    """
    Determine if we are running in OCI Data Flow by checking the environment.
    """
    return os.environ.get("HOME") == "/home/dataflow"


def get_delegation_token_path(spark):
    """
    Get the delegation token path when we're running in Data Flow.
    """
    if not in_dataflow():
        return None
    token_key = "spark.hadoop.fs.oci.client.auth.delegationTokenPath"
    token_path = spark.sparkContext.getConf().get(token_key)
    if not token_path:
        raise Exception(f"{token_key} is not set")
    return token_path


def get_temporary_directory():
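    """
    Return a writable scratch directory, in Data Flow or locally.
    """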
    if in_dataflow():
        return "/opt/spark/work-dir/"
    else:
        import tempfile

        return tempfile.gettempdir()


if __name__ == "__main__":
    main()