
Convert CSV data to Parquet

The most common first step in data processing applications is to take data from some source and get it into a format that is suitable for reporting and other forms of analytics. In a database, you would load a flat file into the database and create indexes. In Spark, your first step is usually to clean and convert data from a text format into Parquet format. Parquet is an optimized binary format supporting efficient reads, making it ideal for reporting and analytics.
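At its core, the conversion this sample performs can be sketched in a few lines of Spark Scala. The paths and option values below are illustrative, not the sample's actual code:

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquetSketch {
  def main(args: Array[String]): Unit = {
    // On Data Flow, getOrCreate() attaches to the provided session;
    // locally it creates a new one.
    val spark = SparkSession.builder().appName("CSV to Parquet").getOrCreate()

    // Illustrative paths: replace <bucket> and <namespace> with your values.
    val inputPath  = "oci://<bucket>@<namespace>/input.csv"
    val outputPath = "oci://<bucket>@<namespace>/output_parquet"

    // Read the CSV with a header row, inferring column types.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(inputPath)

    // Write the same data back out in Parquet format.
    df.write.mode("overwrite").parquet(outputPath)

    spark.stop()
  }
}
```

Reading with `inferSchema` scans the data once to guess column types; for large or production datasets you would typically declare an explicit schema instead.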


Prerequisites

Before you begin:

  • Ensure your tenant is configured according to the instructions to set up an administrator.
  • Know your Object Storage namespace.
  • Know the OCID of a compartment where you want to load your data and create applications.
  • (Optional, but strongly recommended): Install Spark to test your code locally before deploying.

Instructions

  1. Upload a sample CSV file to object store.
  2. Customize src/main/scala/example/Example.scala with the OCI path to your CSV data. The format is oci://<bucket>@<namespace>/path
     2a. Don't know what your namespace is? Run oci os ns get
     2b. Don't have the OCI CLI installed? See the OCI CLI documentation to install it.
  3. Customize src/main/scala/example/Example.scala with the OCI path where you would like to save the output data.
  4. Compile with sbt to generate the JAR file csv_to_parquet_2.11-1.0.jar.
  5. Recommended: run the sample locally to test it.
  6. Upload the JAR file csv_to_parquet_2.11-1.0.jar to an object store bucket.
  7. Create a Scala Data Flow application pointing to the JAR file csv_to_parquet_2.11-1.0.jar.
     7a. Refer to Create Scala App.
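Steps 2 and 3 boil down to pointing two path strings at your buckets. A hypothetical example of the values (the variable names are illustrative, not taken from the sample):

```scala
// Hypothetical input and output path values for Example.scala.
// Replace <bucket> and <namespace> with your bucket name and namespace.
val inputPath  = "oci://<bucket>@<namespace>/sample_data.csv"
val outputPath = "oci://<bucket>@<namespace>/converted_parquet"
```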

To Compile

sbt package

To Test Locally

spark-submit --class example.Example target/scala-2.11/csv_to_parquet_2.11-1.0.jar

To Use the OCI CLI to Run the Scala Application

Create a bucket. Alternatively, you can reuse an existing bucket.

oci os object put --bucket-name <mybucket> --file target/scala-2.11/csv_to_parquet_2.11-1.0.jar
oci data-flow application create \
    --compartment-id <your_compartment> \
    --display-name "CSV to Parquet Scala" \
    --driver-shape VM.Standard2.1 \
    --executor-shape VM.Standard2.1 \
    --num-executors 1 \
    --spark-version 2.4.4 \
    --file-uri oci://<bucket>@<namespace>/csv_to_parquet_2.11-1.0.jar \
    --language Scala \
    --class-name example.Example

Make note of the Application ID produced.
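If you are scripting these steps, the application OCID can be captured directly using the CLI's JMESPath --query flag instead of copying it by hand (the flag values below mirror the create command above):

```shell
# Capture the new application's OCID into a variable.
# --query and --raw-output are standard OCI CLI flags for extracting a JSON field.
APP_OCID=$(oci data-flow application create \
    --compartment-id <your_compartment> \
    --display-name "CSV to Parquet Scala" \
    --driver-shape VM.Standard2.1 \
    --executor-shape VM.Standard2.1 \
    --num-executors 1 \
    --spark-version 2.4.4 \
    --file-uri oci://<bucket>@<namespace>/csv_to_parquet_2.11-1.0.jar \
    --language Scala \
    --class-name example.Example \
    --query 'data.id' --raw-output)
echo "$APP_OCID"
```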

oci data-flow run create \
    --compartment-id <your_compartment> \
    --application-id <application_ocid> \
    --display-name "CSV to Parquet Scala"