This repository was archived by the owner on May 23, 2023. It is now read-only.

Updated the instructions to use Glue Studio #11

Open · wants to merge 5 commits into base: master
184 changes: 95 additions & 89 deletions labs/01_ingestion_with_glue/README.md
@@ -8,12 +8,15 @@
- [Configure Permissions](#configure-permissions)
- [Creating a Policy for Amazon S3 Bucket (Console)](#creating-a-policy-for-amazon-s3-bucket-console)
- [Creating a Role for AWS Service Glue (Console)](#creating-a-role-for-aws-service-glue-console)
- [Creating a Development Endpoint and Notebook - Step 1](#creating-a-development-endpoint-and-notebook---step-1)
- [Create data catalog from S3 files](#create-data-catalog-from-s3-files)
- [Transform the data to Parquet format](#transform-the-data-to-parquet-format)
- [Adding a source from the catalog](#adding-a-source-from-the-catalog)
- [Adding transforms](#adding-transforms)
- [Storing the results](#storing-the-results)
- [Running the job](#running-the-job)
- [Monitoring the job](#monitoring-the-job)
- [Add a crawler for curated data](#add-a-crawler-for-curated-data)
- [Schema Validation](#schema-validation)
- [Creating a Development Endpoint and Notebook - Step 2](#creating-a-development-endpoint-and-notebook---step-2)

In this lab we will create a schema from your data that is optimized for analytics, and place the result in an S3 bucket-based data lake.

@@ -108,30 +111,6 @@ In this lab we will:

NOTE: “AWSGlueServiceRole” is an AWS Managed Policy that provides Glue with the permissions it needs to access S3 data. However, you still need to allow Glue access to your specific S3 bucket by attaching the “BYOD-S3Policy” policy created above.
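
If you prefer to attach the custom policy from code instead of the console, a minimal boto3 sketch is shown below (the account ID in the policy ARN is a placeholder; this assumes the role and policy already exist under the names used in this lab):

```python
import boto3

iam = boto3.client("iam")  # assumes AWS credentials/region are already configured

# Attach the bucket-specific policy created above to the Glue role.
# The ARN below is a placeholder: replace 123456789012 with your account ID.
iam.attach_role_policy(
    RoleName="glue-processor-role",
    PolicyArn="arn:aws:iam::123456789012:policy/BYOD-S3Policy")
```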

## Creating a Development Endpoint and Notebook - Step 1

> Development endpoint and notebook will be used in Lab 5 of this workshop. Since it takes a bit of time to create the resources, we are doing it now so that they will be ready when we need them.

In AWS Glue, you can create an environment — known as a development endpoint — that you can use to iteratively develop and test your extract, transform, and load (ETL) scripts.

You can then create a notebook that connects to the endpoint, and use your notebook to author and test your ETL script. When you're satisfied with the results of your development process, you can create an ETL job that runs your script. With this process, you can add functions and debug your scripts in an interactive manner.

It is also possible to connect your local IDE to this endpoint, which is explained here: [Tutorial: Set Up PyCharm Professional with a Development Endpoint](https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-pycharm.html)

How to create an endpoint and use it from a notebook:

Go to Glue in the console https://console.aws.amazon.com/glue/
1. On the left menu, click Dev endpoints and then **Add endpoint**.
2. Development endpoint name: `byod`
3. IAM role: **glue-processor-role**
4. Click **Next**
5. Select Skip networking information
6. Click **Next**
7. Click **Next** - no need to add an SSH public key for now
8. Click **Finish**

It will take a while to create the endpoint.

## Create data catalog from S3 files

We will be using AWS Glue Crawlers to infer the schema of the files and create the data catalog. Without a crawler you can still read data from S3 in a Glue job, but the job will not be able to determine the data types (string, int, etc.) for each column.
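
To see the difference in practice, here is a minimal sketch (it only runs inside a Glue job or notebook, and the database, table and S3 path below are placeholders matching the names used later in this lab) comparing a catalog read with a raw S3 read:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# With a crawler: read through the Data Catalog, so column types are already inferred
dyf_catalog = glueContext.create_dynamic_frame.from_catalog(
    database = "YOUR-DATABASE-NAME",
    table_name = "YOUR-TABLE-NAME")

# Without a crawler: read straight from S3; CSV columns typically arrive as plain strings
dyf_raw = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://YOUR-BUCKET-NAME/raw/TABLE-NAME/"]},
    format = "csv",
    format_options = {"withHeader": True})

dyf_catalog.printSchema()  # typed columns (int, double, ...)
dyf_raw.printSchema()      # everything as string
```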
@@ -161,70 +140,106 @@ In the following section, we will create one job per each file to transform the

We will place this data under the folder named "_curated_" in the data lake.

- In the Glue Console select the **Jobs** section in the left navigation panel;
- Click on the _Add job_ button;
- specify a name (preferably **TABLE-NAME-1-job**) in the name field, then select the _"glue-processor-role"_;
- select Type: **Spark**
- make sure Glue version 2 is selected: "Spark 2.4, Python 3 with improved job startup times (Glue Version 2.0)" (If you want to read more about version 2: [Glue version 2 announced](https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featuring-10x-faster-job-start-times-and-1-minute-minimum-billing-duration/))
- select the option "_A new script to be authored by you_";
- Provide a script name (preferably **TABLE-NAME-1-job-script.py**);
- Tick the checkbox for "_Job Metrics_" under **Monitoring Options**, and DO NOT hit **Next** yet;
- Under "Security configuration, script libraries, and job parameters (optional)", check that **Worker type** is "Standard" and **Number of workers** is "10". This determines the worker type and the number of processing units to be used for the job. Higher numbers result in faster processing times but may incur higher costs; choose them according to data size, data type, etc. (further info can be found in the [Glue documentation](https://docs.aws.amazon.com/glue/latest/dg/add-job.html)) - then hit **Next**
- click **Next**, then **Save job and edit script**. You will be redirected to the script editor.
- Paste the following code into the editor. **DON'T FORGET TO PUT IN YOUR INPUT AND OUTPUT FOLDER LOCATIONS.**
- In the Glue Console, select **AWS Glue Studio**
- On the AWS Glue Studio home page, choose **Create and manage jobs**

This step needs to be done for each file you have.
![create and manage jobs](./img/ingestion/aws-glue-studio-2.jpg)

```python
import sys
import datetime
import re
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)

## DON'T FORGET TO PUT IN YOUR INPUT AND OUTPUT LOCATIONS BELOW.
your_database_name = "YOUR-DATABASE-NAME"
your_table_name = "YOUR-TABLE-NAME"
output_location = "s3://YOUR-BUCKET-NAME/curated/TABLE-NAME"

job.init("byod-workshop" + str(datetime.datetime.now().timestamp()))

# load our data from the catalog that we created with a crawler
dynamicF = glueContext.create_dynamic_frame.from_catalog(
    database = your_database_name,
    table_name = your_table_name,
    transformation_ctx = "dynamicF")

# invalid characters in column names are replaced by _
df = dynamicF.toDF()
def canonical(x): return re.sub("[ ,;{}()\n\t=]+", '_', x.lower())
renamed_cols = [canonical(c) for c in df.columns]
df = df.toDF(*renamed_cols)

# write our dataframe in parquet format to an output s3 bucket
df.write.mode("overwrite").format("parquet").save(output_location)

job.commit()
```
- AWS Glue Studio supports different sources, including Amazon S3, Amazon RDS, Amazon Kinesis and Apache Kafka. For this transformation you will use a table from the AWS Glue Data Catalog as the data source and an S3 bucket as the destination.

- In the **Create Job** section, select **Source and target added to the graph**. Make sure **S3** is configured as both the **Source** and the **Target**, then click **Create**.

![create job](./img/ingestion/aws-glue-studio-3.png)

This takes you to the visual canvas for creating an AWS Glue job. You should already see the canvas prepopulated with a basic diagram.
- Change the **Job name** from **Untitled job** to the desired name (preferably **TABLE-NAME-1-job**)
![rename job](./img/ingestion/aws-glue-studio-4.png)

### Adding a source from the catalog
1. Select the **Data source - S3** bucket node.
2. On the **Data source properties - S3** tab, choose the relevant Database and table. Leave the partition predicate field empty.

![add source](./img/ingestion/aws-glue-studio-5.png)

### Adding transforms

A transform is the AWS Glue Studio component where the data is modified. You have the option of using the built-in transforms that are part of this service or providing custom code.

1. One **ApplyMapping** transform has automatically been added for you. Click it to modify it.
2. On the **transform** tab, change the data types for specific columns to the desired values. You can also choose to rename columns.
3. Drop the columns that you will not require downstream.

![rename columns](./img/ingestion/aws-glue-studio-6.png)
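
Under the hood, the ApplyMapping node corresponds roughly to an `ApplyMapping.apply` call in the script that Glue Studio generates. A sketch with made-up column names and types (yours will differ), reusing the `dyf_catalog` frame from the earlier sketch:

```python
from awsglue.transforms import ApplyMapping

# Each mapping tuple is (source column, source type, target column, target type).
# Columns you leave out of the list are dropped from the output.
mapped = ApplyMapping.apply(
    frame = dyf_catalog,
    mappings = [
        ("Customer ID", "string", "customer_id", "long"),      # rename and change type
        ("order date",  "string", "order_date",  "timestamp"),
        ("amount",      "string", "amount",      "double"),
    ])
```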

Now we will add a second, custom transform after the mapping, in which we replace invalid characters that you may have in your column headers. Spark doesn't accept certain characters in field names, including spaces, so it is better to fix this before we send the data downstream.
1. Click the first **ApplyMapping** transform node.
2. Click the **(+)** icon.

![](./img/ingestion/aws-glue-studio-7.png)

3. On the Node properties tab, for Name enter **Column Header Cleaner**.
4. For Node type, choose **Custom transform**

![](./img/ingestion/aws-glue-studio-77.png)

5. On the **Transform** tab, for **Code block**, change the function name from MyTransform to **ColumnHeaderCleaner**
6. Enter the following code under the function body:
```python
import re

# replace characters that Spark does not accept in field names with underscores
def canonical(x): return re.sub("[ ,;{}()\n\t=]+", '_', x.lower())

# select the first DynamicFrame from the collection and convert it to a Spark DataFrame
selected = dfc.select(list(dfc.keys())[0]).toDF()

# rename the columns and wrap the result back into a DynamicFrameCollection
renamed_cols = [canonical(c) for c in selected.columns]
cleaned_df = DynamicFrame.fromDF(selected.toDF(*renamed_cols), glueContext, "cleaned_df")
return DynamicFrameCollection({"cleaned_df": cleaned_df}, glueContext)
```
![](./img/ingestion/aws-glue-studio-8.png)
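
For reference, here is roughly how the complete custom transform might look once the body above is pasted into the code block; the `(glueContext, dfc) -> DynamicFrameCollection` signature is the one Glue Studio pre-fills, and the imports are shown in case they are not already present in your generated script:

```python
import re
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def ColumnHeaderCleaner(glueContext, dfc) -> DynamicFrameCollection:
    # replace characters that Spark does not accept in field names with underscores
    def canonical(x): return re.sub("[ ,;{}()\n\t=]+", '_', x.lower())

    # take the single DynamicFrame coming from the ApplyMapping node
    selected = dfc.select(list(dfc.keys())[0]).toDF()

    # rename the columns and wrap the result back into a DynamicFrameCollection
    renamed_cols = [canonical(c) for c in selected.columns]
    cleaned_df = DynamicFrame.fromDF(selected.toDF(*renamed_cols), glueContext, "cleaned_df")
    return DynamicFrameCollection({"cleaned_df": cleaned_df}, glueContext)
```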

After adding the custom transformation to the AWS Glue job, you want to store the transformed data in the S3 bucket. To do this, you need a **Select from collection** transform to read the output from the **Column Header Cleaner** node and send it to the destination.

7. Choose the **New node** node.
8. Leave the **Transform** tab with the default values.
9. On the **Node Properties** tab, change the name of the transform to **Select Aggregated Data**.
10. Leave everything else with the default values.
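
In the generated script, this node typically becomes a `SelectFromCollection.apply` call that picks the single frame out of the collection returned by the custom transform. A sketch (the variable name for the previous node's output is illustrative):

```python
from awsglue.transforms import SelectFromCollection

# ColumnHeaderCleaner returns a DynamicFrameCollection; pick its first (and only) frame
selected_data = SelectFromCollection.apply(
    dfc = column_header_cleaner_output,
    key = list(column_header_cleaner_output.keys())[0])
```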

### Storing the results
1. Select the **Data target - S3 bucket** node
2. Under **Node properties**, change the Node parent to be the **Select Aggregated Data** Transform
![](./img/ingestion/aws-glue-studio-99.png)

3. Under **Data target properties - S3**, select **Parquet** as the format and **GZIP** as the compression type. Select the curated location as the **S3 Target Location**.

Notice that we have a section in this script where we are replacing invalid characters that you may have in your column headers. Spark doesn't accept certain characters in field names, including spaces.
![](./img/ingestion/aws-glue-studio-9.png)
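
For comparison, the same Parquet-with-GZIP output can be produced with the Spark writer used in the script earlier in this lab; a sketch reusing the `selected_data` frame from the previous snippet (the output path is the placeholder used earlier):

```python
output_location = "s3://YOUR-BUCKET-NAME/curated/TABLE-NAME"  # placeholder curated location

# write the cleaned data as GZIP-compressed Parquet to the curated area
(selected_data.toDF()
    .write.mode("overwrite")
    .option("compression", "gzip")
    .format("parquet")
    .save(output_location))
```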

Click **Save** and **Run Job**
If you followed this guide closely, your final schematic should look similar to the one below:
![](./img/ingestion/aws-glue-studio-10.png)

![add a glue job](./img/ingestion/glue-job3.png)
### Running the job

Check the status of the job by selecting the job and going to the History tab in the lower panel. In order to continue, we need to wait until this job is done; this can take around 5 minutes (and up to 10 minutes to start), depending on the size of your dataset.
1. Under Job details, select the _"glue-processor-role"_ as the IAM Role
2. Select Type: **Spark**
3. Make sure Glue version 2 is selected: "Glue 2.0 - Supports spark 2.4, Scala 2, Python 3" (If you want to read more about version 2: [Glue version 2 announced](https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featuring-10x-faster-job-start-times-and-1-minute-minimum-billing-duration/))
4. Check that **G.1X** is selected as the **Worker type** and that the **Number of workers** is "10". This determines the worker type and the number of processing units to be used for the job. Higher numbers result in faster processing times but may incur higher costs; choose them according to data size, data type, etc. (further info can be found in the [Glue documentation](https://docs.aws.amazon.com/glue/latest/dg/add-job.html).)

![add a glue job](./img/ingestion/seejob.png)
![](./img/ingestion/aws-glue-studio-11.png)

To make sure the job transformed the data, go to S3; you should see a new sub-folder called curated with data in it.
5. Click **Save** and **Run Job**

Now, remember to repeat this last step for each file you had originally.
### Monitoring the job
AWS Glue Studio offers a job monitoring dashboard that provides comprehensive information about your jobs. You can get job statistics and see detailed information about the job and its status while it is running.

1. In the AWS Glue Studio navigation panel, choose **Monitoring**.
2. Choose the entry with the job name you configured above.
3. To get more details about the job run, choose **View run details**.

![](./img/ingestion/aws-glue-studio-13.jpg)

Wait until **Run Status** changes to **Succeeded**. This can take up to several minutes, depending on the size of your dataset.

![](./img/ingestion/aws-glue-studio-12.png)
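
You can also poll the run status from code instead of the dashboard; a minimal boto3 sketch (it assumes your AWS credentials are configured, that the most recent run is returned first, and that the job name below matches the one you created above):

```python
import time
import boto3

glue = boto3.client("glue")
JOB_NAME = "TABLE-NAME-1-job"  # placeholder: the job name you chose above

# poll the most recent run of the job until it reaches a terminal state
while True:
    runs = glue.get_job_runs(JobName=JOB_NAME, MaxResults=1)["JobRuns"]
    state = runs[0]["JobRunState"] if runs else "NO RUNS YET"
    print("Run status:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```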

**NOTE: Now, remember to repeat this job-creation step for each file you had originally.**

## Add a crawler for curated data

@@ -260,15 +275,6 @@ NOTE: If you have any "id" column as integer, please make sure type is set to "d

- Click Save.

## Creating a Development Endpoint and Notebook - Step 2

1. In the Glue console, go to Notebooks and click **Create notebook**
2. Notebook name: aws-glue-`byod`
3. Attach to development endpoint: choose the endpoint created a few steps back
4. Create a new IAM Role.
5. **Create notebook**


Now go to Lab 2: [Orchestration](../02_orchestration/README.md)


Binary file removed labs/01_ingestion_with_glue/img/ingestion/seejob.png