# Create data processing pipeline

## What are we building?


We'll be building a pipeline that consists of:

1. A Lambda function to take ride telemetry data from S3 and leverage a fan-out process to invoke multiple downstream Lambda processing functions (annotated item #1)
1. An SQS queue to trigger the downstream consumers from the fan-out Lambda function (annotated item #2)
1. A Lambda function consumer (annotated item #3) that will take items from the queue, look up the relevant weather station ID (based on lat/long proximity), and write the data record(s) out to a new location in S3

## Why are we building it?
If you recall, we're trying to build a system to predict if a unicorn will request service after a ride. To do that, we need to get our data in order. We need a pipeline that will take our unicorn ride data, join it with the relevant weather station data, and output to a location that can be used downstream by our data scientist from within the Amazon SageMaker notebook. This is the first stage of building our training datasets.

We recommend creating a file in Cloud9 where you can compile a few values. If you follow our detailed instructions, the file will be created for you in the home directory. If you want to attempt the steps on your own, we recommend creating the file in your home directory to match our convention. In the end, the file will have the following values:
1. Your bucket name
1. The URL for your SQS queue
1. The Amazon Resource Name (ARN) for your SQS queue

### Short Route: Deploy the pipeline for me :see_no_evil:

The purpose of this module is machine learning inference using serverless technologies. While data processing and ETL are important components, we recommend using the provided CloudFormation template to ensure you complete the section on time. If you are very comfortable with Lambda and SQS, try the alternative route below with detailed steps.

**Time to complete:** 15-20 minutes.

1. Navigate to your Cloud9 environment
1. Make sure you're in the correct directory first
   ```
   cd ~/environment/aws-serverless-workshops/MachineLearning/1_DataProcessing
   ```
1. Run the following command to create your resources:
   ```
   aws cloudformation create-stack \
     --stack-name wildrydes-ml-mod1 \
     --capabilities CAPABILITY_NAMED_IAM \
     --template-body file://cloudformation/99_complete.yml
   ```
1. Monitor the status of your stack creation (takes about 3 minutes to complete). **EITHER:**
   1. Monitor via [CloudFormation in the AWS Console](https://console.aws.amazon.com/cloudformation) **OR**
   1. Run the following command in Cloud9 until you get `CREATE_COMPLETE` in the output:
      ```
      aws cloudformation describe-stacks \
        --stack-name wildrydes-ml-mod1 \
        --query 'Stacks[0].StackStatus' \
        --output text
      ```
   **:heavy_exclamation_mark: DO NOT move past this point until you see CREATE_COMPLETE as the status for your CloudFormation stack**
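   If you prefer not to poll manually, the AWS CLI also has a built-in waiter that blocks until stack creation finishes (and exits with an error if it fails); a minimal alternative:
   ```
   aws cloudformation wait stack-create-complete --stack-name wildrydes-ml-mod1
   ```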
1. Set the autogenerated bucket name as an environment variable
   ```
   bucket=$(aws cloudformation describe-stacks --stack-name wildrydes-ml-mod1 --query "Stacks[0].Outputs[?OutputKey=='DataBucketName'].OutputValue" --output text)
   ```
1. Verify the variable is set
   ```
   echo $bucket
   ```
1. Add the bucket name to your scratchpad for future use
   ```
   echo "Bucket name:" $bucket >> ~/environment/scratchpad.txt
   ```
1. Run this command to upload the ride data
   ```
   aws s3 cp assets/ride_data.json s3://$bucket/raw/ride_data.json
   ```
1. Run this command to verify the file was uploaded (you should see the file name listed)
   ```
   aws s3 ls s3://$bucket/raw/
   ```

### Long Route: Build the pipeline yourself :white_check_mark:

If you would like to learn more about serverless ETL, have experience working with Lambda and SQS, or want to take your time regardless of the duration, then this route is a fit for you. This route will deploy the same components as described above. You will need to configure them to communicate with each other.

**Time to complete:** 30-60 minutes.

<details>
<summary><strong>Expand for detailed instructions</strong></summary><p>

### Step 1: Create an S3 Bucket
This is where your data will live before, during, and after formatting. It's also where your machine learning model will output to.

<details>
<summary>Create an S3 bucket with a globally unique name and save the name to a scratchpad.txt file that we will use later. (Expand for detailed instructions)</summary><p>

1. Navigate to your Cloud9 environment
1. Run this command to set your desired bucket name as an environment variable (replacing YOUR_BUCKET_NAME with your desired bucket name)
   ```
   bucket="YOUR_BUCKET_NAME"
   ```
1. Run this command to create your bucket
   ```
   aws s3 mb s3://$bucket
   ```
1. If the above command is successful, run the following command. If you get an error, your bucket name is likely already taken. Repeat these steps with a new name.
   ```
   echo "Bucket name:" $bucket >> ~/environment/scratchpad.txt
   ```
1. Run this command to verify your bucket was created successfully
   ```
   aws s3 ls s3://$bucket
   # If you don't see an error, you're good.
   ```
</p></details>

### Step 2: Create an SQS queue for fan-out
Our vehicle fleet generates ride data in a single, massive .json file, [ride_data.json](assets/ride_data.json). Feel free to check it out. It includes the raw ride telemetry. We need to split the file into individual JSON entries, one for each ride event.
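
If you want a quick look at the raw file before building anything, a couple of standard shell commands are enough (this only inspects your local copy):
```
ls -lh assets/ride_data.json             # how large is the raw telemetry file?
head -c 300 assets/ride_data.json; echo  # peek at the first few hundred bytes
```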

To take advantage of the parallelism available with Lambda, we are going to fan-out each entry to a queue that will be picked up by individual Lambda functions.

<details>
<summary>Create an SQS queue and name it `IngestedRawDataFanOutQueue`. Save the queue URL and ARN to a `scratchpad.txt` file that we will use later. (Expand for detailed instructions)</summary><p>

1. Navigate to your Cloud9 environment
1. Make sure you're in the correct directory first
   ```
   cd ~/environment/aws-serverless-workshops/MachineLearning/1_DataProcessing
   ```
1. Run the following command to create your queue:
   ```
   aws sqs create-queue --queue-name IngestedRawDataFanOutQueue
   ```
1. Set the queue URL as an environment variable
   ```
   queue_url=$(aws sqs get-queue-url --queue-name IngestedRawDataFanOutQueue --output text)
   ```
1. Verify the queue URL is set and put the value in your scratchpad for future use
   ```
   echo $queue_url && echo "Queue URL: " $queue_url >> ~/environment/scratchpad.txt
   ```
1. Get the queue ARN and set it as an environment variable
   ```
   queue_arn=$(aws sqs get-queue-attributes --queue-url $queue_url --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)
   ```
1. Verify the queue ARN is set and put the value in your scratchpad for future use
   ```
   echo $queue_arn && echo "Queue ARN: " $queue_arn >> ~/environment/scratchpad.txt
   ```
</p></details>
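
Optionally, you can sanity-check the new queue with a quick send/receive round trip. The message body here is just a throwaway test payload, and the purge at the end clears it out so it isn't picked up once the pipeline is wired up (a first `receive-message` call can come back empty; that's normal for SQS):
```
aws sqs send-message --queue-url $queue_url --message-body '{"test": "ping"}'
aws sqs receive-message --queue-url $queue_url
aws sqs purge-queue --queue-url $queue_url
```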


### Step 3: Create the remaining infrastructure

<details>
<summary>Create a CloudFormation stack from `cloudformation/1_lambda_functions.yml` named `wildrydes-ml-mod1`. (Expand for detailed instructions)</summary><p>

1. Navigate to your Cloud9 environment
1. Run the following command to create your infrastructure
   ```
   # This command should be run from /home/ec2-user/environment/aws-serverless-workshops/MachineLearning/1_DataProcessing in your Cloud9 environment
   # Run `pwd` to see your current directory

   aws cloudformation create-stack \
     --stack-name wildrydes-ml-mod1 \
     --parameters ParameterKey=DataBucket,ParameterValue=$bucket \
       ParameterKey=IngestedRawDataFanOutQueueArn,ParameterValue=$queue_arn \
     --capabilities CAPABILITY_NAMED_IAM \
     --template-body file://cloudformation/1_lambda_functions.yml
   ```
1. Monitor the status of your stack creation. **EITHER:**
   1. Go to [CloudFormation in the AWS Console](https://console.aws.amazon.com/cloudformation) **OR**
   1. Run the following command in Cloud9 until you get `CREATE_COMPLETE` in the output:
      ```
      # Run this command to verify the stack was successfully created. You should expect to see "CREATE_COMPLETE".
      # If you see "CREATE_IN_PROGRESS", your stack is still being created. Wait and re-run the command.
      # If you see "ROLLBACK_COMPLETE", pause and see what went wrong.
      aws cloudformation describe-stacks \
        --stack-name wildrydes-ml-mod1 \
        --query "Stacks[0].StackStatus"
      ```
</p></details><br>

**:heavy_exclamation_mark: DO NOT move past this point until you see CREATE_COMPLETE as the status for your CloudFormation stack**

After CloudFormation is done creating your infrastructure, you will have:
* Lambda function skeletons
* Dead Letter Queues (DLQ)
* IAM permissions
* CloudWatch dashboard

While these are necessary components of our data processing pipeline, they're not the focus of this part of the lab. This is why we're creating them in a CloudFormation template for you.

### Step 4: Wire up the Lambda functions
The previous step gave you the foundation for the Lambda functions that will be triggered either by S3 events or by our SQS queue. Now, you need to wire up the Lambda functions to the appropriate event sources and set some environment variables. We're going to use values from scratchpad.txt, so have that handy.

Expand each substep for detailed instructions, if needed.

<details>
<summary>1. Update the <code>OUTPUT_QUEUE</code> environment variable in <code>IngestUnicornRawDataFunction</code>. Set the value to your Queue URL (in scratchpad.txt).</summary><p>

1. Open the [Lambda console](https://console.aws.amazon.com/lambda)
1. Open the function containing `IngestUnicornRawDataFunction` in the name
1. Scroll down and populate the `OUTPUT_QUEUE` key with the Queue URL value from your scratchpad
1. Click **Save**
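
If you'd rather stay in the terminal, the same change can be made with the AWS CLI. The sketch below looks up the generated function name and sets its environment; note that `update-function-configuration` replaces the function's entire environment block, so this assumes `OUTPUT_QUEUE` is the only variable the function needs:
```
fn=$(aws lambda list-functions \
  --query "Functions[?contains(FunctionName, 'IngestUnicornRawDataFunction')].FunctionName" \
  --output text)
aws lambda update-function-configuration \
  --function-name $fn \
  --environment "Variables={OUTPUT_QUEUE=$queue_url}"
```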
</p></details>

<details>
<summary>2. Add an S3 trigger to <code>IngestUnicornRawDataFunction</code>. Trigger off your S3 bucket and `raw/` prefix.</summary><p>

1. Scroll up and click **Add trigger** in the Designer view
1. Select **S3**
1. Choose the data bucket you created
1. For the prefix, type `raw/`
1. Click **Add**

If the trigger won't save, make sure the S3 bucket does not have an identical active event ([Bucket](https://console.aws.amazon.com/s3) > Properties > Events).
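
If the console flow gives you trouble, a CLI sketch of the same wiring is below. It grants S3 permission to invoke the function, then attaches a notification for new objects under `raw/`. The statement id `s3-raw-trigger` is just an arbitrary label, and be aware that `put-bucket-notification-configuration` overwrites any existing notification configuration on the bucket:
```
fn_arn=$(aws lambda list-functions \
  --query "Functions[?contains(FunctionName, 'IngestUnicornRawDataFunction')].FunctionArn" \
  --output text)
# Allow S3 to invoke the ingest function
aws lambda add-permission \
  --function-name $fn_arn \
  --statement-id s3-raw-trigger \
  --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::$bucket
# Notify the function when objects are created under the raw/ prefix
aws s3api put-bucket-notification-configuration \
  --bucket $bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "LambdaFunctionArn": "'$fn_arn'",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "raw/"}]}}
    }]
  }'
```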
</p></details>

<details>
<summary>3. Update the <code>OUTPUT_BUCKET</code> environment variable in <code>TransformAndMapDataFunction</code>. Set the value to your bucket name.</summary><p>

1. Open the [Lambda console](https://console.aws.amazon.com/lambda)
1. Open the function containing `TransformAndMapDataFunction` in the name
1. Scroll down and populate the `OUTPUT_BUCKET` key with the Bucket Name value from your scratchpad. Keep in mind, just provide the name of the data bucket you created earlier; it should not be fully qualified.
1. Click **Save**
</p></details>

<details>
<summary>4. Add an SQS trigger to <code>TransformAndMapDataFunction</code>. Trigger off your <code>IngestedRawDataFanOutQueue</code> queue.</summary><p>

1. Scroll up and click **Add trigger** in the Designer view
2. Select **SQS**
3. Choose the `IngestedRawDataFanOutQueue` queue you created
4. Click **Add**
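
If you prefer the CLI, an SQS trigger is just a Lambda event source mapping; a minimal equivalent (run this instead of the console steps above, not in addition to them) would be:
```
fn=$(aws lambda list-functions \
  --query "Functions[?contains(FunctionName, 'TransformAndMapDataFunction')].FunctionName" \
  --output text)
aws lambda create-event-source-mapping \
  --function-name $fn \
  --event-source-arn $queue_arn
```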
</p></details><br>

Let's recap what we created:
* Serverless data processing pipeline:
    1. A Lambda function that reads a large JSON file from S3 and places a message in a queue for each ride
    1. A queue that buffers messages for each ride
    1. A Lambda function that picks up messages in the queue and matches the nearest weather station
        * Review the code for `TransformAndMapDataFunction`; the function does a lookup for the nearest weather station
* Preconfigured IAM role for the Lambda functions scoped to the appropriate services
* We also have a [CloudWatch dashboard](https://console.aws.amazon.com/cloudwatch/home?#dashboards:name=Wild_Rydes_Machine_Learning) to monitor progress!

### Step 5: Test your pipeline
It's time to upload our ride telemetry data into our pipeline.

<details>
<summary>Upload <code>assets/ride_data.json</code> into <code>YOUR_DATA_BUCKET/raw/</code> (Expand for detailed instructions)</summary><p>

1. In your Cloud9 terminal, run the following code:
   ```
   # Run this command to upload the ride data
   aws s3 cp assets/ride_data.json s3://$bucket/raw/ride_data.json

   # Run this command to verify the file was uploaded (you should see the file name listed)
   aws s3 ls s3://$bucket/raw/
   ```
</p></details>

</p></details>

### Your fan-out is in progress!

It will take ~8 minutes to process all ~20k entries. Monitor the progress using:
* [CloudWatch dashboard](https://console.aws.amazon.com/cloudwatch/home?#dashboards:)
  * This shows you the number of invocations for each Lambda function on the left and some SQS message metrics on the right.
  * Refresh the dashboard to see how the plotted invocation count changes. When you see the invocation count settle back to zero, that indicates all of your data has been processed. Until then, feel free to explore the graphs and options.
* [SQS console](https://console.aws.amazon.com/sqs)
  * This shows the number of messages flowing through `IngestedRawDataFanOutQueue`. There are also dead letter queues set up in case things go wrong.
* Run `aws s3 ls s3://$bucket/processed/ | wc -l` in your Cloud9 terminal (a simple polling loop is sketched after this list)
  * This provides a count of the number of entries in your processed folder as the pipeline progresses.
  * When complete, you should have 19,995 records.
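
If you'd like that processed-object count to refresh on its own, a simple polling loop works (press Ctrl+C to stop it):
```
while true; do
  aws s3 ls s3://$bucket/processed/ | wc -l
  sleep 30
done
```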

## Next step:

We're ready to proceed with building and training a [machine learning model](../2_ModelBuilding).