
Commit 2e852e8

jmcwhirter authored and angelarw committed

New module for machine learning (#256)

new module on Serverless inference with AWS Lambda

1 parent 0d72965 commit 2e852e8

File tree

26 files changed

+202405
-0
lines changed

MachineLearning/0_Setup/README.md

+64
# Set up your development environment

**Time to complete:** 5-10 minutes.

## What are we building?

We are going to use [AWS Cloud9](https://aws.amazon.com/cloud9/) as our cloud-based integrated development environment. It will get you bootstrapped with the right tools and access on Day 1.

_If you already have a Cloud9 environment, feel free to use that._

### Step 1: Create a Cloud9 environment

<details>
<summary><strong>Expand if you want detailed directions</strong></summary><p>

Create your Cloud9 instance by following these steps:

1. Navigate to AWS Cloud9 [in the console](https://console.aws.amazon.com/cloud9)
1. Click **Create environment**
1. Provide a name: **WildRydesIDE**
1. Click **Next step**
1. Leave all defaults
1. Click **Next step**
1. Click **Create environment**

</p></details>

### Step 2: Wait for your environment to be ready

Your AWS Cloud9 environment is being created and your screen will look like this:

![new_tab](assets/cloud9_wait.png)

After a minute or so, your environment will be ready and you can continue.

### Step 3: Validate your environment has credentials

1. Find the "Welcome" tab and click the plus icon next to it
1. Select **New Terminal**
1. Run a command to get the caller identity: `aws sts get-caller-identity`
   * *This command will let you know who you are (account number, user ID, ARN)*

*Hint: New editors and terminals can be created by clicking the green "+" icon in a circle*

![new_tab](assets/new_tab.png)
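
The exact values will differ per account; the response shape looks roughly like this (all identifiers below are placeholders, not real values):

```
$ aws sts get-caller-identity
{
    "UserId": "AROAEXAMPLEID123:i-0example",
    "Account": "123456789012",
    "Arn": "arn:aws:sts::123456789012:assumed-role/example-cloud9-role/i-0example"
}
```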

### Step 4: Clone this repository

Let's get our code and start working. Inside the terminal:

1. Run the following command to get our code:
    ```
    git clone https://github.com/jmcwhirter/aws-serverless-workshops/
    ```
1. Navigate to our workshop:
    ```
    cd aws-serverless-workshops/MachineLearning/
    ```

At this point we have built our cloud-based development environment, verified it is configured with the right credentials, and copied down some source code from a code repository.

## Next step:

We're ready to proceed with building the [data processing pipeline](../1_DataProcessing).
MachineLearning/1_DataProcessing/README.md

+266

# Create data processing pipeline

## What are we building?

![Architecture diagram](assets/WildRydesML_1.png)

We'll be building a pipeline that consists of:

1. A Lambda function to take ride telemetry data from S3 and leverage a fan-out process to invoke multiple downstream Lambda processing functions (annotated item #1)
1. An SQS queue to trigger the downstream consumers from the fan-out Lambda function (annotated item #2)
1. A Lambda function consumer (annotated item #3) that will take items from the queue, look up the relevant weather station ID (based on lat/long proximity), and write the data record(s) out to a new location in S3

## Why are we building it?

If you recall, we're trying to build a system to predict if a unicorn will request service after a ride. To do that, we need to get our data in order. We need a pipeline that will take our unicorn ride data, join it with the relevant weather station data, and output to a location that can be used downstream by our data scientist from within the Amazon SageMaker notebook. These are the beginning stages of building our training datasets.

We recommend creating a file in Cloud9 where you can compile a few values. If you follow our detailed instructions, the file will be created for you in the home directory. If you want to attempt the steps on your own, we recommend creating the file in your home directory to match our convention. In the end, the file will have the following values (an illustrative example appears below):
1. Your bucket name
1. The URL for your SQS queue
1. The Amazon Resource Name (ARN) for your SQS queue
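
For example, after the steps below your `~/environment/scratchpad.txt` might look like this (the bucket name, region, and account ID here are made-up placeholders):

```
Bucket name: wildrydes-ml-data-example
Queue URL:  https://sqs.us-east-1.amazonaws.com/123456789012/IngestedRawDataFanOutQueue
Queue ARN:  arn:aws:sqs:us-east-1:123456789012:IngestedRawDataFanOutQueue
```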

### Short Route: Deploy the pipeline for me :see_no_evil:

The purpose of this module is machine learning inference using serverless technologies. While data processing and ETL are important components, we recommend using the provided CloudFormation template to ensure you complete the section on time. If you are very comfortable with Lambda and SQS, try the alternative route below with detailed steps.

**Time to complete:** 15-20 minutes.

1. Navigate to your Cloud9 environment
1. Make sure you're in the correct directory first
    ```
    cd ~/environment/aws-serverless-workshops/MachineLearning/1_DataProcessing
    ```
1. Run the following command to create your resources:
    ```
    aws cloudformation create-stack \
      --stack-name wildrydes-ml-mod1 \
      --capabilities CAPABILITY_NAMED_IAM \
      --template-body file://cloudformation/99_complete.yml
    ```
1. Monitor the status of your stack creation (takes about 3 minutes to complete). **EITHER:**
    1. Monitor via [CloudFormation in the AWS Console](https://console.aws.amazon.com/cloudformation) **OR**
    1. Run the following command in Cloud9 until you get `CREATE_COMPLETE` in the output:
    ```
    aws cloudformation describe-stacks \
      --stack-name wildrydes-ml-mod1 \
      --query 'Stacks[0].StackStatus' \
      --output text
    ```
    **:heavy_exclamation_mark: DO NOT move past this point until you see CREATE_COMPLETE as the status for your CloudFormation stack**
1. Set the autogenerated bucket name as an environment variable
    ```
    bucket=$(aws cloudformation describe-stacks --stack-name wildrydes-ml-mod1 --query "Stacks[0].Outputs[?OutputKey=='DataBucketName'].OutputValue" --output text)
    ```
1. Verify the variable is set
    ```
    echo $bucket
    ```
1. Add the bucket name to your scratchpad for future use
    ```
    echo "Bucket name:" $bucket >> ~/environment/scratchpad.txt
    ```
1. Run this command to upload the ride data
    ```
    aws s3 cp assets/ride_data.json s3://$bucket/raw/ride_data.json
    ```
1. Run this command to verify the file was uploaded (you should see the file name listed)
    ```
    aws s3 ls s3://$bucket/raw/
    ```

### Long Route: Build the pipeline yourself :white_check_mark:

If you would like to learn more about serverless ETL, have experience working with Lambda and SQS, or want to take your time regardless of the duration, then this route is a fit for you. This route will deploy the same components as described above. You will need to configure them to communicate with each other.

**Time to complete:** 30-60 minutes.

<details>
<summary><strong>Expand for detailed instructions</strong></summary><p>

### Step 1: Create an S3 Bucket

This is where your data will live before, during, and after formatting. It's also where your machine learning model will output to.

<details>
<summary>Create an S3 bucket with a globally unique name and save the name to a scratchpad.txt file that we will use later. (Expand for detailed instructions)</summary><p>

1. Navigate to your Cloud9 environment
1. Run this command to set your desired bucket name as an environment variable (replacing YOUR_BUCKET_NAME with your desired bucket name)
    ```
    bucket="YOUR_BUCKET_NAME"
    ```
1. Run this command to create your bucket
    ```
    aws s3 mb s3://$bucket
    ```
1. If the above command is successful, run the following command. If you get an error, your bucket name is likely already taken. Repeat these steps with a new name.
    ```
    echo "Bucket name:" $bucket >> ~/environment/scratchpad.txt
    ```
1. Run this command to verify your bucket was created successfully
    ```
    aws s3 ls s3://$bucket
    # If you don't see an error, you're good.
    ```
</p></details>

### Step 2: Create an SQS queue for fan-out

Our vehicle fleet generates ride data in a single, massive .json file, [ride_data.json](assets/ride_data.json). Feel free to check it out. It includes the raw ride telemetry. We need to split the file into individual JSON entries, one for each ride event.

To take advantage of the parallelism available with Lambda, we are going to fan out each entry to a queue, from which individual Lambda functions will pick them up.
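
To make the fan-out concrete, here is a rough shell sketch of what that step does, assuming `ride_data.json` is a JSON array, `jq` is available, and `$bucket` and `$queue_url` are set as in the steps below. In the workshop this logic runs inside a Lambda function, not in your terminal; this is only an illustration:

```
# Stream the raw file from S3, split it into one compact JSON object per
# ride entry, and enqueue each entry as its own SQS message.
aws s3 cp s3://$bucket/raw/ride_data.json - \
  | jq -c '.[]' \
  | while read -r entry; do
      aws sqs send-message --queue-url "$queue_url" --message-body "$entry"
    done
```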
<details>
<summary>Create an SQS queue and name it <code>IngestedRawDataFanOutQueue</code>. Save the queue URL and ARN to a <code>scratchpad.txt</code> file that we will use later. (Expand for detailed instructions)</summary><p>

1. Navigate to your Cloud9 environment
1. Make sure you're in the correct directory first
    ```
    cd ~/environment/aws-serverless-workshops/MachineLearning/1_DataProcessing
    ```
1. Run the following command to create your queue:
    ```
    aws sqs create-queue --queue-name IngestedRawDataFanOutQueue
    ```
1. Set the queue URL as an environment variable
    ```
    queue_url=$(aws sqs get-queue-url --queue-name IngestedRawDataFanOutQueue --output text)
    ```
1. Verify the queue URL is set and put the value in your scratchpad for future use
    ```
    echo $queue_url && echo "Queue URL: " $queue_url >> ~/environment/scratchpad.txt
    ```
1. Get the queue ARN and set it as an environment variable
    ```
    queue_arn=$(aws sqs get-queue-attributes --queue-url $queue_url --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)
    ```
1. Verify the queue ARN is set and put the value in your scratchpad for future use
    ```
    echo $queue_arn && echo "Queue ARN: " $queue_arn >> ~/environment/scratchpad.txt
    ```
</p></details>

### Step 3: Create the remaining infrastructure

<details>
<summary>Create a CloudFormation stack from <code>cloudformation/1_lambda_functions.yml</code> named <code>wildrydes-ml-mod1</code>. (Expand for detailed instructions)</summary><p>

1. Navigate to your Cloud9 environment
1. Run the following command to create your infrastructure
    ```
    # This command should be run from /home/ec2-user/environment/aws-serverless-workshops/MachineLearning/1_DataProcessing in your Cloud9 environment
    # run `pwd` to see your current directory

    aws cloudformation create-stack \
      --stack-name wildrydes-ml-mod1 \
      --parameters ParameterKey=DataBucket,ParameterValue=$bucket \
        ParameterKey=IngestedRawDataFanOutQueueArn,ParameterValue=$queue_arn \
      --capabilities CAPABILITY_NAMED_IAM \
      --template-body file://cloudformation/1_lambda_functions.yml
    ```
1. Monitor the status of your stack creation. **EITHER:**
    1. Go to [CloudFormation in the AWS Console](https://console.aws.amazon.com/cloudformation) **OR**
    1. Run the following command in Cloud9 until you get `CREATE_COMPLETE` in the output:
    ```
    # Run this command to verify the stack was successfully created. You should expect to see "CREATE_COMPLETE".
    # If you see "CREATE_IN_PROGRESS", your stack is still being created. Wait and re-run the command.
    # If you see "ROLLBACK_COMPLETE", pause and see what went wrong.
    aws cloudformation describe-stacks \
      --stack-name wildrydes-ml-mod1 \
      --query "Stacks[0].StackStatus"
    ```
</p></details><br>

**:heavy_exclamation_mark: DO NOT move past this point until you see CREATE_COMPLETE as the status for your CloudFormation stack**

After CloudFormation is done creating your infrastructure, you will have:
* Lambda function skeletons
* Dead Letter Queues (DLQ)
* IAM permissions
* CloudWatch dashboard

While these are necessary components of our data processing pipeline, they're not the focus of this part of the lab. This is why we're creating them in a CloudFormation template for you.

### Step 4: Wire up the Lambda functions

The previous step gave you the foundation for the Lambda functions that will be triggered either by S3 events or by our SQS queue. Now you need to wire up the Lambda functions to the appropriate event sources and set some environment variables. We're going to use values from scratchpad.txt, so have that handy.

Expand each substep for detailed instructions, if needed.

<details>
<summary>1. Update the <code>OUTPUT_QUEUE</code> environment variable in <code>IngestUnicornRawDataFunction</code>. Set the value to your Queue URL (in scratchpad.txt).</summary><p>

1. Open the [Lambda console](https://console.aws.amazon.com/lambda)
1. Open the function containing `IngestUnicornRawDataFunction` in the name
1. Scroll down and populate the `OUTPUT_QUEUE` key with the Queue URL value from your scratchpad
1. Click **Save**
</p></details>
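
If you prefer the CLI, a sketch like the following should achieve the same result. This is an illustration with assumptions, not part of the workshop: the CloudFormation-generated function name must be looked up first, and `update-function-configuration` replaces the function's entire environment map, so it is only safe if `OUTPUT_QUEUE` is the function's sole variable:

```
# Look up the generated function name by substring match (adjust if several match).
fn=$(aws lambda list-functions \
  --query "Functions[?contains(FunctionName, 'IngestUnicornRawDataFunction')].FunctionName" \
  --output text)

# Set OUTPUT_QUEUE; note this overwrites ALL existing environment variables.
aws lambda update-function-configuration \
  --function-name "$fn" \
  --environment "Variables={OUTPUT_QUEUE=$queue_url}"
```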

<details>
<summary>2. Add an S3 trigger to <code>IngestUnicornRawDataFunction</code>. Trigger off your S3 bucket and <code>raw/</code> prefix.</summary><p>

1. Scroll up and click **Add trigger** in the Designer view
1. Select **S3**
1. Choose the data bucket you created
1. For the prefix, type `raw/`
1. Click **Add**

If the trigger won't save, make sure the S3 bucket does not have an identical active event ([Bucket](https://console.aws.amazon.com/s3) > Properties > Events).
</p></details>
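
A CLI sketch of the same wiring, with the usual caveats (this assumes `$fn` from the previous sketch, and note that `put-bucket-notification-configuration` replaces the bucket's entire notification configuration):

```
# Allow S3 to invoke the function.
aws lambda add-permission \
  --function-name "$fn" \
  --statement-id s3-raw-trigger \
  --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::$bucket

# Point ObjectCreated events under the raw/ prefix at the function.
fn_arn=$(aws lambda get-function --function-name "$fn" \
  --query 'Configuration.FunctionArn' --output text)
aws s3api put-bucket-notification-configuration --bucket $bucket \
  --notification-configuration '{"LambdaFunctionConfigurations":[{"LambdaFunctionArn":"'"$fn_arn"'","Events":["s3:ObjectCreated:*"],"Filter":{"Key":{"FilterRules":[{"Name":"prefix","Value":"raw/"}]}}}]}'
```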

<details>
<summary>3. Update the <code>OUTPUT_BUCKET</code> environment variable in <code>TransformAndMapDataFunction</code>. Set the value to your bucket name.</summary><p>

1. Open the [Lambda console](https://console.aws.amazon.com/lambda)
1. Open the function containing `TransformAndMapDataFunction` in the name
1. Scroll down and populate the `OUTPUT_BUCKET` key with the Bucket Name value from your scratchpad. Keep in mind, just provide the name of the data bucket you created earlier; it should not be fully qualified.
1. Click **Save**
</p></details>

<details>
<summary>4. Add an SQS trigger to <code>TransformAndMapDataFunction</code>. Trigger off your <code>IngestedRawDataFanOutQueue</code> queue.</summary><p>

1. Scroll up and click **Add trigger** in the Designer view
2. Select **SQS**
3. Choose the `IngestedRawDataFanOutQueue` queue you created
4. Click **Add**
</p></details><br>
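
For reference, the equivalent CLI call would look roughly like this (again a sketch: the generated `TransformAndMapDataFunction` name is looked up by substring, and the batch size shown is an assumed value):

```
# Look up the generated consumer function name.
consumer_fn=$(aws lambda list-functions \
  --query "Functions[?contains(FunctionName, 'TransformAndMapDataFunction')].FunctionName" \
  --output text)

# Subscribe the function to the queue so Lambda polls SQS on our behalf.
aws lambda create-event-source-mapping \
  --function-name "$consumer_fn" \
  --event-source-arn $queue_arn \
  --batch-size 10
```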

Let's recap what we created:
* A serverless data processing pipeline:
  1. A Lambda function that reads a large JSON file from S3 and places a message in a queue for each ride
  1. A queue that buffers messages for each ride
  1. A Lambda function that picks up messages from the queue and matches the nearest weather station
* Review the code for `TransformAndMapDataFunction`; the function does a lookup for the nearest weather station
* A preconfigured IAM role for the Lambda functions, scoped to the appropriate services
* We also have a [CloudWatch dashboard](https://console.aws.amazon.com/cloudwatch/home?#dashboards:name=Wild_Rydes_Machine_Learning) to monitor progress!

### Step 5: Test your pipeline

It's time to upload our ride telemetry data into our pipeline.

<details>
<summary>Upload <code>assets/ride_data.json</code> into <code>YOUR_DATA_BUCKET/raw/</code> (Expand for detailed instructions)</summary><p>

1. In your Cloud9 terminal, run the following code:
    ```
    # Run this command to upload the ride data
    aws s3 cp assets/ride_data.json s3://$bucket/raw/ride_data.json

    # Run this command to verify the file was uploaded (you should see the file name listed)
    aws s3 ls s3://$bucket/raw/
    ```
</p></details>

</p></details>

### Your fan-out is in progress!

It will take ~8 minutes to process all ~20k entries. Monitor the progress using:
* [CloudWatch dashboard](https://console.aws.amazon.com/cloudwatch/home?#dashboards:)
  * This shows you the number of invocations for each Lambda function on the left and some SQS message metrics on the right.
  * Refresh the dashboard to see how the plotted invocation count changes. When you see the invocation count settle back to zero, that indicates all of your data has been processed. Until then, feel free to explore the graphs and options.
* [SQS console](https://console.aws.amazon.com/sqs)
  * This shows the number of messages flowing through `IngestedRawDataFanOutQueue`. There are also dead letter queues set up in case things go wrong.
* Run `aws s3 ls s3://$bucket/processed/ | wc -l` in your Cloud9 terminal
  * This provides a count of the number of entries in your processed folder as the pipeline progresses.
  * When complete, you should have 19,995 records.
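
If you'd rather not re-run the count by hand, a simple polling loop like this (an optional convenience, assuming the stock Cloud9 shell) works too:

```
# Print the processed-object count every 30 seconds; stop with Ctrl+C
# once the count stops growing (expect 19,995 when the fan-out finishes).
while true; do
  aws s3 ls s3://$bucket/processed/ | wc -l
  sleep 30
done
```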
## Next step:

We're ready to proceed with building and training a [machine learning model](../2_ModelBuilding).