Commit 4405e28

Merge pull request #210 from andkret/Version-3
Merge with newer version
2 parents 30ffd61 + 448bd5e commit 4405e28

2 files changed, +193 -248 lines changed

sections/01-Introduction.md

Lines changed: 18 additions & 205 deletions
@@ -5,11 +5,7 @@ Introduction
 ## Contents
 
 - [What is this Cookbook](01-Introduction.md#what-is-this-cookbook)
-- [Data Engineer vs Data Scientist](01-Introduction.md#data-engineer-vs-data-scientist)
-- [Data Engineer](01-Introduction.md#data-engineer)
-- [Data Scientist](01-Introduction.md#data-scientist)
-- [Machine Learning Workflow](01-Introduction.md#machine-learning-workflow)
-- [Machine Learning Model and Data](01-Introduction.md#machine-learning-model-and-data)
+- [Data Engineers](01-Introduction.md#data-engineers)
 - [My Data Science Platform Blueprint](01-Introduction.md#my-data-science-platform-blueprint)
 - [Connect](01-Introduction.md#connect)
 - [Buffer](01-Introduction.md#buffer)
@@ -59,33 +55,25 @@ You can also write me an email any time to
 plumbersofdatascience\@gmail.com anytime.
 
 **This Cookbook is and will always be free!**
-I don't want to sell you this book, but please support what you like and
-join my Patreon: <https://www.patreon.com/plumbersofds>.
-Or send me a message and support through PayPal: <https://paypal.me/feedthestream>
-
-Check out this podcast episode where I talk in detail why I decided to
-share all this information for free: [\#079 Trying to stay true to
-myself and making the cookbook public on
-GitHub](https://youtu.be/k1bS5aSPos8)
-
 
 
 ## If You Like This Book & Need More Help:
-Check out my Data Engineering Academy and personal Coaching at LearnDataEngineering.com
+Check out my Data Engineering Academy at LearnDataEngineering.com
 
 **Visit learndataengineering.com:** [Click Here](https://learndataengineering.com)
 
-- New content every week!
-- Step by step course, from researching job postings to creating and doing your project, to job application tips.
-- Full AWS Data Engineering example project (Azure in development).
-- 1+ hours Ultimate Introduction to Data Engineering course.
-- Data Engineering Fundamentals course.
-- Data Platform & Pipeline Design course.
-- Apache Spark Fundamentals course.
-- Choosing Data Stores Course.
-- Private Member Slack Workspace (lifetime access).
-- Weekly Q&A live stream & Archive.
-- Currently over 24 hours of videos.
+- Huge Step by step Data Engineering Course
+- Unlimited access incl. future courses during subsciption
+- Access to all courses and example projects in the Academy
+- Associate Data Engineer Certification
+- Data Engineering on AWS E-Commerce example project
+- Microsoft Azure example project
+- Document Streaming example project with Docker, FastAPI, Apache Kafka, Apache Spark,
+- MongoDB and Streamlit
+- Time Series example project with InfluxDB and Grafana
+- Lifetime access to the private Discord Workspace
+- Course certificates
+- Currently over 40 hours of videos
 
 
 ## Support This Book For Free!
@@ -102,20 +90,12 @@ Please use the "Issues" function for comments.
 
 
 
-Data Engineer vs Data Scientist
+Data Engineers
 -------------------------------
 
 
-| Podcast Episode: #050 Data Engineer, Scientist or Analyst - Which One Is For You?
-|-----------------------------------------------------------------------------------
-| In this podcast we talk about the differences between data scientists, analysts and engineers. Which are the three main data science jobs. All three are super important. This makes it easy to decide
-| [Watch on YouTube](https://youtu.be/64TYZETOEdQ) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/050-Data-Engineer-Scientist-or-Analyst-Which-One-Is-For-You-e45ibl)
-
-
-### Data Engineer
-
 Data Engineers are the link between the management's data strategy
-and the data scientists who need to work with data.
+and the data scientists or analysts that need to work with data.
 
 What they do is build the platforms that enable data scientists to do
 their magic.
@@ -148,159 +128,6 @@ infrastructure like at Amazon or Google, or on-premise hardware.
 |In this episode Kate Strachnyi interviews me for her humans of data science podcast. We talk about how I found out that I am more into the engineering part of data science.
 | [Watch on YouTube](https://youtu.be/pIZkTuN5AMM) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/048-From-Wannabe-Data-Scientist-To-Engineer-My-Journey-e45i2o)|
 
-### Data Scientist
-
-Data scientists aren't like every other scientist.
-
-Data scientists do not wear white coats or work in high-tech labs full
-of science fiction movie equipment. They work in offices just like you
-and me.
-
-What differs them from most of us is that they are math experts. They
-use linear algebra and multivariable calculus to create new insight from
-existing data.
-
-How exactly does this insight look?
-
-Here's an example:
-
-An industrial company produces a lot of products that need to be tested
-before shipping.
-
-Usually such tests take a lot of time because there are hundreds of
-things to be tested -- all to make sure that your product is not broken.
-
-Wouldn't it be great to know early if a test fails ten steps down the
-line? if you knew that you could skip the other tests and just trash the
-product or repair it?
-
-That's exactly where a data scientist can help you, big time. This field
-is called predictive analytics, and the technique of choice is machine
-learning.
-
-Machine what? Learning?
-
-Yes, machine learning, it works like this:
-
-You feed an algorithm with measurement data. It generates a model and
-optimises it based on the data you fed it. That model basically
-represents a pattern of how your data looks. You show that model
-new data, and the model will tell you if the data still represents the
-data you have trained it with. This technique can also be used for
-predicting machine failure in advance with machine learning. Of course,
-the whole process is not that simple.
-
-The actual process of training and applying a model is not that hard. A
-lot of work for the data scientist is to figure out how to pre-process
-the data that gets fed to the algorithms.
-
-In order to train an algorithm, you need useful data. If you use just any data
-for the training the produced model will be very unreliable.
-
-An unreliable model for predicting machine failure would tell you that
-your machine is damaged even if it is not. Or even worse: It would tell
-you the machine is ok even when there is a malfunction.
-
-Model outputs are very abstract. You also need to post-process the model
-outputs to receive the outputs you desire
-
-![The Machine Learning Pipeline](/images/Machine-Learning-Pipeline.jpg)
-
-
-### Machine Learning Workflow
-
-![The Machine Learning Workflow](/images/Machine-Learning-Workflow.jpg)
-
-Data Scientists and Data Engineers. How does that all fit together?
-
-You have to look at the data science process, how stuff is created and how data
-science is done. How machine learning is
-done.
-
-The machine learning process shows that you start with a training phase, a phase where you basically train the algorithms to create the right output.
-
-In the learning phase, you have the input parameters (basically the configuration of the model), and you have the input data.
-
-What you do is train the algorithm. While training the algorithm modifies the training
-parameters, it also modifies the used data. Then you get an output.
-
-Once you get an output, you evaluate. Is that output okay, or is that output not the desired output?
-
-If the output is not what you were looking for, then you continue with the training phase.
-
-You may retrain the model hundreds, thousands, hundred thousands of times. Of course, all this is being done automatically.
-
-Once you are satisfied with the output, you put the model into production. In production, it is no longer fed with training
-data; it's fed with the live data.
-
-It evaluates the input data live and puts out live results.
-
-So, you went from training to production, and then what?
-
-What you do is monitor the output. If the output keeps making sense, all good!
-
-If the output of the model changes and it's on longer what you have expected, it means the model doesn't work anymore.
-
-You need to trigger model retraining.
-
-Once you are again satisfied with the output, you put it into production again. It replaces the one in production.
-
-This is the overall process of machine learning. It's how the learning part of data science works.
-
-
-### Machine Learning Model and Data
-
-![The Machine Learning Model](/images/Machine-Learning-Model.jpg)
-
-Now, that's all very nice.
-
-When you look at it, you have two very important places where you have data.
-
-You have in the training phase two types of data:
-data that you use for the training; data that basically configures the model, the hyperparameter configuration.
-
-Once you're in production, you have the live data streaming in, data from from an app, from
-a IoT device, logs, or whatever.
-
-A data catalog is also important. It explains which features are available and how different data sets are labeled.
-
-These are all different types of data. Now, here comes the engineering part.
-
-The Data Engineer's part is making this data available, available to the data scientist and the machine learning process.
-
-So, when you look at the model, on the left side you have your hyperparameter configuration. You need to store and manage these configurations somehow.
-
-Then you have the actual training data.
-
-There's a lot going on with the training data.
-
-Where does it come from? Who owns it? Which is basically data governance.
-
-What's the lineage? Have you modified this data? What did you do? What was the basis, the raw data?
-
-You need to access all this data somehow, in training and in production.
-
-In production, you need to have access to the live data.
-
-All this is the data engineer's job. Making the data available.
-
-First, an architect needs to build the platform. This can also be a good data engineer.
-
-Then, the data engineer needs to build the pipelines. How is the data coming in, and how does the platform
-connect to other systems.
-
-How is that data then put into the storage? Is pre-processing for the algorithms necessary? The data engineer will do it.
-
-Once the data and the systems are available, it's time for the machine learning part.
-
-It is ready for processing, for the data scientist.
-
-Once the analytics is done, the data engineer needs to build pipelines to make it then accessible again, for instance for other analytics processes, for APIs, for front ends, and so on.
-
-All in all, the data engineer's part is a computer science part.
-
-That's why I love it so much. :)
-
 
 ## My Data Science Platform Blueprint
 
@@ -462,20 +289,6 @@ build the perfect application.
 
 ## Who Companies Need
 
-For a company, it is important to have well-trained data
-engineers and data scientists. Think of the data scientist as a
-professional race car driver. A fit athlete with talent and driving
-skills like you have never seen before.
-
-What he needs to win races is someone who will provide him the perfect
-race car to drive. It is the data engineer/solution architect who will design and build the race car.
-
-Like the driver and the race car engineer, the data scientist and the data engineer need to work closely together. They need to know the different big-data tools inside out.
-
-That's why companies are looking for people with Spark experience. Spark is the common ground between the data engineer and the data scientist that drives innovation.
+For a company, it is important to have well-trained data engineers.
 
-Spark gives data scientists the tools to do analytics and helps
-engineers to bring the data scientist's algorithms into production.
-After all, those two decide how good the data platform is, how good the
-analytics insight is, and how fast the whole system gets into a
-production-ready state.
+That's why companies are looking for people with experience of tools in every part of the above platform blueprint. One common theme I see is cloud platform experience on AWS, Azure or GCP.
