| In this podcast we talk about the differences between data scientists, analysts and engineers, the three main data science jobs. All three are super important. This makes it easy to decide
112
-
| [Watch on YouTube](https://youtu.be/64TYZETOEdQ) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/050-Data-Engineer-Scientist-or-Analyst-Which-One-Is-For-You-e45ibl)
113
-
114
-
115
-
### Data Engineer
116
-
117
97
Data Engineers are the link between the management's data strategy
118
-
and the data scientists who need to work with data.
98
+
and the data scientists or analysts that need to work with data.
119
99
120
100
What they do is build the platforms that enable data scientists to do
121
101
their magic.
@@ -148,159 +128,6 @@ infrastructure like at Amazon or Google, or on-premise hardware.
148
128
|In this episode Kate Strachnyi interviews me for her Humans of Data Science podcast. We talk about how I found out that I am more into the engineering part of data science.
149
129
|[Watch on YouTube](https://youtu.be/pIZkTuN5AMM) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/048-From-Wannabe-Data-Scientist-To-Engineer-My-Journey-e45i2o)|
150
130
151
-
### Data Scientist
152
-
153
-
Data scientists aren't like every other scientist.
154
-
155
-
Data scientists do not wear white coats or work in high-tech labs full
156
-
of science fiction movie equipment. They work in offices just like you
157
-
and me.
158
-
159
-
What sets them apart from most of us is that they are math experts. They
160
-
use linear algebra and multivariable calculus to create new insight from
161
-
existing data.
162
-
163
-
How exactly does this insight look?
164
-
165
-
Here's an example:
166
-
167
-
An industrial company produces a lot of products that need to be tested
168
-
before shipping.
169
-
170
-
Usually such tests take a lot of time because there are hundreds of
171
-
things to be tested -- all to make sure that your product is not broken.
172
-
173
-
Wouldn't it be great to know early if a test fails ten steps down the
174
-
line? If you knew that, you could skip the other tests and just trash the
175
-
product or repair it?
176
-
177
-
That's exactly where a data scientist can help you, big time. This field
178
-
is called predictive analytics, and the technique of choice is machine
179
-
learning.
180
-
181
-
Machine what? Learning?
182
-
183
-
Yes, machine learning, it works like this:
184
-
185
-
You feed an algorithm with measurement data. It generates a model and
186
-
optimises it based on the data you fed it. That model basically
187
-
represents a pattern of how your data looks. You show that model
188
-
new data, and the model will tell you if the data still represents the
189
-
data you have trained it with. This technique can also be used for
190
-
predicting machine failure in advance. Of course,
191
-
the whole process is not that simple.
192
-
193
-
The actual process of training and applying a model is not that hard. A
194
-
lot of work for the data scientist is to figure out how to pre-process
195
-
the data that gets fed to the algorithms.
196
-
197
-
In order to train an algorithm, you need useful data. If you use just any data
198
-
for the training, the produced model will be very unreliable.
199
-
200
-
An unreliable model for predicting machine failure would tell you that
201
-
your machine is damaged even if it is not. Or even worse: It would tell
202
-
you the machine is ok even when there is a malfunction.
203
-
204
-
Model outputs are very abstract. You also need to post-process the model
When you look at it, you have two very important places where you have data.
258
-
259
-
You have in the training phase two types of data:
260
-
data that you use for the training; data that basically configures the model, the hyperparameter configuration.
261
-
262
-
Once you're in production, you have the live data streaming in, data from an app, from
263
-
an IoT device, logs, or whatever.
264
-
265
-
A data catalog is also important. It explains which features are available and how different data sets are labeled.
266
-
267
-
These are all different types of data. Now, here comes the engineering part.
268
-
269
-
The Data Engineer's part is making this data available, available to the data scientist and the machine learning process.
270
-
271
-
So, when you look at the model, on the left side you have your hyperparameter configuration. You need to store and manage these configurations somehow.
272
-
273
-
Then you have the actual training data.
274
-
275
-
There's a lot going on with the training data.
276
-
277
-
Where does it come from? Who owns it? Which is basically data governance.
278
-
279
-
What's the lineage? Have you modified this data? What did you do? What was the basis, the raw data?
280
-
281
-
You need to access all this data somehow, in training and in production.
282
-
283
-
In production, you need to have access to the live data.
284
-
285
-
All this is the data engineer's job. Making the data available.
286
-
287
-
First, an architect needs to build the platform. This can also be a good data engineer.
288
-
289
-
Then, the data engineer needs to build the pipelines. How is the data coming in, and how does the platform
290
-
connect to other systems?
291
-
292
-
How is that data then put into the storage? Is pre-processing for the algorithms necessary? The data engineer will do it.
293
-
294
-
Once the data and the systems are available, it's time for the machine learning part.
295
-
296
-
It is ready for processing, for the data scientist.
297
-
298
-
Once the analytics is done, the data engineer needs to build pipelines to make the results accessible again, for instance for other analytics processes, for APIs, for front ends, and so on.
299
-
300
-
All in all, the data engineer's part is a computer science part.
301
-
302
-
That's why I love it so much. :)
303
-
304
131
305
132
## My Data Science Platform Blueprint
306
133
@@ -462,20 +289,6 @@ build the perfect application.
462
289
463
290
## Who Companies Need
464
291
465
-
For a company, it is important to have well-trained data
466
-
engineers and data scientists. Think of the data scientist as a
467
-
professional race car driver. A fit athlete with talent and driving
468
-
skills like you have never seen before.
469
-
470
-
What he needs to win races is someone who will provide him the perfect
471
-
race car to drive. It is the data engineer/solution architect who will design and build the race car.
472
-
473
-
Like the driver and the race car engineer, the data scientist and the data engineer need to work closely together. They need to know the different big-data tools inside out.
474
-
475
-
That's why companies are looking for people with Spark experience. Spark is the common ground between the data engineer and the data scientist that drives innovation.
292
+
For a company, it is important to have well-trained data engineers.
476
293
477
-
Spark gives data scientists the tools to do analytics and helps
478
-
engineers to bring the data scientist's algorithms into production.
479
-
After all, those two decide how good the data platform is, how good the
480
-
analytics insight is, and how fast the whole system gets into a
481
-
production-ready state.
294
+
That's why companies are looking for people with experience of tools in every part of the above platform blueprint. One common theme I see is cloud platform experience on AWS, Azure or GCP.