
Commit 87cf92a

Author: andkret (committed)
Removed Hadoop
Removed the whole Hadoop section. It's just not that important anymore.
1 parent f88b4f4 commit 87cf92a

File tree

2 files changed: +0 −158 lines changed


README.md

Lines changed: 0 additions & 7 deletions
```diff
@@ -130,13 +130,6 @@ If you look for the old PDF version it's [here](https://github.com/andkret/Cookb
 - [Scaling Up](sections/03-AdvancedSkills.md#scaling-up)
 - [Scaling Out](sections/03-AdvancedSkills.md#scaling-out)
 - [When not to Do Big Data](sections/03-AdvancedSkills.md#please-dont-go-big-data)
-- [Hadoop Platforms](sections/03-AdvancedSkills.md#hadoop-platforms)
-- [What is Hadoop](sections/03-AdvancedSkills.md#what-is-hadoop)
-- [What makes Hadoop so popular](sections/03-AdvancedSkills.md#what-makes-hadoop-so-popular)
-- [Hadoop Ecosystem Components](sections/03-AdvancedSkills.md#hadoop-ecosystem-components)
-- [Hadoop is Everywhere?](sections/03-AdvancedSkills.md#hadoop-is-everywhere)
-- [Should You Learn Hadoop?](sections/03-AdvancedSkills.md#should-you-learn-hadoop)
-- [How to Select Hadoop Cluster Hardware](sections/03-AdvancedSkills.md#how-to-select-hadoop-cluster-hardware)
 - [Connect](sections/03-AdvancedSkills.md#connect)
 - [REST APIs](sections/03-AdvancedSkills.md#rest-apis)
 - [API Design](sections/03-AdvancedSkills.md#api-design)
```

sections/03-AdvancedSkills.md

Lines changed: 0 additions & 151 deletions
```diff
@@ -14,13 +14,6 @@ Advanced Data Engineering Skills
 - [Scaling Up](03-AdvancedSkills.md#scaling-up)
 - [Scaling Out](03-AdvancedSkills.md#scaling-out)
 - [When not to Do Big Data](03-AdvancedSkills.md#please-dont-go-big-data)
-- [Hadoop Platforms](03-AdvancedSkills.md#hadoop-platforms)
-- [What is Hadoop](03-AdvancedSkills.md#what-is-hadoop)
-- [What makes Hadoop so popular](03-AdvancedSkills.md#what-makes-hadoop-so-popular)
-- [Hadoop Ecosystem Components](03-AdvancedSkills.md#hadoop-ecosystem-components)
-- [Hadoop is Everywhere?](03-AdvancedSkills.md#hadoop-is-everywhere)
-- [Should You Learn Hadoop?](03-AdvancedSkills.md#should-you-learn-hadoop)
-- [How to Select Hadoop Cluster Hardware](03-AdvancedSkills.md#how-to-select-hadoop-cluster-hardware)
 - [Connect](03-AdvancedSkills.md#connect)
 - [REST APIs](03-AdvancedSkills.md#rest-apis)
 - [API Design](03-AdvancedSkills.md#api-design)
```
```diff
@@ -340,150 +333,6 @@ If you don't need it it's making absolutely no sense at all!
 On the other side: If you really need big data tools they will save your
 ass :)
 
-## Hadoop Platforms
-
-When people talk about big data, one of the first things come to mind is
-Hadoop. Google's search for Hadoop returns about 28 million results.
-
-It seems like you need Hadoop to do big data. Today I am going to shed
-light onto why Hadoop is so trendy.
-
-You will see that Hadoop has evolved from a platform into an ecosystem.
-Its design allows a lot of Apache projects and 3rd party tools to
-benefit from Hadoop.
-
-I will conclude with my opinion on, if you need to learn Hadoop and if
-Hadoop is the right technology for everybody.
-
-### What is Hadoop
-
-Hadoop is a platform for distributed storing and analyzing of very large
-data sets.
-
-Hadoop has four main modules: Hadoop common, HDFS, MapReduce and YARN.
-The way these modules are woven together is what makes Hadoop so
-successful.
-
-The Hadoop common libraries and functions are working in the background.
-That's why I will not go further into them. They are mainly there to
-support Hadoop's modules.
-
-| Podcast Episode: #060 What Is Hadoop And Is Hadoop Still Relevant In 2019? |
-|------------------|
-| An introduction into Hadoop HDFS, YARN and MapReduce. Yes, Hadoop is still relevant in 2019 even if you look into serverless tools. |
-| [Watch on YouTube](https://youtu.be/8AWaht3YQgo) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/060-What-Is-Hadoop-And-Is-Hadoop-Still-Relevant-In-2019-e45ijp) |
-
```
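The MapReduce model named in the removed text can be illustrated without any Hadoop at all. Here is a minimal word-count sketch in plain Python, assuming nothing beyond the three classic phases; all function names are made up for the illustration and this is not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "hadoop handles big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In a real Hadoop job the map and reduce functions run on many machines and the shuffle moves data across the network; the toy above only mirrors the data flow.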
```diff
-### What makes Hadoop so popular?
-
-Storing and analyzing data as large as you want is nice. But what makes
-Hadoop so popular?
-
-Hadoop's core functionality is the driver of Hadoop's adoption. Many
-Apache side projects use it's core functions.
-
-Because of all those side projects Hadoop has turned more into an
-ecosystem. An ecosystem for storing and processing big data.
-
-To better visualize this eco system I have drawn you the following
-graphic. It shows some projects of the Hadoop ecosystem who are closely
-connected with the Hadoop.
-
-It is not a complete list. There are many more tools that even I don't
-know. Maybe I am drawing a complete map in the future.
-
-![Hadoop Ecosystem Components](/images/Hadoop-Ecosystem.jpg)
-
-### Hadoop Ecosystem Components
-
-Remember my big data platform blueprint? The blueprint has four stages:
-Ingest, store, analyse and display.
-
-Because of the Hadoop ecosystem the different tools in these stages can
-work together perfectly.
-
-Here's an example:
-
-![Connections between tools](/images/Hadoop-Ecosystem-Connections.jpg)
-
-You use Apache Kafka to ingest data, and store it in the HDFS. You do
-the analytics with Apache Spark and as a backend for the display you
-store data in Apache HBase.
-
-To have a working system you also need YARN for resource management. You
-also need Zookeeper, a configuration management service to use Kafka and
-HBase
-
-As you can see in the picture below each project is closely connected to
-the other.
-
-Spark for instance, can directly access Kafka to consume messages. It is
-able to access HDFS for storing or processing stored data.
-
-It also can write into HBase to push analytics results to the front end.
-
```
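The removed Kafka → HDFS → Spark → HBase example maps onto the blueprint's four stages (ingest, store, analyse, display). As a rough, purely illustrative analogy, here is a toy in-memory version of those stages in plain Python; the real tools are distributed systems, and every name below is invented for the sketch:

```python
from collections import Counter

def ingest(events):
    """Ingest stage (Kafka's role in the example): accept raw events."""
    return list(events)

def store(raw_events, storage):
    """Store stage (HDFS's role): append raw events to durable storage."""
    storage.extend(raw_events)
    return storage

def analyse(storage):
    """Analyse stage (Spark's role): aggregate the stored events."""
    return Counter(event["page"] for event in storage)

def display_backend(results, serving_store):
    """Display stage (HBase's role): keyed store the frontend can query."""
    serving_store.update(results)
    return serving_store

storage, serving = [], {}
events = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
store(ingest(events), storage)
serving = display_backend(analyse(storage), serving)
print(serving["/home"])  # 2
```

The point of the analogy is only the separation of stages: each stage hands a well-defined output to the next, which is what lets the ecosystem swap tools per stage.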
```diff
-The cool thing of such ecosystem is that it is easy to build in new
-functions.
-
-Want to store data from Kafka directly into HDFS without using Spark?
-
-No problem, there is a project for that. Apache Flume has interfaces for
-Kafka and HDFS.
-
-It can act as an agent to consume messages from Kafka and store them
-into HDFS. You even do not have to worry about Flume resource
-management.
-
-Flume can use Hadoop's YARN resource manager out of the box.
-
-![Flume Integration](/images/Hadoop-Ecosystem-Connections-Flume.jpg)
-
```
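A Flume agent for that Kafka-to-HDFS path is typically wired together in a properties file: a Kafka source feeding an HDFS sink through a channel. The snippet below is a hedged sketch, not from the book; the agent name, broker, topic, and path are placeholders:

```properties
# One Flume agent: Kafka source -> memory channel -> HDFS sink (illustrative)
agent.sources = kafka-in
agent.channels = mem
agent.sinks = hdfs-out

agent.sources.kafka-in.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-in.kafka.bootstrap.servers = broker1:9092
agent.sources.kafka-in.kafka.topics = events
agent.sources.kafka-in.channels = mem

agent.channels.mem.type = memory

agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = hdfs://namenode/data/events/%Y-%m-%d
agent.sinks.hdfs-out.hdfs.fileType = DataStream
agent.sinks.hdfs-out.channel = mem
```

Consult the Flume user guide for the exact property names of your Flume version before relying on this layout.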
```diff
-### Hadoop Is Everywhere?
-
-Although Hadoop is so popular it is not the silver bullet. It isn't the
-tool that you should use for everything.
-
-Often times it does not make sense to deploy a Hadoop cluster, because
-it can be overkill. Hadoop does not run on a single server.
-
-You basically need at least five servers, better six to run a small
-cluster. Because of that. the initial platform costs are quite high.
-
-One option you have is to use a specialized systems like Cassandra,
-MongoDB or other NoSQL DB's for storage. Or you move to Amazon and use
-Amazon's Simple Storage Service, or S3.
-
-Guess what the tech behind S3 is. Yes, HDFS. That's why AWS also has the
-equivalent to MapReduce named Elastic MapReduce.
-
-The great thing about S3 is that you can start very small. When your
-system grows you don't have to worry about S3's server scaling.
-
-### Should you learn Hadoop?
-
-Yes, I definitely recommend you to get to know how Hadoop works and how
-to use it. As I have shown you in this article, the ecosystem is quite
-large.
-
-Many big data projects use Hadoop or can interface with it. That's why
-it is generally a good idea to know as many big data technologies as
-possible.
-
-Not in depth, but to the point that you know how they work and how you
-can use them. Your main goal should be to be able to hit the ground
-running when you join a big data project.
-
-Plus, most of the technologies are open source. You can try them out for
-free.
-
-### How does a Hadoop System architecture look like
-
-### What tools are usually in a with Hadoop Cluster
-
-Yarn Zookeeper HDFS Oozie Flume Hive
-
-### How to select Hadoop Cluster Hardware
-
 
 ## Connect
```

0 commit comments
