Hadoop works on big data
Relational databases have scalability issues: they can't handle TB and PB of data, aren't fast enough (real time etc), and are limited in queryability and sophisticated processing (eg ML)
One of the core components of Hadoop is an alternate file system - HDFS - Hadoop Distributed File System
NoSQL databases –> key/value (Redis), wide column store (HBase), document, graph etc
Hadoop is simply a new/alternative filesystem with a processing library. Hadoop is most commonly implemented with HBase. GFS (Google File System) was developed at Google to index the www. The open source community took the GFS whitepaper and built HDFS from it, so the two are very similar
HBase is a NoSQL database - it is a wide column store - each row has a key and n values, eg: customer id as the key and customer info as key/value pairs
What makes it NoSQL is that the attributes can vary: you can have different attributes for different customers
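
A minimal sketch of the "key plus varying attributes" idea, in plain Python rather than the HBase client API. The customer ids, attribute names, and helper functions are made up for illustration; the point is only that two rows need not share the same attributes.

```python
# Plain-Python model of a wide column store layout (NOT the HBase API).
# Row keys and attribute names are hypothetical.
customers = {
    "cust-001": {"name": "Ana", "email": "ana@example.com"},
    "cust-002": {"name": "Raj", "phone": "555-0100", "loyalty_tier": "gold"},
}


def get_attribute(table, row_key, attribute, default=None):
    """Look up one attribute of one row; a missing attribute is not an error."""
    return table.get(row_key, {}).get(attribute, default)


def attributes_in_use(table):
    """Return the sorted union of attribute names across all rows."""
    names = set()
    for row in table.values():
        names.update(row)
    return sorted(names)


print(get_attribute(customers, "cust-001", "phone"))  # cust-001 has no phone
print(attributes_in_use(customers))
```

Each row carries only the attributes it actually has, so adding a new attribute for one customer does not require changing any other row.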
CAP theorem
Consistency - every read sees the latest write (what transactions rely on). Availability - up-time (data is always available). Partition tolerance - the system keeps working when the network between nodes is split. In practice the data is stored as chunks across nodes, which is also what makes parallel processing possible
Traditional RDBMS have C and A
The CAP theorem says a distributed database can guarantee only two of the three at once. Companies need scalability, which Hadoop has. Hadoop has P - data is split into chunks across nodes and 3 copies of each chunk are made by default. It has A - because of the 3 copies, your data is still there when a node fails
Hadoop doesn't have C - so mission-critical data, eg transactional data, is not a good fit for Hadoop
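
A toy model of why 3 copies help availability. This is not real HDFS placement logic: the node names and the round-robin placement are assumptions for illustration only.

```python
# Toy model of block replication (NOT the real HDFS placement policy).
REPLICATION = 3


def place_replicas(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
placement = place_replicas(["blk-1", "blk-2", "blk-3"], nodes)

# Two nodes down: every block still has a surviving copy.
print(readable_blocks(placement, failed_nodes={"node-a", "node-b"}))
```

With 3 replicas, a block only becomes unreadable when all three nodes holding it are down at the same time.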
What kind of data for Hadoop? LOB (line of business) transactional data - not a good fit
Behavioral data - this data will be batch processed (not record by record) - good fit for Hadoop, where C isn't very critical
The big data in HDFS can be processed with both RDBMS-style (SQL-like) and NoSQL query methods
What is Hadoop? 2 components + Projects
- HDFS (an open source data storage method)
- MapReduce - processing API
- Hive, Pig, HBase (optional)
Hadoop distributions:
- Open source: Apache Hadoop
- Commercial: Cloudera, Hortonworks, MapR
- Cloud: AWS, Azure HDInsight
On cloud (eg AWS), you can implement Apache's open source version or one of the commercial ones (eg Cloudera)
The commercial folks provide additional tooling, management, support, libraries etc
Why Hadoop? Cheaper, faster (parallel data processing), better (for particular types of big data)
Hadoop has the MapReduce processing algorithm for parallel data processing
Examples of behavioral data use cases: recommendation engine, risk modelling, customer churn analysis, anomaly detection, ad targeting, search quality
Difference between HBase and Hadoop: Hadoop core is HDFS plus a processing framework (MapReduce); HBase is a NoSQL database that runs on top of it. HDFS is a big data filesystem, so each chunk (block) is 64 MB or 128 MB by default, and can be configured to be even larger
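
A quick arithmetic sketch of what the chunk size means for one file. It assumes a 128 MB block size and 3-fold replication; both are configurable, and the helper names are made up for illustration.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Number of blocks needed for a file (integer ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


def raw_storage_bytes(file_size_bytes, replication=3):
    """Total bytes stored across the cluster once every block is replicated."""
    return file_size_bytes * replication


one_gb = 1024 * MB
print(block_count(one_gb))              # 8 blocks at 128 MB
print(block_count(one_gb, 64 * MB))     # 16 blocks at 64 MB
print(raw_storage_bytes(one_gb) // MB)  # 3072 MB of raw storage
```

Larger blocks mean fewer blocks to track per file, which is one reason the chunk size is so much bigger than on a regular filesystem.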
MapReduce is written in Java. Folks don't want to work with the data in HDFS directly; they want an abstraction, an API that makes the processing easier. Loosely speaking they want something like an ORM, and MapReduce plays that role
MapReduce : Hadoop :: C++ : OOP
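
To make the map and reduce steps concrete, here is a single-process Python sketch of the pattern using word count. Real Hadoop jobs are typically written in Java against the MapReduce API and run in parallel across nodes; this toy version runs in one process, and the function names are illustrative only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data big clusters", "data in HDFS"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is where the parallelism comes from.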
HBase stores an id and a dict associated with that id. This is called schema on read: since the attributes aren't enforced on each id, each id can have its own attributes. Hive is a SQL-like query language for data in Hadoop (it can also query HBase tables)
Hadoop processes run in separate JVMs. JVMs don't share state. The JVM processes differ between Hadoop 1 and 2
Hadoop can be used with multiple filesystems as well, eg:
HDFS (larger chunks, 3-fold replication) –> it has 2 modes (fully distributed - 3 copies, pseudo-distributed - used for testing)
Regular (local) file system aka standalone mode, works on a single node
Cloud file system –> S3 on AWS, Blob storage on Azure
Files and JVMs: standalone (single node) - local file system, single JVM
Pseudo-distributed: uses HDFS, the JVM daemons all run on a single machine
Fully distributed: uses HDFS, the daemons run on multiple machines
Hadoop ecosystem:
- HDFS
- HBase
- YARN - yet another resource negotiator; the resource manager introduced in Hadoop 2, which MapReduce version 2 runs on
- Hive - SQL-like query language for data in Hadoop
- Pig - scripting language used for ETL (extract, transform, load) like processes
- Mahout - ML
- Oozie - workflows and scheduling
- ZooKeeper - coordination
- Sqoop - data exchanger
- Flume - log collector
- Ambari - manages Hadoop clusters
Cloudera additionally has Hue - a UI framework which makes interacting with Hadoop easier
Impala, Hive, Pig are all used to query the data and provide a layer of abstraction. Hive and Pig are turned into MapReduce jobs, while Impala uses its own query engine instead of MapReduce. Hive has HQL - Hive query language - it is very similar to the SQL used in RDBMS
You can set up EC2 and install Hadoop yourself OR use Elastic (Up and running with )