All things involving data processing, such as Hadoop, Spark, and Pentaho.
- Ansible Script for EC2
Resources:
- Hadoop: The Definitive Guide - O'Reilly Media
- Udemy: The Hadoop Ecosystem Masterclass
- Apache Ambari
- HDP
- Numerical Summarization
- Inverted Index Summarization
- Filtering
- Bloom Filter
- Top N
- Distinct
- Structured to Hierarchical
- Partitioning
- Binning
- Total Order Sorting
- Shuffling
- Reduce Side Join
- Replicated Join
- Composite Join
- Cartesian Product
- Job Chaining
- Chain Folding
- Job Merging
- Input and Output
- Generalizing Data
- External Output
- External Input
- Partition Pruning
Resources: MapReduce Design Patterns - O'Reilly Media
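
As an illustration of how the patterns above map onto the map/shuffle/reduce phases, here is a minimal plain-Python sketch of Inverted Index Summarization. It is a simulation under stated assumptions, not Hadoop API code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `inverted_index`) and the sample documents are hypothetical, and the shuffle is emulated in memory with a dictionary.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit one (term, doc_id) pair per distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(term, doc_ids):
    """Reducer: collapse the grouped doc ids into a sorted posting list."""
    return term, sorted(set(doc_ids))

def inverted_index(docs):
    """Run map, shuffle, and reduce over a {doc_id: text} mapping."""
    pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())

# Hypothetical sample input, for illustration only.
docs = {
    "d1": "hadoop stores data",
    "d2": "spark processes data",
    "d3": "hadoop and spark",
}
index = inverted_index(docs)
print(index["data"])    # ['d1', 'd2']
print(index["hadoop"])  # ['d1', 'd3']
```

Most of the other summarization and filtering patterns follow the same shape and differ mainly in what the mapper emits and how the reducer combines the grouped values.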
- Spring for Hadoop
- Spring Cloud Dataflow
- Spring Integration
- Zeppelin
- Ambari
- HDFS
- HDFS NFS Gateway - mount HDFS to the local file system
- AWS S3
- MapReduce2 - batch processing
- Spark - alternative to MR2; mostly in-memory, with a simpler developer API; executes jobs as DAGs
- Pig - scripting layer on top of MR2 or Tez
- Ignite
- Tez - execution engine used by Pig and Hive; runs the whole job as a single optimized DAG on YARN; you probably won't use it directly
- Spark Streaming - runs the app continuously and processes the stream in micro-batches; supports windowed operations
- Storm - unlike Spark Streaming, processes each event as it arrives (true real-time) instead of in micro-batches
- Flink - distributed stream and batch processing engine; highly scalable and fault tolerant
- HBase - NoSQL store for OLTP-style random reads and writes, with data stored on HDFS
- Hive - SQL facade for Hadoop
- Drill - SQL engine for files, Hive, RDS, MongoDB, etc.
- Phoenix - SQL engine for HBase
- MRQL
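- Sqoop - SQL+Hadoop; transfers data between an RDBMS and HDFS/Hive in both directions
- Flume - collects and aggregates streaming event/log data and delivers it to sinks such as HDFS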
- NiFi
- Avro - data serialization and exchange; the schema travels with the data; supports forward and backward compatibility
- Parquet - columnar file format
- Spring Cloud Dataflow
- Oozie - job coordination, workflow, decision, error handling
- LinkedIn Azkaban
- Falcon
- Schedoscope
- Thrift - RPC framework; Hive's Thrift server exposes Hive to JDBC/ODBC clients
- Livy - REST API for managing Spark jobs
- Kafka
- Solr
- Mahout - distributed recommendation, classification, and clustering engine
- YARN
- ZooKeeper - distributed coordination and configuration management
- Curator - client library on top of ZooKeeper; use it to keep your sanity
- LinkedIn Norbert
- Mesos
- Mahout
- SystemML
- Sentry
- Knox Gateway
- Ranger
- Atlas - data governance, metadata management, auditing, security and policy engine
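
The Spark Streaming and Storm entries above contrast micro-batching with per-event processing. The following is a minimal plain-Python sketch of micro-batching with a sliding window, assuming events arrive as `(timestamp, key)` pairs. It does not use the Spark or Storm APIs; the function names, the batch interval, and the window size are hypothetical choices for illustration.

```python
from collections import Counter, deque

def micro_batches(events, batch_interval):
    """Group (timestamp, key) events into consecutive fixed-duration batches."""
    batches = {}
    for ts, key in events:
        batches.setdefault(ts // batch_interval, []).append(key)
    if not batches:
        return
    # Emit every batch in order, including empty ones between first and last.
    for i in range(min(batches), max(batches) + 1):
        yield i, batches.get(i, [])

def windowed_counts(events, batch_interval, window_batches):
    """Yield (batch_index, counts) over a sliding window of the last N batches."""
    window = deque(maxlen=window_batches)  # oldest batch drops out automatically
    for i, batch in micro_batches(events, batch_interval):
        window.append(Counter(batch))
        total = Counter()
        for counts in window:
            total.update(counts)
        yield i, dict(total)

# Hypothetical event stream: (timestamp in seconds, key).
events = [(0, "a"), (1, "b"), (2, "a"), (5, "a"), (6, "b"), (7, "b")]
results = list(windowed_counts(events, batch_interval=2, window_batches=2))
for batch_index, counts in results:
    print(batch_index, counts)
```

A per-event engine would instead update its state and emit a result for every incoming event, trading the throughput benefits of batching for lower latency.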