Configuration As Code
We use YAML to define the EMR job flow definitions. YAML is designed to be human-readable and is a superset of JSON.
You can refer to this guide to learn YAML.
The YAML configuration files follow the run_job_flow request syntax from the boto3 library documentation: https://boto3.readthedocs.io/en/latest/reference/services/emr.html#EMR.Client.run_job_flow
All job flow definitions must be specified under the emr key.
To provide flexibility and reusability, put the job flow definitions that are the same across all environments in the default.yaml configuration file, and put any environment-specific definitions in separate configuration files.
You then pass these configuration files to the script, which merges their properties and creates the EMR cluster from the consolidated definition.
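The merge step can be pictured with a minimal sketch. The source does not describe the script's exact merge rules, so this assumes a recursive dictionary merge in which later files win; `merge_configs`, `dev_config`, and all values shown are hypothetical, and the dictionaries stand in for YAML files that have already been parsed.

```python
import copy


def merge_configs(base, override):
    """Return a new dict with `override` recursively merged into `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Assumption: scalars and lists from the later file replace the earlier value.
            merged[key] = copy.deepcopy(value)
    return merged


# Stand-ins for the parsed contents of default.yaml and a hypothetical dev.yaml.
default_config = {"emr": {"ReleaseLabel": "emr-5.9.0", "ServiceRole": "EMR_DefaultRole"}}
dev_config = {"emr": {"Name": "dev-cluster", "LogUri": "s3://example-bucket/logs/"}}

consolidated = merge_configs(default_config, dev_config)
print(consolidated)
```

Returning a new dictionary leaves the shared defaults untouched, so the same default.yaml contents can be combined with several environment files in one run.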
Currently, only the following job flow definitions are supported:
1. EMR Name
2. Log Uri
3. Release Label
4. Instances
5. Applications
6. BootstrapActions
7. Configurations
8. JobFlowRole
9. ServiceRole
10. Tags
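One way to catch unsupported definitions early is to compare the final run_job_flow arguments against this list. The sketch below is not part of the script described here; it assumes the list items correspond to the run_job_flow parameter names Name, LogUri, ReleaseLabel, and so on, and `unsupported_keys` is a hypothetical helper.

```python
# Assumption: "EMR Name", "Log Uri" and "Release Label" in the list map to the
# run_job_flow parameters Name, LogUri and ReleaseLabel.
SUPPORTED_KEYS = {
    "Name", "LogUri", "ReleaseLabel", "Instances", "Applications",
    "BootstrapActions", "Configurations", "JobFlowRole", "ServiceRole", "Tags",
}


def unsupported_keys(job_flow_args):
    """Return the argument names that are not in the supported list, sorted."""
    return sorted(set(job_flow_args) - SUPPORTED_KEYS)


# Steps is a valid run_job_flow parameter, but it is not in the supported list.
example_args = {"Name": "example-cluster", "ReleaseLabel": "emr-5.9.0", "Steps": []}
print(unsupported_keys(example_args))
```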
For example, consider a simple case where we want to define the Applications for the cluster. As per the boto reference guide, the request syntax is:
    Applications=[
        {
            'Name': 'string',
            'Version': 'string',
            'Args': [
                'string',
            ],
            'AdditionalInfo': {
                'string': 'string'
            }
        },
    ]
We define the above in our YAML config as follows:
    emr:
      Applications:
        - Name: Spark
          Version: 2.2.0
        - Name: Ganglia
Notice that the keys Applications, Name, and Version are named exactly as in the boto reference guide.
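Because the key names match, the parsed YAML can be handed to run_job_flow as keyword arguments without any translation. The following is a minimal sketch of that idea rather than the script's actual code: `launch_cluster` and `FakeEmrClient` are hypothetical, and the fake client stands in for `boto3.client("emr")` so the example runs without AWS access.

```python
def launch_cluster(emr_client, config):
    """Pass the definitions under the `emr` key to run_job_flow as keyword arguments."""
    job_flow_args = dict(config["emr"])
    return emr_client.run_job_flow(**job_flow_args)


class FakeEmrClient:
    """Hypothetical stand-in for boto3.client("emr"); it only records the call."""

    def run_job_flow(self, **kwargs):
        self.received = kwargs
        return {"JobFlowId": "j-EXAMPLE"}


# What a YAML loader would produce for the Applications example above.
config = {
    "emr": {
        "Applications": [
            {"Name": "Spark", "Version": "2.2.0"},
            {"Name": "Ganglia"},
        ]
    }
}

client = FakeEmrClient()
response = launch_cluster(client, config)
print(client.received["Applications"])
```

A real call would also need at least Name and Instances, which run_job_flow requires.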
Traditionally, we would have written this in Python code and ended up with non-reusable code, since different applications have different needs.
Now everything is configurable without requiring any code updates.
Instance definitions must be specified under a region key.
Notice below that the instance definitions are given under the region key us-east-1, which ensures that the cluster is created in the us-east-1 region.
    emr:
      us-east-1:
        Instances:
          InstanceGroups:
            - Name: Master-instance-group-1
              Market: ON_DEMAND
              InstanceRole: MASTER
              InstanceType: *MasterInstanceType
              InstanceCount: *MasterInstanceCount
            - Name: Core-instance-group-2
              Market: SPOT
              InstanceRole: CORE
              BidPrice: *BidPrice
              InstanceType: *SpotInstanceType
              InstanceCount: *SpotInstanceCount
      JobFlowRole: EMR_EC2_DefaultRole
      ServiceRole: EMR_DefaultRole
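A sketch of how a region key like this could be resolved is shown below. It is an illustration under assumptions, not the script's implementation: `split_region` and `REGION_PATTERN` are hypothetical, region keys are recognised purely by their shape, and the instance type and count are placeholder values standing in for the resolved YAML aliases.

```python
import re

# Assumption: a key shaped like "us-east-1" or "ap-southeast-2" is a region key.
REGION_PATTERN = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d$")


def split_region(emr_config):
    """Return (region, job_flow_args) from the mapping under the `emr` key."""
    region = None
    job_flow_args = {}
    for key, value in emr_config.items():
        if REGION_PATTERN.match(key):
            if region is not None:
                raise ValueError("more than one region key found")
            region = key
            # Lift the region-specific definitions (e.g. Instances) to the top level.
            job_flow_args.update(value)
        else:
            job_flow_args[key] = value
    return region, job_flow_args


# Parsed form of the example above, with placeholder values for the aliases.
emr_config = {
    "us-east-1": {
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master-instance-group-1",
                    "Market": "ON_DEMAND",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m4.large",
                    "InstanceCount": 1,
                }
            ]
        }
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

region, job_flow_args = split_region(emr_config)
print(region, sorted(job_flow_args))
```

The returned region could then be used to create the client, for example with `boto3.client("emr", region_name=region)`. Keys placed beside the region block and keys placed inside it both end up as top-level run_job_flow arguments.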