
Configuration As Code

Harshad Ranganathan edited this page Jan 21, 2020 · 3 revisions

We use YAML to define the EMR job flow definitions. YAML is designed to be human-readable and is a superset of JSON.

You can refer to this guide to learn YAML.

The YAML configuration files follow the run_job_flow request syntax in the boto3 library documentation - https://boto3.readthedocs.io/en/latest/reference/services/emr.html#EMR.Client.run_job_flow

All job flow definitions need to be specified under the emr key.

For flexibility and reuse, put the job flow definitions that are the same across all environments in the default.yaml configuration file, and put any environment-specific definitions in separate configuration files.

You then pass these configuration files to the script, which merges their properties and creates the EMR cluster from the consolidated definition.
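
The merge step can be pictured as a recursive dictionary merge. The sketch below is only an illustration under assumptions: the helper name `deep_merge` is hypothetical, and the rule that environment-specific values override defaults (with lists replaced wholesale) is assumed, not taken from the script itself. The dictionaries stand in for the parsed contents of default.yaml and an environment file, and all of their values are placeholders.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where values from `override` win over `base`.

    Nested dicts are merged key by key; any other value (including lists)
    in `override` replaces the value in `base`. This merge rule is an
    assumption made for illustration.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Stand-in for the parsed default.yaml (shared across environments).
default_config = {
    "emr": {
        "ReleaseLabel": "emr-5.11.0",  # placeholder value
        "Applications": [{"Name": "Spark"}, {"Name": "Ganglia"}],
        "Tags": [{"Key": "team", "Value": "data"}],
    }
}

# Stand-in for a hypothetical environment-specific file.
dev_config = {
    "emr": {
        "Name": "dev-cluster",
        "LogUri": "s3://example-bucket/logs/",
    }
}

# Shared defaults plus the environment-specific additions.
consolidated = deep_merge(default_config, dev_config)
```

Because the merge returns a new dictionary, the same defaults can be combined with several environment files without one run affecting the next.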

We currently support only the job flow definitions listed below -

1. EMR Name
2. Log Uri
3. Release Label
4. Instances
5. Applications
6. BootstrapActions
7. Configurations
8. JobFlowRole
9. ServiceRole
10. Tags
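
A quick way to catch unsupported definitions is to compare the keys of a job flow definition against this list. The following is a minimal sketch, not part of the script: the function name `unsupported_keys` is hypothetical, and it assumes that the ten items above map to the boto3 parameter names shown in `SUPPORTED_KEYS` and that any region key has already been resolved.

```python
# Assumed boto3 run_job_flow parameter names for the ten supported definitions.
SUPPORTED_KEYS = {
    "Name", "LogUri", "ReleaseLabel", "Instances", "Applications",
    "BootstrapActions", "Configurations", "JobFlowRole", "ServiceRole", "Tags",
}


def unsupported_keys(emr_config):
    """Return the sorted top-level keys that are not in SUPPORTED_KEYS."""
    return sorted(set(emr_config) - SUPPORTED_KEYS)


emr_config = {
    "Name": "example-cluster",
    "ReleaseLabel": "emr-5.11.0",  # placeholder value
    "Steps": [],  # a valid run_job_flow parameter, but not in the list above
}

print(unsupported_keys(emr_config))  # prints ['Steps']
```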

Sample Config Definition

For example, consider a simple case where we want to define the Applications for the cluster. As per the boto3 reference guide, below is the request syntax -

Applications=[
    {
        'Name': 'string',
        'Version': 'string',
        'Args': [
            'string',
        ],
        'AdditionalInfo': {
            'string': 'string'
        }
    },
]

We define the above in our YAML config as follows -

emr:
  Applications:
    - Name: Spark
      Version: 2.2.0
    - Name: Ganglia

Notice that the keys Applications, Name and Version are named exactly as in the boto3 reference guide.

Traditionally, we would have written this in Python code and ended up with non-reusable code, since different applications have different needs.

Now everything is configurable, without requiring any code updates.
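
Because the key names match, the parsed definitions can be handed to boto3 as keyword arguments without translation. The sketch below shows the idea under assumptions: the function names are hypothetical, and the `config` dictionary is written by hand to stand in for what a YAML loader would return for the sample above. A real call to run_job_flow also needs at least Name and Instances, plus AWS credentials, so `create_cluster` is defined but not called.

```python
def build_run_job_flow_kwargs(config):
    """Return a copy of the definitions under the `emr` key as keyword arguments."""
    return dict(config["emr"])


def create_cluster(config, region_name):
    """Create the cluster; needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("emr", region_name=region_name)
    return client.run_job_flow(**build_run_job_flow_kwargs(config))


# Hand-written stand-in for the parsed sample. The unquoted value 2.2.0 is
# not a valid YAML number, so a YAML loader returns it as a string.
config = {
    "emr": {
        "Applications": [
            {"Name": "Spark", "Version": "2.2.0"},
            {"Name": "Ganglia"},
        ]
    }
}

kwargs = build_run_job_flow_kwargs(config)
```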

Instance Definitions

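Instance definitions need to be specified under a region key.

Notice below that the instance definitions are given under the region key us-east-1, which ensures that the cluster is created in the us-east-1 region. Values such as *MasterInstanceType are YAML aliases; each one refers to an anchor (for example &MasterInstanceType) that must be defined earlier in the same file.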

emr:
    us-east-1:
      Instances:
        InstanceGroups:
          - Name: Master-instance-group-1
            Market: ON_DEMAND
            InstanceRole: MASTER
            InstanceType: *MasterInstanceType
            InstanceCount: *MasterInstanceCount
          - Name: Core-instance-group-2
            Market: SPOT
            InstanceRole: CORE
            BidPrice: *BidPrice
            InstanceType: *SpotInstanceType
            InstanceCount: *SpotInstanceCount
      JobFlowRole: EMR_EC2_DefaultRole
      ServiceRole: EMR_DefaultRole
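
One way to picture how a region key could be resolved is to lift the definitions stored under the chosen region to the top level of the job flow definition. This is a sketch under assumptions, not the script's actual logic: `resolve_region` and `REGION_PATTERN` are hypothetical names, region keys are recognised by a simple pattern, and the instance type and count are placeholders for the values the aliases would supply.

```python
import re

# Matches names such as us-east-1, eu-west-1 or us-gov-west-1.
REGION_PATTERN = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d+$")


def resolve_region(emr_config, region):
    """Flatten `emr_config` for one region.

    Keys that look like region names are dropped, and the definitions
    stored under `region` are lifted to the top level.
    """
    flattened = {
        key: value
        for key, value in emr_config.items()
        if not REGION_PATTERN.match(key)
    }
    flattened.update(emr_config.get(region, {}))
    return flattened


emr_config = {
    "Applications": [{"Name": "Spark"}],
    "us-east-1": {
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master-instance-group-1",
                    "Market": "ON_DEMAND",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m4.large",  # placeholder for *MasterInstanceType
                    "InstanceCount": 1,  # placeholder for *MasterInstanceCount
                }
            ]
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    },
}

flattened = resolve_region(emr_config, "us-east-1")
```

The same region name would then be passed to the EMR client so that the cluster is created in that region.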