Skip to content

Ansible module to start/stop/restart ec2 instances#6858

Open
AmitPhulera wants to merge 18 commits intomasterfrom
ap/ansible-ec2-restart-module
Open

Ansible module to start/stop/restart ec2 instances#6858
AmitPhulera wants to merge 18 commits intomasterfrom
ap/ansible-ec2-restart-module

Conversation

@AmitPhulera
Copy link
Copy Markdown
Contributor

https://dimagi.atlassian.net/browse/SAAS-19382

As part of our ongoing efforts to automate the restarts, we would need ansible to have a module to start/stop ec2 services. Having it in ansible as opposed to python scripts would give us the advantage of using it directly into the playbooks instead of calling a python command from ansible.

I have had a boilorplate from where claude started the implementation. After that it asked couple of good questions to reach the final implementation.

I have kept spec and implementation plan in first and second commit but will remove it before merging the PR..

Environments Affected

All

AmitPhulera and others added 16 commits April 15, 2026 17:07
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… chars)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds _describe_instances, _format_instance, _describe_and_format, and
_do_describe helpers; wires state == 'described' dispatch in main().
Also adds FakeEC2Client test helper and TestDescribed (4 tests); fixes
a one-off typo in the order-preservation test ID (18 hex chars -> 17).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…iter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…iter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arning

- _result_or_emit: accept optional raw_by_id param; refresh=False path
  now uses caller-supplied raw_by_id instead of a second describe call
- _do_start / _do_stop check_mode branches: build and pass raw_by_id
- _do_restart: move module.warn below stop phase and gate it on
  stop_payload['changed'] so it only fires when stop actually ran
- _do_restart: reuse start_payload['instances'] instead of a third
  _describe_and_format call; patch previous_state from stop before_states
- test: add diff assertion to test_restarted_running_does_stop_then_start

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…+wait=False

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AmitPhulera AmitPhulera force-pushed the ap/ansible-ec2-restart-module branch from 2db5b86 to e565b50 Compare April 15, 2026 12:16
Copy link
Copy Markdown
Contributor

@gherceg gherceg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pair reviewed with @millerdev. We reviewed commits before 15354b0

target region are picked up from the standard boto3 credential chain;
in the commcare-cloud workflow the AWS_PROFILE and AWS_REGION
environment variables are exported automatically before ansible runs.
- The restarted state is implemented as stop-then-start. The stop phase
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason to issue a stop-then-start as opposed to a "reboot"? My understanding is that a reboot is faster since it doesn't put the VM on new hardware, but I know there are times where we want to do a full stop-then-start.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will essentially replace our restart processes which requires us to start and stop the instances rather than issuing reboot. This is because it is required control for APN and also the reason you mentioned of probably getting a better hardware.
I think we can add a ec2_reboot that does that but as now control machine doesn't have permission to do that. So I can probably create a followup PR with the restart permissions and updating the module?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool that makes sense. No need to worry about that followup PR until we find that we need it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is worth renaming this state to "stopped and started" just to be clear? This avoids needing to distinguish between restarted and rebooted in the future, and makes it clear what this state is doing without needing to read docs.

Comment on lines +31 to +35
instance_ids:
description: List of EC2 instance IDs to act on.
required: true
type: list
elements: str
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to make this more flexible? Could it accept a name for "group" or "hostname" and translate that to the instance id(s) under the hood?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible but this is just the module and there will be a wrapper command that will call the module which will do the translation. What I am envisioning is a command like

cchq <env> ec2_stop <instance_id|host|host_group>

Do you think that should be enough or do you think there is a requirement that would need us to include the group and hostname in the module itself?

fwiw there is an example on how to do use host or host_group when you try to call it from a playbook.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems ideal to put all logic related to translating groups/hosts to instance ids inside the module. Do we need the ability to pass in instance ids to this module, or would ansible names (groups/hosts) be sufficient?

def _get_ec2_client(region):
"""Return a boto3 EC2 client. Defined as a module-level function so tests can patch it."""
try:
import boto3
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this typically installed with ansible or is it a special requirement for this module?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a special requirement for the module but given this module will always be called from cchq venv boto3 should be present.

boto3 is imported in this way that it can be independently called as well and should elegantly exit if dependency is not there.

LIBRARY_DIR = os.path.abspath(os.path.join(
os.path.dirname(__file__), '..', 'src', 'commcare_cloud', 'ansible', 'library'
))
TESTS_DIR = os.path.abspath(os.path.dirname(__file__))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can you define TESTS_DIR before LIBRARY_DIR and reference TESTS_DIR in the construction of LIBRARY_DIR?

Comment on lines +96 to +97
self.assertTrue(result['failed'])
self.assertIn('region', result['result']['msg'].lower())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use pytest simple assertions (assert result['failed'], assert 'region' in result...)

'instance_ids': ['i-0123456789abcdef0'],
'state': 'bogus',
})
self.assertTrue(result['failed'])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also assert that the message is as expected similar to the other tests.

Comment on lines +187 to +188
if raw is None:
module.fail_json(msg="Instance {} missing from DescribeInstances response.".format(iid))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to fail just for this instance rather than the whole group?

}


def _describe_and_format(client, instance_ids, module):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you rename "format" in this context to avoid conflicting with the _format_instance helper defined above? We expected to see that called in here.

Comment on lines +274 to +283
def test_described_terminated_is_reported_not_failed(self):
fake = FakeEC2Client(instances_by_id={
'i-0123456789abcdef0': {'state': 'terminated'},
})
result = run_module(
{'instance_ids': ['i-0123456789abcdef0'], 'state': 'described'},
fake_client=fake,
)
self.assertFalse(result['failed'])
self.assertEqual(result['result']['instances'][0]['current_state'], 'terminated')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this test? We don't validate the described response so why would we expect terminated to be special?

Copy link
Copy Markdown
Contributor

@gherceg gherceg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paired again with @millerdev. This took a long time to review, partially because we spent time asking questions on earlier commits that were addressed later.

I think we'd be most interested in seeing this module use a cleaner architecture for encapsulating instance state.

raw_by_id = _describe_instances(client, instance_ids)
except ClientError as e:
module.fail_json(msg="AWS DescribeInstances failed: {}".format(e))
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return
return None, None

Comment on lines +197 to +206
def _check_no_terminal(formatted, action_state, module):
"""Fail the module if any instance is terminated/shutting-down."""
bad = [(iid, state) for (iid, _raw, state) in formatted if state in TERMINAL_STATES]
if bad:
module.fail_json(msg=(
"Cannot {} terminated/shutting-down instances: {}".format(
action_state, ', '.join('{}={}'.format(i, s) for i, s in bad)
)
))
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment applies to more than just this function, but it is hard to follow the formatted list of tuples getting passed around. The name does not adequately describe what is contained in the data structure. It would be nicer to pass around a list of Instance objects. Having an Instance class could also simplify this code. For example, _describe_instances would then return a dictionary of id: Instance(...) items, and the Instance class would provide attributes to obtain whatever info you need.

Comment on lines +233 to +238
if refresh:
formatted, after_states = _describe_and_format(client, instance_ids, module)
raw_by_id = {iid: raw for (iid, raw, _state) in formatted}
else:
formatted, _ = _describe_and_format(client, instance_ids, module)
raw_by_id = {iid: raw for (iid, raw, _state) in formatted}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make sense. Regardless of refresh, we call _describe_and_format which calls _describe_instances again. So refresh=False isn't going to be faster, it is just going to call describe again but ignore the after_states.

This appears to be AI generated code that shouldn't be part of a PR up for review.

return formatted, states # list of (id, raw, state); dict id->state


def _check_no_terminal(formatted, action_state, module):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _check_no_terminal(formatted, action_state, module):
def _check_not_terminated(formatted, action_state, module):

We are aware that this checks for both terminated and shutting down states, but this name is easier to understand at a glance.

Comment on lines +258 to +259
targets_needing_start = [iid for (iid, _raw, state) in formatted
if state in ('stopped', 'stopping')]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth adding constants for these states? This comment applies throughout the PR.

Comment thread tests/test_ec2_instance_state.py Outdated
Comment on lines +62 to +69
patch.object(ec2_instance_state.AnsibleModule, 'exit_json', side_effect=fake_exit), \
patch.object(ec2_instance_state.AnsibleModule, 'fail_json', side_effect=fake_fail):
patch.object(ec2_instance_state.AnsibleModule, 'exit_json', new=fake_exit), \
patch.object(ec2_instance_state.AnsibleModule, 'fail_json', new=fake_fail), \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason for changing from side_effect to new?

def _result_or_emit(module, client, instance_ids, before_states, after_states,
state, changed, skipped, refresh, emit):
"""Build the result payload. If emit is True, call exit_json; else return the dict."""
state, changed, skipped, refresh, emit, raw_by_id=None):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of parameters are being added to this function and it is getting hard to reason about.

Comment on lines +353 to +369
if not wait:
# Phase 1: stop, always wait (required because StartInstances rejects
# still-stopping instances; we must let stop complete before starting).
stop_payload = _do_stop(module, client, instance_ids, wait=True,
timeout=timeout, emit=False)

if not wait and stop_payload['changed']:
module.warn(
"wait=False ignored for the stop phase of 'restarted'; "
"wait=False was overridden to wait=True for the stop phase of 'restarted'; "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We think it would be simpler to not warn and instead document that restart always waits for instances to stop.

self.assertIn('terminated', result['result']['msg'].lower())


class TestCheckMode(unittest.TestCase):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should assert that waiter is not invoked when expected too.

self.assertNotIn('stop_instances', [c[0] for c in fake.calls])


class TestInvalidIdMutating(unittest.TestCase):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a new test class? Should each command have this test?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants