Ansible module to start/stop/restart ec2 instances by AmitPhulera · Pull Request #6858 · dimagi/commcare-cloud

AmitPhulera · 2026-04-14T12:02:31Z

https://dimagi.atlassian.net/browse/SAAS-19382

As part of our ongoing efforts to automate restarts, we need an Ansible module to start/stop EC2 instances. Having it in Ansible rather than in Python scripts lets us use it directly in playbooks instead of calling a Python command from Ansible.

I started from a boilerplate, which Claude used as the basis for the implementation. It then asked a couple of good questions to reach the final implementation.

I have kept the spec and implementation plan in the first and second commits but will remove them before merging the PR.

Environments Affected

All

Adds _describe_instances, _format_instance, _describe_and_format, and _do_describe helpers; wires state == 'described' dispatch in main(). Also adds FakeEC2Client test helper and TestDescribed (4 tests); fixes a one-off typo in the order-preservation test ID (18 hex chars -> 17). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…arning

- _result_or_emit: accept optional raw_by_id param; refresh=False path now uses caller-supplied raw_by_id instead of a second describe call
- _do_start / _do_stop check_mode branches: build and pass raw_by_id
- _do_restart: move module.warn below stop phase and gate it on stop_payload['changed'] so it only fires when stop actually ran
- _do_restart: reuse start_payload['instances'] instead of a third _describe_and_format call; patch previous_state from stop before_states
- test: add diff assertion to test_restarted_running_does_stop_then_start

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gherceg

Pair reviewed with @millerdev. We reviewed commits before 15354b0

gherceg · 2026-04-21T18:10:26Z

+      target region are picked up from the standard boto3 credential chain;
+      in the commcare-cloud workflow the AWS_PROFILE and AWS_REGION
+      environment variables are exported automatically before ansible runs.
+    - The restarted state is implemented as stop-then-start. The stop phase


Any particular reason to issue a stop-then-start as opposed to a "reboot"? My understanding is that a reboot is faster since it doesn't put the VM on new hardware, but I know there are times where we want to do a full stop-then-start.

This will essentially replace our restart process, which requires us to stop and start the instances rather than issue a reboot. That is a required control for APN, and there is also the reason you mentioned of possibly landing on better hardware.
I think we can add an ec2_reboot that does that, but as of now the control machine doesn't have permission to do it. So I can probably create a follow-up PR that adds the restart permissions and updates the module?

Cool that makes sense. No need to worry about that followup PR until we find that we need it.

Do you think it is worth renaming this state to "stopped and started" just to be clear? This avoids needing to distinguish between restarted and rebooted in the future, and makes it clear what this state is doing without needing to read docs.

gherceg · 2026-04-21T18:16:01Z

+    instance_ids:
+        description: List of EC2 instance IDs to act on.
+        required: true
+        type: list
+        elements: str


Is it possible to make this more flexible? Could it accept a name for "group" or "hostname" and translate that to the instance id(s) under the hood?

It is possible, but this is just the module; there will be a wrapper command that calls the module and does the translation. What I am envisioning is a command like

cchq <env> ec2_stop <instance_id|host|host_group>

Do you think that should be enough or do you think there is a requirement that would need us to include the group and hostname in the module itself?

FWIW, there is an example of how to use a host or host_group when calling it from a playbook.

It seems ideal to put all logic related to translating groups/hosts to instance ids inside the module. Do we need the ability to pass in instance ids to this module, or would ansible names (groups/hosts) be sufficient?

gherceg · 2026-04-21T18:21:57Z

+def _get_ec2_client(region):
+    """Return a boto3 EC2 client. Defined as a module-level function so tests can patch it."""
+    try:
+        import boto3


Is this typically installed with ansible or is it a special requirement for this module?

This is a special requirement for the module, but given that this module will always be called from the cchq venv, boto3 should be present.

boto3 is imported this way so that the module can also be called independently and exits gracefully if the dependency is not there.
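A rough sketch of the deferred-import pattern described here, for illustration only. `load_boto3_or_fail` and its `importer` parameter are hypothetical names, and `fail_json` stands in for `AnsibleModule.fail_json`; the module's actual helper may differ.

```python
import importlib


def load_boto3_or_fail(fail_json, importer=importlib.import_module):
    """Import boto3 lazily and report a clean failure if it is missing.

    `fail_json` is any callable accepting a `msg` keyword (a stand-in for
    AnsibleModule.fail_json). `importer` is injectable so a test can
    simulate a missing dependency without uninstalling anything.
    """
    try:
        return importer("boto3")
    except ImportError as exc:
        fail_json(msg="boto3 is required for this module: {}".format(exc))
        return None
```

Keeping the import inside a function means the file can still be loaded (for docs, linting, or tests) on a machine without boto3.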

gherceg · 2026-04-21T18:28:01Z

+LIBRARY_DIR = os.path.abspath(os.path.join(
+    os.path.dirname(__file__), '..', 'src', 'commcare_cloud', 'ansible', 'library'
+))
+TESTS_DIR = os.path.abspath(os.path.dirname(__file__))


nit: Can you define TESTS_DIR before LIBRARY_DIR and reference TESTS_DIR in the construction of LIBRARY_DIR?
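One way the reordering could look, sketched as a small helper so it can be exercised without relying on `__file__`. `build_paths` is a hypothetical name; in the test file itself these would presumably stay as two module-level constants.

```python
import os


def build_paths(test_file):
    """Derive TESTS_DIR first, then build LIBRARY_DIR relative to it."""
    tests_dir = os.path.abspath(os.path.dirname(test_file))
    library_dir = os.path.abspath(os.path.join(
        tests_dir, '..', 'src', 'commcare_cloud', 'ansible', 'library'
    ))
    return tests_dir, library_dir
```

In the test module this would be called as `TESTS_DIR, LIBRARY_DIR = build_paths(__file__)`.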

gherceg · 2026-04-21T18:40:28Z

+        self.assertTrue(result['failed'])
+        self.assertIn('region', result['result']['msg'].lower())


nit: use pytest simple assertions (assert result['failed'], assert 'region' in result...)
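For reference, the two unittest-style assertions above map to plain `assert` statements roughly as follows. The `sample` payload is made up for illustration and only mirrors the shape used in the test.

```python
def check_missing_region_result(result):
    """Plain-assert equivalents of assertTrue / assertIn."""
    assert result['failed']
    assert 'region' in result['result']['msg'].lower()


# Hypothetical result payload, shaped like the one asserted on in the diff.
sample = {'failed': True, 'result': {'msg': 'Region must be provided'}}
check_missing_region_result(sample)
```

pytest rewrites bare `assert` statements, so a failure still shows the compared values.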

gherceg · 2026-04-21T18:41:17Z

+            'instance_ids': ['i-0123456789abcdef0'],
+            'state': 'bogus',
+        })
+        self.assertTrue(result['failed'])


This should also assert that the message is as expected similar to the other tests.

gherceg · 2026-04-21T18:48:45Z

+        if raw is None:
+            module.fail_json(msg="Instance {} missing from DescribeInstances response.".format(iid))


Is it possible to fail just for this instance rather than the whole group?
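A possible shape for per-instance failure reporting, as a sketch only. `per_instance_results` and the result keys are hypothetical; the input is assumed to map instance IDs to raw DescribeInstances entries (each with `State.Name`), as `raw_by_id` appears to do in the PR.

```python
def per_instance_results(instance_ids, raw_by_id):
    """Build one result entry per requested ID instead of aborting on the first gap."""
    results = []
    for iid in instance_ids:
        raw = raw_by_id.get(iid)
        if raw is None:
            # Record the failure for this instance only and keep going.
            results.append({'id': iid, 'failed': True,
                            'msg': 'missing from DescribeInstances response'})
        else:
            results.append({'id': iid, 'failed': False,
                            'current_state': raw['State']['Name']})
    return results
```

The caller would then decide whether any failed entry should fail the whole task or only be reported.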

gherceg · 2026-04-21T18:50:59Z

+    }
+
+
+def _describe_and_format(client, instance_ids, module):


Could you rename "format" in this context to avoid conflicting with the _format_instance helper defined above? We expected to see that called in here.

gherceg · 2026-04-21T19:11:27Z

+    def test_described_terminated_is_reported_not_failed(self):
+        fake = FakeEC2Client(instances_by_id={
+            'i-0123456789abcdef0': {'state': 'terminated'},
+        })
+        result = run_module(
+            {'instance_ids': ['i-0123456789abcdef0'], 'state': 'described'},
+            fake_client=fake,
+        )
+        self.assertFalse(result['failed'])
+        self.assertEqual(result['result']['instances'][0]['current_state'], 'terminated')


Why do we need this test? We don't validate the described response so why would we expect terminated to be special?

gherceg

Paired again with @millerdev. This took a long time to review, partially because we spent time asking questions on earlier commits that were addressed later.

I think we'd be most interested in seeing this module use a cleaner architecture for encapsulating instance state.

gherceg · 2026-04-22T18:29:12Z

        raw_by_id = _describe_instances(client, instance_ids)
    except ClientError as e:
        module.fail_json(msg="AWS DescribeInstances failed: {}".format(e))
+        return


Suggested change

-        return
+        return None, None

gherceg · 2026-04-22T18:38:16Z

+def _check_no_terminal(formatted, action_state, module):
+    """Fail the module if any instance is terminated/shutting-down."""
+    bad = [(iid, state) for (iid, _raw, state) in formatted if state in TERMINAL_STATES]
+    if bad:
+        module.fail_json(msg=(
+            "Cannot {} terminated/shutting-down instances: {}".format(
+                action_state, ', '.join('{}={}'.format(i, s) for i, s in bad)
+            )
+        ))
+        return


This comment applies to more than just this function, but it is hard to follow the formatted list of tuples getting passed around. The name does not adequately describe what is contained in the data structure. It would be nicer to pass around a list of Instance objects. Having an Instance class could also simplify this code. For example, _describe_instances would then return a dictionary of id: Instance(...) items, and the Instance class would provide attributes to obtain whatever info you need.
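To make the suggestion concrete, here is a minimal sketch of what an `Instance` class might look like. The field names, `instances_by_id`, and `terminal_instances` are assumptions for illustration; only the `Reservations` / `Instances` / `InstanceId` / `State.Name` keys come from the DescribeInstances response shape.

```python
from dataclasses import dataclass, field

TERMINAL_STATES = ('terminated', 'shutting-down')


@dataclass
class Instance:
    id: str
    state: str
    # Raw DescribeInstances entry, kept for anything not modelled above.
    raw: dict = field(default_factory=dict, repr=False)

    @property
    def is_terminal(self):
        return self.state in TERMINAL_STATES


def instances_by_id(response):
    """Index a DescribeInstances-style response as {instance_id: Instance}."""
    result = {}
    for reservation in response.get('Reservations', []):
        for raw in reservation.get('Instances', []):
            inst = Instance(id=raw['InstanceId'],
                            state=raw['State']['Name'], raw=raw)
            result[inst.id] = inst
    return result


def terminal_instances(instances):
    """Return the instances that can no longer be started or stopped."""
    return [inst for inst in instances if inst.is_terminal]
```

With this, a check like `_check_no_terminal` could take a list of `Instance` objects and read `inst.is_terminal` instead of unpacking `(iid, _raw, state)` tuples.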

gherceg · 2026-04-22T18:50:46Z

+    if refresh:
+        formatted, after_states = _describe_and_format(client, instance_ids, module)
+        raw_by_id = {iid: raw for (iid, raw, _state) in formatted}
+    else:
+        formatted, _ = _describe_and_format(client, instance_ids, module)
+        raw_by_id = {iid: raw for (iid, raw, _state) in formatted}


This doesn't make sense. Regardless of refresh, we call _describe_and_format which calls _describe_instances again. So refresh=False isn't going to be faster, it is just going to call describe again but ignore the after_states.

This appears to be AI generated code that shouldn't be part of a PR up for review.
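For contrast, a sketch of what a meaningful `refresh` switch could look like: the describe call only happens on the `refresh=True` path. `states_for_result`, `describe`, and `known_states` are hypothetical names, not the PR's helpers.

```python
def states_for_result(describe, instance_ids, refresh, known_states):
    """Return {id: state}, calling `describe` only when a refresh is requested.

    `describe` is any callable taking a list of IDs and returning {id: state};
    `known_states` is the mapping the caller already holds.
    """
    if refresh:
        return describe(instance_ids)
    # No API call on this path: reuse what the caller already has.
    return {iid: known_states[iid] for iid in instance_ids}
```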

gherceg · 2026-04-22T19:01:01Z

    return formatted, states  # list of (id, raw, state); dict id->state


+def _check_no_terminal(formatted, action_state, module):


Suggested change

-def _check_no_terminal(formatted, action_state, module):
+def _check_not_terminated(formatted, action_state, module):

We are aware that this checks for both terminated and shutting down states, but this name is easier to understand at a glance.

gherceg · 2026-04-22T19:01:51Z

+    targets_needing_start = [iid for (iid, _raw, state) in formatted
+                             if state in ('stopped', 'stopping')]


Would it be worth adding constants for these states? This comment applies throughout the PR.
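A sketch of what such constants might look like. The string values are the EC2 instance state names; the constant names and the `needs_start` helper are hypothetical.

```python
# EC2 instance state names, defined once so call sites avoid string literals.
PENDING = 'pending'
RUNNING = 'running'
STOPPING = 'stopping'
STOPPED = 'stopped'
SHUTTING_DOWN = 'shutting-down'
TERMINATED = 'terminated'

NEEDS_START_STATES = (STOPPED, STOPPING)


def needs_start(state):
    """True if an instance in `state` still has to be started."""
    return state in NEEDS_START_STATES
```

The comprehension in the diff could then read `if needs_start(state)`.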

gherceg · 2026-04-22T19:56:40Z

-         patch.object(ec2_instance_state.AnsibleModule, 'exit_json', side_effect=fake_exit), \
-         patch.object(ec2_instance_state.AnsibleModule, 'fail_json', side_effect=fake_fail):
+         patch.object(ec2_instance_state.AnsibleModule, 'exit_json', new=fake_exit), \
+         patch.object(ec2_instance_state.AnsibleModule, 'fail_json', new=fake_fail), \


What was the reason for changing from side_effect to new?
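The thread does not record the author's answer. One behavioural difference that may be relevant, shown with a toy `Module` class rather than `AnsibleModule`: with `new=`, the replacement function becomes a normal method and receives `self`; with `side_effect=`, the attribute is a `MagicMock`, which is not bound to the instance, so the side-effect callable is invoked without `self`.

```python
from unittest.mock import patch


class Module:
    """Toy stand-in for AnsibleModule (illustration only)."""

    def exit_json(self, **kwargs):
        raise SystemExit(0)


def demo():
    seen = {}

    def fake_with_self(self, **kwargs):
        # Installed via new=: behaves like a method, so `self` is passed.
        seen['new'] = (type(self).__name__, kwargs)

    def fake_without_self(**kwargs):
        # Invoked via side_effect=: the MagicMock is not bound, no `self`.
        seen['side_effect'] = kwargs

    with patch.object(Module, 'exit_json', new=fake_with_self):
        Module().exit_json(changed=True)
    with patch.object(Module, 'exit_json', side_effect=fake_without_self):
        Module().exit_json(changed=False)
    return seen
```

So a fake that needs access to the module instance would have to be installed with `new=`; whether that was the motivation here is a guess.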

gherceg · 2026-04-22T20:11:00Z

 def _result_or_emit(module, client, instance_ids, before_states, after_states,
-                    state, changed, skipped, refresh, emit):
-    """Build the result payload. If emit is True, call exit_json; else return the dict."""
+                    state, changed, skipped, refresh, emit, raw_by_id=None):


A lot of parameters are being added to this function and it is getting hard to reason about.

gherceg · 2026-04-22T20:13:32Z

-    if not wait:
+    # Phase 1: stop, always wait (required because StartInstances rejects
+    # still-stopping instances; we must let stop complete before starting).
+    stop_payload = _do_stop(module, client, instance_ids, wait=True,
+                            timeout=timeout, emit=False)
+
+    if not wait and stop_payload['changed']:
        module.warn(
-            "wait=False ignored for the stop phase of 'restarted'; "
+            "wait=False was overridden to wait=True for the stop phase of 'restarted'; "


We think it would be simpler to not warn and instead document that restart always waits for instances to stop.
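A minimal sketch of the simpler behaviour suggested here, assuming a boto3-style EC2 client: the stop phase always waits and no warning is emitted, while `wait` only governs the start phase. `restart_instances` is a hypothetical function, not the module's `_do_restart`, and it omits idempotency checks and timeouts.

```python
def restart_instances(client, instance_ids, wait=True):
    """Stop, wait for 'stopped', then start.

    The stop phase always waits, because an instance that is still
    stopping cannot be started. `wait` applies to the start phase only.
    """
    client.stop_instances(InstanceIds=instance_ids)
    client.get_waiter('instance_stopped').wait(InstanceIds=instance_ids)
    client.start_instances(InstanceIds=instance_ids)
    if wait:
        client.get_waiter('instance_running').wait(InstanceIds=instance_ids)
```

The always-wait behaviour would then be stated once in the module documentation instead of surfacing as a runtime warning.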

gherceg · 2026-04-22T20:17:54Z

        self.assertIn('terminated', result['result']['msg'].lower())


+class TestCheckMode(unittest.TestCase):


Should assert that waiter is not invoked when expected too.
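A sketch of what the extra assertion could look like, assuming the fake client records each call as a tuple whose first element is the method name (as `fake.calls` is used elsewhere in the tests). `RecordingClient` and `assert_no_mutation_or_wait` are hypothetical names.

```python
class RecordingClient:
    """Hypothetical stand-in for the test's fake client: records calls."""

    def __init__(self, calls=None):
        self.calls = list(calls or [])


def assert_no_mutation_or_wait(fake):
    """Check mode should neither change instance state nor request a waiter."""
    names = [c[0] for c in fake.calls]
    assert 'stop_instances' not in names
    assert 'start_instances' not in names
    assert 'get_waiter' not in names
```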

gherceg · 2026-04-22T20:19:41Z

+        self.assertNotIn('stop_instances', [c[0] for c in fake.calls])
+
+
+class TestInvalidIdMutating(unittest.TestCase):


Does this need to be a new test class? Should each command have this test?

AmitPhulera added 2 commits April 13, 2026 15:27

claude superpower spec for the module implementation

9750a91

claude superpower implmentation plan

b6b0009

AmitPhulera requested review from dannyroberts, gherceg and millerdev April 14, 2026 12:02

AmitPhulera and others added 16 commits April 15, 2026 17:07

Add ec2_instance_state module skeleton with arg validation

d9e5225

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tighten INSTANCE_ID_RE to accept only AWS-issued lengths (8 or 17 hex…

eeae26b

… chars) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ec2_instance_state: minor cleanups from Task 2 review

15354b0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ec2_instance_state: implement 'started' state with idempotency and wa…

e6caf73

…iter Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ec2_instance_state: fail fast on stopping+wait=False start

3b8c9e5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ec2_instance_state: implement 'stopped' state with idempotency and wa…

b9ef994

…iter Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ec2_instance_state: clarify stop pending semantics + add pending-wait…

48b7ae4

… test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ec2_instance_state: implement 'restarted' as stop-then-start

241a427

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ec2_instance_state: tests for check mode, invalid IDs, waiter timeout

35ccff9

ec2_instance_state: tighten test assertions from Task 6 review

534a2be

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ec2_instance_state: fast-path noop describe; always warn on restarted…

7a4e6bd

…+wait=False Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

remove import from top level as boto may not be installed

17028b8

remove claude superpower docs

3a629c9

lint

e565b50

AmitPhulera force-pushed the ap/ansible-ec2-restart-module branch from 2db5b86 to e565b50 April 15, 2026 12:16

gherceg reviewed Apr 21, 2026

gherceg reviewed Apr 22, 2026
