Ansible module to start/stop/restart ec2 instances#6858
Ansible module to start/stop/restart ec2 instances#6858AmitPhulera wants to merge 18 commits intomasterfrom
Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… chars) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds _describe_instances, _format_instance, _describe_and_format, and _do_describe helpers; wires state == 'described' dispatch in main(). Also adds FakeEC2Client test helper and TestDescribed (4 tests); fixes a one-off typo in the order-preservation test ID (18 hex chars -> 17). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…iter Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…iter Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arning - _result_or_emit: accept optional raw_by_id param; refresh=False path now uses caller-supplied raw_by_id instead of a second describe call - _do_start / _do_stop check_mode branches: build and pass raw_by_id - _do_restart: move module.warn below stop phase and gate it on stop_payload['changed'] so it only fires when stop actually ran - _do_restart: reuse start_payload['instances'] instead of a third _describe_and_format call; patch previous_state from stop before_states - test: add diff assertion to test_restarted_running_does_stop_then_start Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…+wait=False Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2db5b86 to
e565b50
Compare
gherceg
left a comment
There was a problem hiding this comment.
Pair reviewed with @millerdev. We reviewed commits before 15354b0
| target region are picked up from the standard boto3 credential chain; | ||
| in the commcare-cloud workflow the AWS_PROFILE and AWS_REGION | ||
| environment variables are exported automatically before ansible runs. | ||
| - The restarted state is implemented as stop-then-start. The stop phase |
There was a problem hiding this comment.
Any particular reason to issue a stop-then-start as opposed to a "reboot"? My understanding is that a reboot is faster since it doesn't put the VM on new hardware, but I know there are times where we want to do a full stop-then-start.
There was a problem hiding this comment.
This will essentially replace our restart processes which requires us to start and stop the instances rather than issuing reboot. This is because it is required control for APN and also the reason you mentioned of probably getting a better hardware.
I think we can add a ec2_reboot that does that but as now control machine doesn't have permission to do that. So I can probably create a followup PR with the restart permissions and updating the module?
There was a problem hiding this comment.
Cool that makes sense. No need to worry about that followup PR until we find that we need it.
There was a problem hiding this comment.
Do you think it is worth renaming this state to "stopped and started" just to be clear? This avoids needing to distinguish between restarted and rebooted in the future, and makes it clear what this state is doing without needing to read docs.
| instance_ids: | ||
| description: List of EC2 instance IDs to act on. | ||
| required: true | ||
| type: list | ||
| elements: str |
There was a problem hiding this comment.
Is it possible to make this more flexible? Could it accept a name for "group" or "hostname" and translate that to the instance id(s) under the hood?
There was a problem hiding this comment.
It is possible but this is just the module and there will be a wrapper command that will call the module which will do the translation. What I am envisioning is a command like
cchq <env> ec2_stop <instance_id|host|host_group>
Do you think that should be enough or do you think there is a requirement that would need us to include the group and hostname in the module itself?
fwiw there is an example on how to do use host or host_group when you try to call it from a playbook.
There was a problem hiding this comment.
It seems ideal to put all logic related to translating groups/hosts to instance ids inside the module. Do we need the ability to pass in instance ids to this module, or would ansible names (groups/hosts) be sufficient?
| def _get_ec2_client(region): | ||
| """Return a boto3 EC2 client. Defined as a module-level function so tests can patch it.""" | ||
| try: | ||
| import boto3 |
There was a problem hiding this comment.
Is this typically installed with ansible or is it a special requirement for this module?
There was a problem hiding this comment.
This is a special requirement for the module but given this module will always be called from cchq venv boto3 should be present.
boto3 is imported in this way that it can be independently called as well and should elegantly exit if dependency is not there.
| LIBRARY_DIR = os.path.abspath(os.path.join( | ||
| os.path.dirname(__file__), '..', 'src', 'commcare_cloud', 'ansible', 'library' | ||
| )) | ||
| TESTS_DIR = os.path.abspath(os.path.dirname(__file__)) |
There was a problem hiding this comment.
nit: Can you define TESTS_DIR before LIBRARY_DIR and reference TESTS_DIR in the construction of LIBRARY_DIR?
| self.assertTrue(result['failed']) | ||
| self.assertIn('region', result['result']['msg'].lower()) |
There was a problem hiding this comment.
nit: use pytest simple assertions (assert result['failed'], assert 'region' in result...)
| 'instance_ids': ['i-0123456789abcdef0'], | ||
| 'state': 'bogus', | ||
| }) | ||
| self.assertTrue(result['failed']) |
There was a problem hiding this comment.
This should also assert that the message is as expected similar to the other tests.
| if raw is None: | ||
| module.fail_json(msg="Instance {} missing from DescribeInstances response.".format(iid)) |
There was a problem hiding this comment.
Is it possible to fail just for this instance rather than the whole group?
| } | ||
|
|
||
|
|
||
| def _describe_and_format(client, instance_ids, module): |
There was a problem hiding this comment.
Could you rename "format" in this context to avoid conflicting with the _format_instance helper defined above? We expected to see that called in here.
| def test_described_terminated_is_reported_not_failed(self): | ||
| fake = FakeEC2Client(instances_by_id={ | ||
| 'i-0123456789abcdef0': {'state': 'terminated'}, | ||
| }) | ||
| result = run_module( | ||
| {'instance_ids': ['i-0123456789abcdef0'], 'state': 'described'}, | ||
| fake_client=fake, | ||
| ) | ||
| self.assertFalse(result['failed']) | ||
| self.assertEqual(result['result']['instances'][0]['current_state'], 'terminated') |
There was a problem hiding this comment.
Why do we need this test? We don't validate the described response so why would we expect terminated to be special?
There was a problem hiding this comment.
Paired again with @millerdev. This took a long time to review, partially because we spent time asking questions on earlier commits that were addressed later.
I think we'd be most interested in seeing this module use a cleaner architecture for encapsulating instance state.
| raw_by_id = _describe_instances(client, instance_ids) | ||
| except ClientError as e: | ||
| module.fail_json(msg="AWS DescribeInstances failed: {}".format(e)) | ||
| return |
There was a problem hiding this comment.
| return | |
| return None, None |
| def _check_no_terminal(formatted, action_state, module): | ||
| """Fail the module if any instance is terminated/shutting-down.""" | ||
| bad = [(iid, state) for (iid, _raw, state) in formatted if state in TERMINAL_STATES] | ||
| if bad: | ||
| module.fail_json(msg=( | ||
| "Cannot {} terminated/shutting-down instances: {}".format( | ||
| action_state, ', '.join('{}={}'.format(i, s) for i, s in bad) | ||
| ) | ||
| )) | ||
| return |
There was a problem hiding this comment.
This comment applies to more than just this function, but it is hard to follow the formatted list of tuples getting passed around. The name does not adequately describe what is contained in the data structure. It would be nicer to pass around a list of Instance objects. Having an Instance class could also simplify this code. For example, _describe_instances would then return a dictionary of id: Instance(...) items, and the Instance class would provide attributes to obtain whatever info you need.
| if refresh: | ||
| formatted, after_states = _describe_and_format(client, instance_ids, module) | ||
| raw_by_id = {iid: raw for (iid, raw, _state) in formatted} | ||
| else: | ||
| formatted, _ = _describe_and_format(client, instance_ids, module) | ||
| raw_by_id = {iid: raw for (iid, raw, _state) in formatted} |
There was a problem hiding this comment.
This doesn't make sense. Regardless of refresh, we call _describe_and_format which calls _describe_instances again. So refresh=False isn't going to be faster, it is just going to call describe again but ignore the after_states.
This appears to be AI generated code that shouldn't be part of a PR up for review.
| return formatted, states # list of (id, raw, state); dict id->state | ||
|
|
||
|
|
||
| def _check_no_terminal(formatted, action_state, module): |
There was a problem hiding this comment.
| def _check_no_terminal(formatted, action_state, module): | |
| def _check_not_terminated(formatted, action_state, module): |
We are aware that this checks for both terminated and shutting down states, but this name is easier to understand at a glance.
| targets_needing_start = [iid for (iid, _raw, state) in formatted | ||
| if state in ('stopped', 'stopping')] |
There was a problem hiding this comment.
Would it be worth adding constants for these states? This comment applies throughout the PR.
| patch.object(ec2_instance_state.AnsibleModule, 'exit_json', side_effect=fake_exit), \ | ||
| patch.object(ec2_instance_state.AnsibleModule, 'fail_json', side_effect=fake_fail): | ||
| patch.object(ec2_instance_state.AnsibleModule, 'exit_json', new=fake_exit), \ | ||
| patch.object(ec2_instance_state.AnsibleModule, 'fail_json', new=fake_fail), \ |
There was a problem hiding this comment.
What was the reason for changing from side_effect to new?
| def _result_or_emit(module, client, instance_ids, before_states, after_states, | ||
| state, changed, skipped, refresh, emit): | ||
| """Build the result payload. If emit is True, call exit_json; else return the dict.""" | ||
| state, changed, skipped, refresh, emit, raw_by_id=None): |
There was a problem hiding this comment.
A lot of parameters are being added to this function and it is getting hard to reason about.
| if not wait: | ||
| # Phase 1: stop, always wait (required because StartInstances rejects | ||
| # still-stopping instances; we must let stop complete before starting). | ||
| stop_payload = _do_stop(module, client, instance_ids, wait=True, | ||
| timeout=timeout, emit=False) | ||
|
|
||
| if not wait and stop_payload['changed']: | ||
| module.warn( | ||
| "wait=False ignored for the stop phase of 'restarted'; " | ||
| "wait=False was overridden to wait=True for the stop phase of 'restarted'; " |
There was a problem hiding this comment.
We think it would be simpler to not warn and instead document that restart always waits for instances to stop.
| self.assertIn('terminated', result['result']['msg'].lower()) | ||
|
|
||
|
|
||
| class TestCheckMode(unittest.TestCase): |
There was a problem hiding this comment.
Should assert that waiter is not invoked when expected too.
| self.assertNotIn('stop_instances', [c[0] for c in fake.calls]) | ||
|
|
||
|
|
||
| class TestInvalidIdMutating(unittest.TestCase): |
There was a problem hiding this comment.
Does this need to be a new test class? Should each command have this test?
https://dimagi.atlassian.net/browse/SAAS-19382
As part of our ongoing efforts to automate the restarts, we would need ansible to have a module to start/stop ec2 services. Having it in ansible as opposed to python scripts would give us the advantage of using it directly into the playbooks instead of calling a python command from ansible.
I have had a boilorplate from where claude started the implementation. After that it asked couple of good questions to reach the final implementation.
I have kept spec and implementation plan in first and second commit but will remove it before merging the PR..
Environments Affected
All