Fixing task ID replacement for MNP jobs on AWS Batch #2574
Conversation
```python
self._task_id.replace("control-", "")
+ "-node-$AWS_BATCH_JOB_NODE_INDEX",
)
# Fix: Only replace task ID in specific arguments, not in environment variables
```
Thanks for the fix! @saikonen any suggestions on approaches that would let us avoid leaning on string substitution (quite finicky)?
Haven't touched this implementation in a bit, so maybe there is a reason the task-id patterns are kept in execute(), but at first glance this seems (in the original implementation) like quite a late phase to be doing anything with the command.
Could a better place be before the cmd is joined into a string, where we can operate on the options separately? e.g. batch_cli.py#251 and step_functions.py#925?
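For illustration, a minimal sketch of this list-based approach could look like the following (the function and argument names are made up for the example, not Metaflow's actual API):

```python
def worker_command(base_cmd, control_task_id):
    # base_cmd is a list of tokens, e.g.
    # ["python", "flow.py", "step", "train", "--task-id", "control-3", "--run-id", "3"]
    node_task_id = (
        control_task_id.replace("control-", "") + "-node-$AWS_BATCH_JOB_NODE_INDEX"
    )
    cmd = list(base_cmd)
    for i, tok in enumerate(cmd):
        # Only rewrite the value that directly follows the --task-id flag;
        # other tokens that happen to contain the same digits stay untouched.
        if tok == "--task-id" and i + 1 < len(cmd):
            cmd[i + 1] = node_task_id
    return " ".join(cmd)

print(worker_command(
    ["python", "flow.py", "step", "train", "--task-id", "control-3", "--run-id", "3"],
    "control-3",
))
# -> python flow.py step train --task-id 3-node-$AWS_BATCH_JOB_NODE_INDEX --run-id 3
```

Because the task ID is still a discrete token at this point, no string-wide substitution is needed at all.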
I'm not 100% sure of the flow, but I think that is possible if the --task-id flag is the only thing that needs to be changed. Looking at the command, I'm not sure whether there is something else that requires us to modify the task ID (e.g. MF_PATHSPEC). If task-id is the only thing, then I can definitely make that change.
I've updated the code. I think it is difficult to fully rely on batch_cli.py#251 because we don't have access to $AWS_BATCH_JOB_NODE_INDEX as this would only be available at runtime for a given worker MNP node, I believe. But I've added more placeholders in batch_cli.py#251 that should make the regex more reliable. I'm less familiar with the step_functions file as I'm not using that right now.
Is it possible to get approval on this soon?
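As a rough illustration of the anchored-substitution idea, something like the sketch below (the patterns, names, and simplified pathspec format here are illustrative, not the exact code in batch_cli.py):

```python
import re

def rewrite_for_worker(cmd, task_id):
    node_id = task_id.replace("control-", "") + "-node-$AWS_BATCH_JOB_NODE_INDEX"
    # Only rewrite the id where it directly follows the --task-id flag ...
    cmd = re.sub(
        r"(--task-id\s+)%s\b" % re.escape(task_id),
        lambda m: m.group(1) + node_id,
        cmd,
    )
    # ... or where it is the trailing task component of MF_PATHSPEC
    # (the pathspec shown below is simplified for the example).
    cmd = re.sub(
        r"(MF_PATHSPEC=\S+/)%s\b" % re.escape(task_id),
        lambda m: m.group(1) + node_id,
        cmd,
    )
    return cmd

print(rewrite_for_worker(
    "MF_PATHSPEC=3/train/3 python flow.py step train --task-id 3 --split-index 3",
    "3",
))
# -> MF_PATHSPEC=3/train/3-node-$AWS_BATCH_JOB_NODE_INDEX python flow.py step train
#    --task-id 3-node-$AWS_BATCH_JOB_NODE_INDEX --split-index 3
```

The anchors ensure that other occurrences of the same integer (here, the run ID and the split index) are left alone, while the literal $AWS_BATCH_JOB_NODE_INDEX is only expanded by the shell inside the worker container, where AWS Batch sets that variable.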
Also @savingoyal
Bumping this!
Hi, just wanted to check in on this?
Bump!
Sorry again for the long wait. I'm finally able to put some hours into this and will check out the changes today.
Functionally, this PR is now working as expected. Had some suggestions for cleanup:
```python
# Set the ulimit of number of open files to 65536. This is because we cannot set it easily once worker processes start on Batch.
# job_definition["containerProperties"]["linuxParameters"]["ulimits"] = [
#     {
#         "name": "nofile",
#         "softLimit": 65536,
#         "hardLimit": 65536,
#     }
# ]
```
can this be cleaned up?
```python
# Prefer the task role by default when running inside AWS Batch containers
# by temporarily removing higher-precedence env credentials for this process.
# This avoids AMI-injected AWS_* env vars from overriding the task role.
# Outside of Batch, we leave env vars untouched unless explicitly opted-in.
if "AWS_BATCH_JOB_ID" in os.environ:
    _aws_env_keys = [
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "AWS_SESSION_TOKEN",
        "AWS_PROFILE",
        "AWS_DEFAULT_PROFILE",
    ]
    _present = [k for k in _aws_env_keys if k in os.environ]
    print(
        "[Metaflow] AWS credential-related env vars present before Batch client init:",
        _present,
    )
    _saved_env = {
        k: os.environ.pop(k) for k in _aws_env_keys if k in os.environ
    }
    try:
        self._client = get_aws_client("batch")
    finally:
        # Restore prior env for the rest of the process
        for k, v in _saved_env.items():
            os.environ[k] = v
else:
    self._client = get_aws_client("batch")
```
is this change relevant to the batch parallel issue, or something different? the PR seems to work fine without this part as well
```python
else:
    # Fallback in case of unexpected format
    run_id, step, task_id = self.run_id, step_name, parts[-1]
tds = TaskDataStore(
```
task_pre_step actually binds a task datastore to self. Would that work as well?
```python
    flow, step_name, last_completion_timeout
)

def _wait_for_mapper_tasks_batch_api(
```
nit: `_wait_for_mapper_tasks_datastore`, as the other one is suffixed `_metadata`.
```python
if os.environ.get("METAFLOW_DEBUG_BATCH_POLL") in (
    "1",
    "true",
    "True",
):
```
Is this necessary to have as a debug flag? Were there a lot of datastore errors encountered to necessitate this? I'd opt for either always printing the error or never printing it, depending on how relevant it is for the user.
alternatively we can revisit this separately and consider adding a proper debug flag for batch
If we aren't using the Metaflow metadata service provider, Metaflow defaults to generating task IDs locally. These task IDs are simple integers, incremented sequentially by `new_task_id` in `metaflow/plugins/metadata_providers/local.py`. This presents a problem for AWS Batch MNP, since we currently mass-replace the task ID throughout the secondary command. If the task ID is a simple integer, this replaces many unintended occurrences. For example, if the task ID is "3", there could be many instances of "3" in the secondary command, each of which would get "-node-$AWS_BATCH_JOB_NODE_INDEX" appended, when really we only want to replace the actual task ID.
Here, I've identified the only two places in the command that carry the actual task ID and need replacing: the input task ID passed via `--task-id` and the task ID in `MF_PATHSPEC`. It is better to use more specific regexes this way.
Furthermore, if there is no metadata provider, I've added a new check for control MNP jobs to finish by checking the S3 datastore instead.
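A rough sketch of that datastore-based wait is below; the `task_is_done` callable is a hypothetical stand-in for whatever completion check the datastore exposes (e.g. looking for a task's completion marker in S3), and the timeout and poll interval values are arbitrary:

```python
import time

def wait_for_mapper_tasks(task_ids, task_is_done, timeout=600, poll_interval=10):
    """Poll until every mapper task reports completion, or raise on timeout."""
    deadline = time.time() + timeout
    pending = set(task_ids)
    while True:
        # Keep only the tasks whose completion marker is not yet visible.
        pending = {t for t in pending if not task_is_done(t)}
        if not pending:
            return
        if time.time() > deadline:
            raise TimeoutError(
                "mapper tasks still incomplete after %ss: %s"
                % (timeout, sorted(pending))
            )
        time.sleep(poll_interval)
```

This mirrors the behavior of the metadata-based wait but relies only on what the datastore can observe, which is all that is available when no metadata service is configured.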