We will use a single-level JSON for algorithm selection including device-specific algorithms. Remove the collective ADI for now. We'll add the mechanism of selecting device-level algorithms later. gen_coll.py is updated to skip calling MPID_ collectives. Device collective CVARs are removed.
We will add the mechanism of selecting device-layer algorithms later.
Temporarily comment out the composition code that calls netmod/shm collectives since we will remove these apis next. Some NULL composition functions are removed.
We will replace the device-algorithm selection later at the MPIR layer.
The auto selection should take care of restrictions. Error out rather than fall back. If the user uses a CVAR to select a specific algorithm, we should check restrictions before jumping to the algorithm. We will design a common fallback handling there.
In addition to prototypes for the various algorithm functions, generate enums and structs in coll_algos.h as well. Assume add_prototype won't be called redundantly and keep it simple.
Generate constants for enum MPIR_Csel_coll_type.
Note that we split the entries between intra and inter, e.g.
MPIR_CSEL_COLL_TYPE__INTRA_BCAST,
MPIR_CSEL_COLL_TYPE__INTRA_IBCAST,
MPIR_CSEL_COLL_TYPE__INTER_BCAST,
MPIR_CSEL_COLL_TYPE__INTER_IBCAST
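The intra/inter split can be sketched in Python, the language of maint/gen_coll.py. This is a minimal sketch, not the generator's actual code: the function name gen_coll_type_enum and the one-item input list are hypothetical, and only the MPIR_CSEL_COLL_TYPE__ naming pattern comes from the commit message.

```python
def gen_coll_type_enum(coll_names):
    """Return C enumerator names, all intra entries first, then inter."""
    entries = []
    for comm_kind in ("INTRA", "INTER"):
        for name in coll_names:
            # Each blocking collective is paired with its nonblocking variant.
            for variant in (name, "i" + name):
                entries.append(
                    "MPIR_CSEL_COLL_TYPE__%s_%s" % (comm_kind, variant.upper())
                )
    return entries


# Hypothetical input: a single collective name.
enum_entries = gen_coll_type_enum(["bcast"])
print(",\n".join(enum_entries))
```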
Temporary patch code to keep the old code buildable.
Define MPIR_Csel_node_s and generate enum MPII_Csel_container_type, which defines the list of algorithm id constants. MPIR_Csel_node_s will replace the csel_node_s.
Update load_coll_algos to load coll_algorithms.txt with a conditions section. Every condition maps to a condition function. Also generate G.algo_list as a flat list so we can dump the table of algorithms and generate sequential algorithm IDs.
Remove the optional validate_tree and print_tree to facilitate transitioning to auto-generated parsing routines. We will add back the debug print routine later.
Add back the debug print routine.
Replace hard-coded parsing routines with autogenerated lookup tables and subroutines, including MPIR_Coll_algo_names, MPII_Csel_parse_container_params, MPII_Csel_parse_operator, and MPII_Csel_run_condition. Simplify MPIR_Csel_node_s and MPIR_Csel_node_type_e. The auto-generation from coll_algorithms.txt is in the later commits.
These routines are replaced by condition functions (see previous commits).
Dump a wrapper function for each algorithm that takes (cont, coll_sig). Separately declare algorithm prototypes. Separately declare sched_auto prototypes.
Generate collective implementation functions that assemble coll_sig and call MPIR_Coll_auto. Remove or replace the old MPIR_Xxx_impl and MPIR_Xxx_allcomm_auto interfaces. Their original functions, CVAR selection and JSON selection, are now in MPIR_Coll_auto.
Current compositional algorithms call MPIR collectives. We will refactor them later, but for now, generate wrapper MPIR functions that call the _impl functions.
Add MPIR_init_coll_sig and MPID_init_coll_sig so we can add arbitrary attr bits or additional fields without hacking maint/gen_coll.py.
Generate those IDs, table entries, and json parsing from coll_algorithms.txt.
They are replaced by MPIR_Coll_nb.
In coll_algorithms.txt, add the "inline" attribute to skip adding a prototype for the corresponding algorithm function, since it is inlined in the headers. Add "func_name" to directly specify the algorithm function name. Add "macro_guard" to specify a preprocessor condition for the algorithm function. For example, the ch4 posix algorithm function needs to be protected by "#if defined(MPIDI_CH4_SHM_POSIX)" (to be defined).
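How the three attributes might affect prototype generation can be sketched as follows. This is an illustration under stated assumptions, not the code in the generator scripts: the function gen_prototype, the dictionary layout, the "params" key, and the fallback naming rule are all hypothetical. Only the attribute names and the guard expression come from the commit message.

```python
def gen_prototype(algo):
    """Return C prototype text for one algorithm entry ('' when inlined)."""
    if algo.get("inline"):
        # The function is already inlined in a header: emit no prototype.
        return ""
    # "func_name" overrides the name that would otherwise be derived
    # (the derivation rule used here is a placeholder).
    func_name = algo.get("func_name") or "MPIR_" + algo["name"]
    lines = ["int %s(%s);" % (func_name, algo["params"])]
    guard = algo.get("macro_guard")
    if guard:
        # Wrap the prototype so it only exists when the guard holds.
        lines = ["#if " + guard] + lines + ["#endif"]
    return "\n".join(lines)


# Hypothetical entry; the parameter list is deliberately elided.
posix_bcast = {
    "name": "bcast_release_gather",
    "func_name": "MPIDI_POSIX_mpi_bcast_release_gather",
    "macro_guard": "defined(MPIDI_CH4_SHM_POSIX)",
    "params": "/* parameters elided */",
}
guarded_text = gen_prototype(posix_bcast)
print(guarded_text)
```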
Add conditional conditions: such a condition function can only be called inside a preprocessor macro guard. We need to generate another header file, coll_autogen.h, that is loaded after mpidpost.h. "coll_algos.h" goes into mpir_coll.h, which is included in between mpidpre.h and mpidpost.h. Refactor a bit so that all the condition parsing logic is wrapped in functions such as get_condition_name, get_condition_func, etc., and they are defined together.
Sometimes we may want the restriction check and the condition check to behave differently. For example, an algorithm like release_gather normally gets selected only after the user calls the collective a certain number of times. But if the user selects the algorithm by CVAR, it does not make sense to do this repeat check in the restriction check.
Rather than adding individual boolean flags, use a bit mask "flags" instead. It is easier to make sure we zero-initialize all the flags that way.
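The difference between the two kinds of check, and the use of a single bit mask, can be modeled in a few lines of Python. This is a sketch under assumptions: the flag name RESTRICTION_CHECK, the function signature, and the threshold parameter are hypothetical, and the real condition functions are generated C code.

```python
from enum import IntFlag


class CheckFlags(IntFlag):
    NONE = 0               # zero-initialized mask: every flag is off
    RESTRICTION_CHECK = 1  # hypothetical bit: validating a CVAR-selected algorithm


def release_gather_condition(is_posix_intranode, num_calls, threshold,
                             flags=CheckFlags.NONE):
    """Return True if release_gather may be used for this call."""
    if not is_posix_intranode:
        # A hard restriction applies in both kinds of check.
        return False
    if flags & CheckFlags.RESTRICTION_CHECK:
        # The user picked the algorithm explicitly: skip the repeat check.
        return True
    # Auto selection: only pick the algorithm after enough repeated calls.
    return num_calls >= threshold


print(release_gather_condition(True, 0, 5))
print(release_gather_condition(True, 0, 5, CheckFlags.RESTRICTION_CHECK))
```

With one mask, a caller that sets nothing passes zero and gets the default behavior, which is the zero-initialization benefit the commit message describes.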
Provide a simple mechanism for a rank to dump collective algorithm counters. Set MPIR_CVAR_DUMP_COLL_ALGO_COUNTERS to the global rank of the process that should dump. It is undesirable for every process to dump, yet it does not always make sense for rank 0 to dump, especially when we don't always use comm world. The counting happens in the CSEL framework, so internal collectives are not counted when we internally use _fallback algorithms.
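The rank-gated dump can be modeled as below. This is a hypothetical sketch of the idea only: the helper names, the counter layout, the algorithm names, and the output format are assumptions, and the actual counters live in the C CSEL code.

```python
from collections import Counter

# Hypothetical per-process counters, keyed by algorithm name.
algo_counters = Counter()


def count_algo(algo_name):
    """Record one use of an algorithm chosen by the selection framework."""
    algo_counters[algo_name] += 1


def dump_counters(my_global_rank, dump_rank):
    """Return the lines to print, or [] when this rank is not the chosen one."""
    if my_global_rank != dump_rank:
        return []
    return ["%s: %d" % (name, n) for name, n in sorted(algo_counters.items())]


# Placeholder algorithm names, used only for illustration.
count_algo("bcast_binomial")
count_algo("bcast_binomial")
count_algo("allreduce_recursive_doubling")

# dump_rank stands in for the value of MPIR_CVAR_DUMP_COLL_ALGO_COUNTERS.
for line in dump_counters(my_global_rank=3, dump_rank=3):
    print(line)
```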
Enable CVARs and JSONs to select ch4-posix layer release_gather algorithms. Select MPIDI_POSIX_mpi_bcast_release_gather if it passes the MPIDI_CH4_release_gather condition check, which only passes if comm is a posix intranode comm.
Extend the previous commit to activate release_gather algorithm for reduce, allreduce, and barrier.
Remove MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE. It is now replaced with MPIR_CVAR_COLL_SELECTION_JSON_FILE. We could have reused the same CVAR name, but since we altered the JSON syntax, using a different name prevents potential confusion.
Parse the json as a list of named subtrees such as:
{
"name=main": {...},
"name=bcast-intra-auto": {...},
...
}
Inside the subtree, we can refer to the named subtree using "call=name".
If the json does not contain named subtrees, treat it as a single tree
with the name "main".
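The loading rule described above can be sketched in Python with the standard json module. The function name load_named_subtrees and the subtree contents in the example are hypothetical; the "name=" key prefix and the fallback to a single tree called "main" follow the commit message. The actual parser is C code.

```python
import json

NAME_PREFIX = "name="


def load_named_subtrees(json_text):
    """Return a dict mapping subtree names to subtrees."""
    root = json.loads(json_text)
    named = {
        key[len(NAME_PREFIX):]: subtree
        for key, subtree in root.items()
        if key.startswith(NAME_PREFIX)
    }
    if not named:
        # No named subtrees: the whole document is the single tree "main".
        named = {"main": root}
    return named


# Subtree contents here are placeholders, not real selection syntax.
with_names = load_named_subtrees(
    '{"name=main": {"call=bcast-intra-auto": {}}, "name=bcast-intra-auto": {}}'
)
without_names = load_named_subtrees('{"placeholder": {}}')
print(sorted(with_names))
print(sorted(without_names))
```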
Load src/mpi/coll/coll_selection.json as named subtrees. Add MPIR_Coll_run_tree, which runs the selection on a subtree. Replace MPIR_Coll_auto with MPIR_Coll_json, and add MPIR_Coll_run_tree(csel_tree_auto, coll_sig) to allow recursive algorithms such as compositional algorithms. csel_tree_auto will fall back to csel_tree_main if it is not defined in the json file. Similarly, we can easily introduce more predefined subtrees later, e.g. bcast-intra-auto. In CVAR selection, "auto" should be the default and its value should be 0. Thus it should automatically fall through and run on csel_tree_main.
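The fallback from the auto subtree to main, together with "call=name" references, can be modeled as follows. This is a sketch under several assumptions: the subtree name "auto", the "algorithm=" leaf key, the algorithm name, the depth limit, and the function run_tree are illustrative only, and condition evaluation against coll_sig is left out entirely.

```python
MAX_CALL_DEPTH = 16  # hypothetical guard against cyclic call= references


def run_tree(trees, name, depth=0):
    """Return the algorithm name selected by the subtree called `name`."""
    if depth > MAX_CALL_DEPTH:
        raise RecursionError("call= chain too deep at subtree: " + name)
    if name not in trees:
        if name == "auto":
            # The auto subtree is optional: fall back to "main".
            return run_tree(trees, "main", depth + 1)
        raise KeyError("undefined subtree: " + name)
    for key in trees[name]:
        if key.startswith("call="):
            # Jump to another named subtree.
            return run_tree(trees, key[len("call="):], depth + 1)
        if key.startswith("algorithm="):
            # Leaf: an algorithm has been selected.
            return key[len("algorithm="):]
    return None


trees = {
    "main": {"call=bcast-intra-auto": {}},
    "bcast-intra-auto": {"algorithm=binomial": {}},
}
selected = run_tree(trees, "auto")
print(selected)
```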
Pull Request Description
coll_algorithms.txt catalogs all collective algorithms and conditions
coll_selection.json specifies decision tree
MPIR_CVAR_DUMP_COLL_ALGO_COUNTERS for debug summary
MPIR_CVAR_DUMP_COLL_ALGO_COUNTERS
[skip warnings]
Discussion
Reference: #7544
Also see comments in #7598 and #7666
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.