Fix reinitialization issues when using total forces in NAMD #887
While commit 8cc8ef9 seems to fix the problem, in local testing I still get a crash from
Commit 8cc8ef9 should have solved the crash in #891. As I have said, while the crash no longer appears on GitHub CI, I still get another crash in my local test with:

```diff
diff --git a/src/GlobalMasterServer.C b/src/GlobalMasterServer.C
index 8988998b..a525f73a 100644
--- a/src/GlobalMasterServer.C
+++ b/src/GlobalMasterServer.C
@@ -468,6 +468,9 @@ int GlobalMasterServer::callClients() {
if (positions[i].atomID == rqAtoms[j]){
clientAtomPositions.add(receivedAtomPositions[positions[i].index]);
clientAtomIDs.add(rqAtoms[j]);
+ if (master->requestedTotalForces()) {
+ clientReceivedForceIDs.add(rqAtoms[j]);
+ }
--i; // rqAtoms may contain repeats
if ( ++j == num_atoms_requested ) break;
}
@@ -498,7 +501,7 @@ int GlobalMasterServer::callClients() {
gtf_i,gtf_i+(numForceSenders?master->old_num_groups_requested:0),
goi_i, goi_e, gov_i, gov_e,
forced_atoms_i,forced_atoms_e,forces_i,
- receivedForceIDs.begin(),receivedForceIDs.end(),receivedTotalForces.begin());
+ mf_i, mf_e, receivedTotalForces.begin());
NAMD_EVENT_STOP(1, NamdProfileEvent::GM_PROC_DATA);
a_i = receivedAtomIDs.begin();
 p_i = receivedAtomPositions.begin();
```

But then (i) the list of total-force atom IDs is sorted the same way as the list of normal atom IDs requesting positions, which changes the numerical results of the total forces and fails the tests that require them; and (ii) I don't know if Colvars needs to "introspect" the forces from TCL forces. In other words, I don't know if this is what Colvars wants.
```cpp
while ((i < this->size()) && (stream >> (*this)[i])) {
```

@giacomofiorin Could you look into the issues above?
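As a standalone sketch of the index-sorting pattern at issue in point (i) above (only `position_index` and `atomID_less` mirror names from the patch; the data values and the rest of the scaffolding are made up for illustration):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Mirrors the names used in the patch; everything else is illustrative.
struct position_index {
  int atomID; // global atom ID carried by the message
  int index;  // slot of the corresponding value in the received buffer
};

struct atomID_less {
  bool operator()(const position_index &a, const position_index &b) const {
    return a.atomID < b.atomID;
  }
};

int main() {
  // Pretend these arrived out of order from several message sources
  std::vector<int> receivedIDs = {42, 7, 19};
  std::vector<double> receivedValues = {0.42, 0.07, 0.19};

  // Sort only the (atomID, index) pairs; the data buffer itself stays put
  std::vector<position_index> order;
  for (int j = 0; j < (int)receivedIDs.size(); ++j)
    order.push_back({receivedIDs[j], j});
  std::sort(order.begin(), order.end(), atomID_less());

  // Values are now served in atom-ID order: 7, 19, 42
  for (const position_index &p : order)
    std::printf("atom %d -> %.2f\n", p.atomID, receivedValues[p.index]);
  return 0;
}
```

Sorting only the index pairs leaves the received buffer untouched, but it also fixes the order in which values are handed to the client, which is what changes the total-force ordering in point (i).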
I have worked out a NAMD patch for the second issue that sorts the total forces array correctly and therefore does not produce the incorrect numerical results:

```diff
diff --git a/src/GlobalMasterServer.C b/src/GlobalMasterServer.C
index 8988998b..d04b6a5e 100644
--- a/src/GlobalMasterServer.C
+++ b/src/GlobalMasterServer.C
@@ -442,6 +442,30 @@ int GlobalMasterServer::callClients() {
}
sort(positions.begin(), positions.end(), atomID_less());
+ // Same as above, we need a vector to sort the atom IDs requiring total forces
+ vector <position_index> total_forces;
+ // Check if any of the clients requires total forces
+ bool total_forces_requested = false;
+ while (m_i != m_e) {
+ GlobalMaster *master = *m_i;
+ if (master->requestedTotalForces()) {
+ total_forces_requested = true;
+ break;
+ }
+ m_i++;
+ }
+ if (total_forces_requested) {
+ for (int j = 0; f_i != f_e; ++f_i, ++j) {
+ position_index temp;
+ temp.atomID = *f_i;
+ temp.index = j;
+ total_forces.push_back(temp);
+ }
+ sort(total_forces.begin(), total_forces.end(), atomID_less());
+ }
+
+ // Reset the client pointer;
+ m_i = clientList.begin();
/* call each of the masters with the coordinates */
while(m_i != m_e) {
int num_atoms_requested, num_groups_requested, num_gridobjs_requested;
@@ -474,6 +498,36 @@ int GlobalMasterServer::callClients() {
}
if ( j != num_atoms_requested ) NAMD_bug(
"GlobalMasterServer::callClients() did not find all requested atoms");
+ if (master->requestedTotalForces()) {
+ int k = 0;
+ // FIXME: There's no reliable way to check if this is really the first timestep
+ // At the first time step, the total forces may not be ready
+ const bool total_forces_ready = receivedTotalForces.size() == rqAtoms.size();
+ for (int i = 0; i < total_forces.size(); i++) {
+ if (total_forces[i].atomID == rqAtoms[k]){
+ clientReceivedForceIDs.add(rqAtoms[k]);
+ if (total_forces_ready) {
+ clientReceivedTotalForces.add(receivedTotalForces[total_forces[i].index]);
+ }
+ --i; // rqAtoms may contain repeats
+ if ( ++k == num_atoms_requested ) break;
+ }
+ }
+ // FIXME: For Colvars, it is possible to do
+ // cv configfile xxx.colvars
+ // run 20
+ // structure yyy.pdb
+ // cv reset
+ // cv configfile yyy.colvars
+ // run 20
+ // The problem arises when "yyy.colvars" and "xxx.colvars" select different sets of
+ // atoms for total forces. There is no good way to check if this is the first step and
+ // total forces should be skipped here (but Colvars would check it anyway and discard
+ // the invalid result).
+ // if ( total_forces_ready && k != num_atoms_requested ) {
+ // NAMD_bug("GlobalMasterServer::callClients() did not find all requested total forces");
+ // }
+ }
}
AtomIDList::iterator ma_i = clientAtomIDs.begin();
@@ -498,7 +552,7 @@ int GlobalMasterServer::callClients() {
gtf_i,gtf_i+(numForceSenders?master->old_num_groups_requested:0),
goi_i, goi_e, gov_i, gov_e,
forced_atoms_i,forced_atoms_e,forces_i,
- receivedForceIDs.begin(),receivedForceIDs.end(),receivedTotalForces.begin());
+ mf_i, mf_e, mtf_i);
NAMD_EVENT_STOP(1, NamdProfileEvent::GM_PROC_DATA);
a_i = receivedAtomIDs.begin();
 p_i = receivedAtomPositions.begin();
```

But I don't know if Colvars really needs it.
I would like to have a base-class function for the atoms map, because it's not a NAMD-specific issue. LAMMPS and GROMACS are not using it yet, but simply because they lack this optimization.
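A hypothetical sketch of what such a base-class function could look like (this is not the current `colvarproxy` API; all names here are invented for illustration):

```cpp
#include <unordered_map>

// Sketch of a backend-agnostic atoms map living in the proxy base class,
// so that the NAMD, LAMMPS and GROMACS proxies would share one lookup
// implementation instead of each maintaining its own.
class colvarproxy_atoms_sketch {
public:
  // Return the proxy-internal index for an engine atom ID, or -1 if absent
  int lookup_atom(int atom_id) const {
    auto it = atoms_map.find(atom_id);
    return (it == atoms_map.end()) ? -1 : it->second;
  }

protected:
  std::unordered_map<int, int> atoms_map; // engine atom ID -> internal index
};
```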
Ask the other folks at UIUC (including Jim), and you will find no fans of GlobalMaster's multiplexed design. But the alternative would have been multiple rounds of reduction/broadcast, and I think at the time it was not only more coding work but also potentially harmful to multinode performance.
I don't think we know what each person using Colvars really wants, but I agree that that's a very unlikely use case.
Well, you did that just hours after
Editing the PR's title and description for merging. Will close #891
I didn't use
I have tried to solve the issue and revised the NAMD patch that I posted here in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486. Could you take a look at that NAMD MR to see if Colvars really needs it to work correctly?
It's not just because of the multiple GlobalMaster objects, but also because the messages received by GlobalMasterServer are not in order, either by task ID or by atom. Typical solutions to this would be sorting the atom buffer to match the order that Colvars expects, or using a lookup table. I am less familiar with the CUDAGM code in the NAMD repo than with the interface code that you have added here in the Colvars repo, but it seems that you are effectively sorting the atomic coordinates as you gather them? I should also have been clearer on the other backends:
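To illustrate the lookup-table alternative mentioned above (a sketch with hypothetical names, not NAMD or Colvars code): build an atom-ID-to-slot map once per step, then serve each client's requested IDs in the client's own order with O(1) lookups, instead of sorting.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Vec3 { double x, y, z; };

std::vector<Vec3> gather_in_request_order(
    const std::vector<int> &received_ids,     // arrival order, unsorted
    const std::vector<Vec3> &received_values, // parallel to received_ids
    const std::vector<int> &requested_ids) {  // the order the client expects
  // Build the lookup table once: engine atom ID -> slot in the buffer
  std::unordered_map<int, std::size_t> slot_of;
  slot_of.reserve(received_ids.size());
  for (std::size_t j = 0; j < received_ids.size(); ++j)
    slot_of[received_ids[j]] = j;

  // Serve the client's IDs in the client's own order, O(1) per atom
  std::vector<Vec3> out;
  out.reserve(requested_ids.size());
  for (int id : requested_ids)
    out.push_back(received_values.at(slot_of.at(id))); // throws if missing
  return out;
}

int main() {
  std::vector<int> ids = {42, 7, 19};
  std::vector<Vec3> vals = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
  std::vector<Vec3> sorted = gather_in_request_order(ids, vals, {7, 19, 42});
  return (int)sorted.size() - 3; // exits 0 on success
}
```

Compared with sorting, the map costs O(N) to build per step, but it decouples the delivery order from the sort order, so each client sees its atoms exactly as it requested them.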
Doing just that. I think it mostly depends on what the other NAMD folks will say. But if we can assume that the only atoms and forces that GlobalMasterColvars sees are the ones it requested, that would be an improvement.
Given this open MR, should we wait a few days to merge this? Or merge it now (after including patches here to reproduce the NAMD changes that you're already submitting in that MR)?
This PR currently doesn't completely solve the crash issue. Even after this one and https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486 get merged, I still have no idea how to re-initialize the
You mean that there currently is no callback to Colvars to state that a new

(I can currently run that test successfully on my local machine even without the changes in your NAMD MR...)
I'm not sure. Maybe you can make any one of
It does run with the default, optimized compilation flags, but doesn't work if you use
Yeah, that's definitely required, I'll do that. The flip side of it is catching the use case when the system's topology changes while Colvars is still defined. That may need a change in NAMD. |
The fix looks good! I wish I had seen the bug... Do you think a variant of the
I have no earthly idea! I couldn't see the bug locally either, I just took @HanatoK's word for it... |
Let me try to push a commit to force the boundary check and call |
Hi @jhenin and @giacomofiorin! Could you try bc27da1 locally? The ARM64 build currently skips the
Indeed! This does reveal the crash, and therefore the fix. Now I wish I had an ARM64 test box to investigate - do you? |
I have one, but the issue now is that on ARM64, the total force is
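An aside on why such crashes can be platform-dependent (illustrative code with hypothetical names, not the actual Colvars fix): an out-of-bounds read through `std::vector::operator[]` is undefined behavior, so it may crash on ARM64 while passing silently on x86-64, whereas `at()` fails deterministically; forcing a boundary check serves the same purpose.

```cpp
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
  std::vector<double> total_forces(3, 0.0);
  std::size_t i = 5; // a stale index after a reset, for illustration

  // total_forces[i] would be undefined behavior: it may crash on one
  // platform and silently read garbage on another.
  try {
    double f = total_forces.at(i); // bounds-checked: always throws here
    std::cout << f << "\n";
  } catch (const std::out_of_range &err) {
    std::cout << "caught: " << err.what() << "\n";
  }
  return 0;
}
```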
Now that we are overriding the GlobalMaster class functions, we can pass protected members through a const accessor
This is just to trigger the bug on CI. You can revert this commit later after solving the issue.
This might solve the resetting and reloading crash in the NAMD interface.
Currently spiff has no way to ignore the comparison of nan and -nan.
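For context (illustrative code only, not part of the PR): `nan` and `-nan` differ only in the IEEE 754 sign bit, which printf-style formatting prints, so a textual comparison flags them even though any numeric test treats both as NaN.

```cpp
#include <cmath>
#include <cstdio>

int main() {
  double plain = std::nan("");                 // quiet NaN, sign bit clear
  double negated = std::copysign(plain, -1.0); // same NaN, sign bit set

  // printf shows the sign bit, hence "nan" vs "-nan" in output files
  std::printf("%f vs %f\n", plain, negated);

  // Numerically, both are simply NaN; a comparison tolerating the sign
  // difference only needs isnan(), which a plain text diff cannot apply
  std::printf("both NaN: %d %d\n", (int)std::isnan(plain),
              (int)std::isnan(negated));
  return 0;
}
```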
…orce is zero after collection
…invalid

This commit simplifies the logic of resetting the total forces by just checking if they are available before accumulating them in collect_cvc_total_forces. This can ensure that ft_reported is not reset after calc_colvar_properties. This commit also simplifies the GPU code path by just reusing the check in collect_cvc_total_forces.
Remove the check (see Colvars#887 (comment))
Just rebased this PR against the master branch, since #881 has just been merged.
```cpp
void colvarproxy::set_total_forces_invalid()
{
  std::fill(atoms_total_forces.begin(), atoms_total_forces.end(), cvm::rvector(0.0, 0.0, 0.0));
  std::fill(atom_groups_total_forces.begin(), atom_groups_total_forces.end(),
            cvm::rvector(0.0, 0.0, 0.0));
}
```
I'm not sure if it's necessary to clear the buffers. It looks fine even if the GPU buffers are not cleared.
Do you expect a performance gain by not clearing them? Is there a risk that it would become necessary due to upstream changes?
Either clearing or not is OK. I believe set_total_forces_invalid is not called very frequently, so performance is not an issue.
(2025-12-09: rewritten description - @giacomofiorin)
Fixes two bugs triggered by a (relatively infrequent) combination of features, namely using total forces in a simulation where the system's topology is changed mid-run:

- a crash when Colvars is re-initialized after the topology change (#891)
- `nan` in the output, due to a division by zero because not all Colvars properties had been set previously

Fixes #778