
Conversation

@jhenin
Member

@jhenin jhenin commented Nov 19, 2025

(2025-12-09: rewritten description - @giacomofiorin)

Fixes two bugs triggered by the (relatively infrequent) combination of features in which total forces are used in a simulation whose topology is changed mid-run:

  • Fix a boundary error when accessing atomic total forces from the previous step
  • Fix the incorrect computation of the total forces after a configuration change (the symptom was a NaN in the output, caused by a division by zero because not all Colvars properties had been set beforehand); a schematic sketch of this failure mode follows below

Fixes #778
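
As a schematic sketch of the second failure mode (illustrative only, not the actual Colvars code; the function and variable names below are made up), the problem amounts to a normalization whose denominator has not been set yet right after the configuration change:

// Schematic sketch, not the actual Colvars implementation: a normalization
// computed right after a configuration change must check that its
// denominator has already been set, otherwise the division yields inf/NaN.
double normalized_total_force(double ft_sum, long n_samples)
{
  if (n_samples <= 0) {
    // Property not set yet for the new configuration: return a neutral
    // value instead of dividing by zero below.
    return 0.0;
  }
  return ft_sum / static_cast<double>(n_samples);
}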

@giacomofiorin
Member

giacomofiorin commented Nov 19, 2025

FYI this is also in the integrate_cudagm branch (outdated)

@giacomofiorin
Member

giacomofiorin commented Nov 19, 2025

(Replaced with the commit I had before, so that there is no duplicate when we rebase either branch) (outdated)

@jhenin
Member Author

jhenin commented Nov 19, 2025

Ah, I did not notice that branch since it has no PR. If you'd rather merge the other branch soon, please close this PR and open one from integrate_cudagm. (outdated)

@HanatoK
Member

HanatoK commented Nov 25, 2025

While commit 8cc8ef9 seems to fix the problem, in local testing I still get a crash from 003_reinitatoms since atoms_map is not correctly re-initialized after cv reset.

@HanatoK
Member

HanatoK commented Dec 3, 2025

Commit 8cc8ef9 should have solved the crash in #891. As I said, while the crash no longer appears on GitHub CI, I still get another crash in my local test with 003_reinitatoms, which may have several causes:

  • When initializing atom groups, Colvars has to communicate with the proxy directly (the SOA code init_atom_from_proxy inherits the design from the AOS code). In the case of colvarproxy_namd, this requires atoms_map to be initialized with -1, but after cv reset the map is cleared. There is no way to re-initialize atoms_map after colvarproxy_namd::reset(), since colvarproxy::add_config and colvarproxy::engine_ready() are not overridable. The only hacky workaround is to call init_atoms_map() in colvarproxy_namd::reset().
  • I don't know why the atom IDs for which total forces are enabled are not filtered out per client in the NAMD GlobalMaster. I could apply the following patch to solve the issue:
diff --git a/src/GlobalMasterServer.C b/src/GlobalMasterServer.C
index 8988998b..a525f73a 100644
--- a/src/GlobalMasterServer.C
+++ b/src/GlobalMasterServer.C
@@ -468,6 +468,9 @@ int GlobalMasterServer::callClients() {
         if (positions[i].atomID == rqAtoms[j]){
           clientAtomPositions.add(receivedAtomPositions[positions[i].index]);
           clientAtomIDs.add(rqAtoms[j]);
+          if (master->requestedTotalForces()) {
+            clientReceivedForceIDs.add(rqAtoms[j]);
+          }
           --i;  // rqAtoms may contain repeats
           if ( ++j == num_atoms_requested ) break;
         }
@@ -498,7 +501,7 @@ int GlobalMasterServer::callClients() {
                         gtf_i,gtf_i+(numForceSenders?master->old_num_groups_requested:0),
                         goi_i, goi_e, gov_i, gov_e,
                         forced_atoms_i,forced_atoms_e,forces_i,
-      receivedForceIDs.begin(),receivedForceIDs.end(),receivedTotalForces.begin());
+                        mf_i, mf_e, receivedTotalForces.begin());
     NAMD_EVENT_STOP(1, NamdProfileEvent::GM_PROC_DATA);
     a_i = receivedAtomIDs.begin();
     p_i = receivedAtomPositions.begin();

But then (i) the list of total-force atom IDs is sorted the same way as the list of regular atom IDs requesting positions, which changes the numerical results of the total forces and makes the tests that require total forces fail, and (ii) I don't know whether Colvars needs to "introspect" the forces coming from TCL forces. In other words, I don't know if this is what Colvars wants;

  • Solving the two issues above is still not enough. There is another crash in
    while ((stream >> (*this)[i]) && (i < this->size())) {

    Here, the bounds check should come first, so that the element is never accessed once i reaches the end (a short standalone illustration follows below):
while ((i < this->size()) && (stream >> (*this)[i])) {
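
A short, self-contained illustration of why the operand order matters here (this is not the Colvars code, only a demonstration of the short-circuit behavior of &&; the vector and stream below are made up):

#include <cstddef>
#include <iostream>
#include <sstream>
#include <vector>

int main()
{
  std::vector<double> v(3);
  std::istringstream stream("1 2 3 4 5");
  std::size_t i = 0;
  // Safe order: once i == v.size(), the left operand is false and the
  // right operand is never evaluated, so v[i] is never accessed out of
  // bounds. With the operands swapped, v[3] would be accessed first.
  while ((i < v.size()) && (stream >> v[i])) {
    ++i;
  }
  std::cout << i << " values read\n";  // prints "3 values read"
  return 0;
}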

@giacomofiorin Could you look into the issues above?

@HanatoK
Member

HanatoK commented Dec 3, 2025

I have worked out a NAMD patch for the second issue that sorts the total forces array correctly and therefore avoids the incorrect numerical results:

diff --git a/src/GlobalMasterServer.C b/src/GlobalMasterServer.C
index 8988998b..d04b6a5e 100644
--- a/src/GlobalMasterServer.C
+++ b/src/GlobalMasterServer.C
@@ -442,6 +442,30 @@ int GlobalMasterServer::callClients() {
   }
   sort(positions.begin(), positions.end(), atomID_less());
 
+  // Same as above, we need a vector to sort the atom IDs requiring total forces
+  vector <position_index> total_forces;
+  // Check if any of the clients requires total forces
+  bool total_forces_requested = false;
+  while (m_i != m_e) {
+    GlobalMaster *master = *m_i;
+    if (master->requestedTotalForces()) {
+      total_forces_requested = true;
+      break;
+    }
+    m_i++;
+  }
+  if (total_forces_requested) {
+    for (int j = 0; f_i != f_e; ++f_i, ++j) {
+      position_index temp;
+      temp.atomID = *f_i;
+      temp.index = j;
+      total_forces.push_back(temp);
+    }
+    sort(total_forces.begin(), total_forces.end(), atomID_less());
+  }
+
+  // Reset the client pointer;
+  m_i = clientList.begin();
   /* call each of the masters with the coordinates */
   while(m_i != m_e) {
     int num_atoms_requested, num_groups_requested, num_gridobjs_requested;
@@ -474,6 +498,36 @@ int GlobalMasterServer::callClients() {
       }
       if ( j != num_atoms_requested ) NAMD_bug(
         "GlobalMasterServer::callClients() did not find all requested atoms");
+      if (master->requestedTotalForces()) {
+        int k = 0;
+        // FIXME: There's no reliable way to check if this is really the first timestep
+        // At the first time step, the total forces may not be ready
+        const bool total_forces_ready = receivedTotalForces.size() == rqAtoms.size();
+        for (int i = 0; i < total_forces.size(); i++) {
+          if (total_forces[i].atomID == rqAtoms[k]){
+            clientReceivedForceIDs.add(rqAtoms[k]);
+            if (total_forces_ready) {
+              clientReceivedTotalForces.add(receivedTotalForces[total_forces[i].index]);
+            }
+            --i;  // rqAtoms may contain repeats
+            if ( ++k == num_atoms_requested ) break;
+          }
+        }
+        // FIXME: For Colvars, it is possible to do
+        // cv configfile xxx.colvars
+        // run 20
+        // structure yyy.pdb
+        // cv reset
+        // cv configfile yyy.colvars
+        // run 20
+        // The problem arises when "yyy.colvars" and "xxx.colvars" select different sets of
+        // atoms for total forces. There is no good way to check if this is the first step and
+        // total forces should be skipped here (but Colvars would check it anyway and discard
+        // the invalid result).
+        // if ( total_forces_ready && k != num_atoms_requested ) {
+        //   NAMD_bug("GlobalMasterServer::callClients() did not find all requested total forces");
+        // }
+      }
     }
 
     AtomIDList::iterator ma_i = clientAtomIDs.begin();
@@ -498,7 +552,7 @@ int GlobalMasterServer::callClients() {
                         gtf_i,gtf_i+(numForceSenders?master->old_num_groups_requested:0),
                         goi_i, goi_e, gov_i, gov_e,
                         forced_atoms_i,forced_atoms_e,forces_i,
-      receivedForceIDs.begin(),receivedForceIDs.end(),receivedTotalForces.begin());
+                        mf_i, mf_e, mtf_i);
     NAMD_EVENT_STOP(1, NamdProfileEvent::GM_PROC_DATA);
     a_i = receivedAtomIDs.begin();
     p_i = receivedAtomPositions.begin();

But I don't know if Colvars really needs it.

@giacomofiorin
Member

* When initializing atom groups, Colvars has to communicate with the proxy directly (the SOA code `init_atom_from_proxy` inherits the design from the AOS code). In the case of `colvarproxy_namd`, this requires that the `atoms_map` to be initialized with -1, but after `cv reset` the map is cleared. There is no way to re-initialize `atoms_map` after `colvarproxy_namd::reset()` since `colvarproxy::add_config` and `colvarproxy::engine_ready()` are not overridable. The only hacky way is to call `init_atoms_map()` in `colvarproxy_namd::reset()`.

I would like to have a base-class function for the atoms map, because it's not a NAMD-specific issue. LAMMPS and GROMACS are not using it yet, but simply because they lack this optimization.
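
To make that concrete, a base-class helper could look roughly like this (a minimal sketch assuming a plain std::vector<int> lookup table; the class and function names are illustrative, not the actual colvarproxy API):

#include <cstddef>
#include <vector>

// Hypothetical base-class helper: a lookup table from system atom ID to the
// index of the corresponding Colvars atom, with -1 meaning "not used".
class atoms_map_helper {
public:
  // (Re)size the table to the number of atoms in the system and mark every
  // entry as unused; this is what would need to happen again after a reset.
  void init_atoms_map(std::size_t n_system_atoms)
  {
    atoms_map.assign(n_system_atoms, -1);
  }

  void register_atom(int atom_id, int colvars_index)
  {
    atoms_map[atom_id] = colvars_index;
  }

  int lookup(int atom_id) const
  {
    return atoms_map[atom_id];  // -1 if the atom is not used by Colvars
  }

protected:
  std::vector<int> atoms_map;
};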

* I don't know why the atom IDs with total forces enabled in NAMD GM are not filtered out for each client. I could apply the following patch to solve the issue:

Ask the other folks at UIUC (including Jim), and you will find no fans of GlobalMaster's multiplexed design. But the alternative would have been multiple rounds of reduction/broadcast, and I think at the time it was not only more coding work but also potentially harmful to multinode performance.

(ii) I don't know if Colvars needs to "introspect" the forces from TCL forces. In other words, I don't know if this is what Colvars want;

I don't think we know what each person using Colvars really wants, but I agree that that's a very unlikely use case.

@giacomofiorin Could you look into the issues above?

Well, you did that just hours after

@giacomofiorin giacomofiorin added the bugfix label Dec 5, 2025
@giacomofiorin
Member

Editing the PR's title and description for merging. Will close #891

@giacomofiorin giacomofiorin changed the title from "Remove unused class member" to "Fix multiple (re)initialization issues in NAMD" on Dec 5, 2025
@HanatoK
Member

HanatoK commented Dec 5, 2025

I would like to have a base-class function for the atoms map, because it's not a NAMD-specific issue. LAMMPS and GROMACS are not using it yet, but simply because they lack this optimization.

I didn't use atoms_map in the CUDAGM implementation either. I think it might be a workaround for the case where TCLForces and Colvars both use total forces, and Colvars needs the atoms_map to filter out the forces that are only required by other GM clients. But maybe I am wrong; I still don't fully understand the design and some of the historical GlobalMaster code.

Ask the other folks at UIUC (including Jim), and you will find no fans of GlobalMaster's multiplexed design. But the alternative would have been multiple rounds of reduction/broadcast, and I think at the time it was not only more coding work but also potentially harmful to multinode performance.

I have tried to solve the issue and revise the NAMD patch that I posted here in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486. Could you take a look at that NAMD MR to see if Colvars really needs it to work correctly?

@giacomofiorin
Member

I didn't use atoms_map in the CUDAGM implementation either. I think it might be a workaround for the case where TCLForces and Colvars both use total forces, and Colvars needs the atoms_map to filter out the forces that are only required by other GM clients. But maybe I am wrong; I still don't fully understand the design and some of the historical GlobalMaster code.

It's not just because of the multiple GlobalMaster objects, but also because the messages received by GlobalMasterServer are not in order, either by task ID or by atom. Typical solutions to this would be sorting the atom buffer to match the order that Colvars expects, or using a lookup table. I am less familiar with the CUDAGM code in the NAMD repo than with the interface code that you have added here in the Colvars repo, but it seems that you are effectively sorting the atomic coordinates as you gather them?

I should also have been clearer on the other backends:

  • The LAMMPS interface is not using atoms_map, but it has its own lookup table based on internal LAMMPS code. (What it doesn't do, compared to NAMD/GlobalMaster, is skip the MPI ranks that have no Colvars atoms.)
  • GROMACS currently shares the entire system with Colvars (no need to sort anything), but it could use more optimization there: even with a GPU-resident interface for Colvars, which currently just isn't there, GROMACS cannot run a MARTINI system in GPU-resident mode (even one large enough for it to make sense). I don't know when that would change... People are still waiting for non-bonded tabulated potentials to be added back.

@giacomofiorin
Member

giacomofiorin commented Dec 5, 2025

I have tried to solve the issue and revise the NAMD patch that I posted here in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486. Could you take a look at that NAMD MR to see if Colvars really needs it to work correctly?

Doing just that. I think it mostly depends on what the other NAMD folks will say. But if we can assume that the only atoms and forces that GlobalMasterColvars sees are the ones it requested, that would be an improvement.

@giacomofiorin
Member

I have tried to solve the issue and revise the NAMD patch that I posted here in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486. Could you take a look at that NAMD MR to see if Colvars really needs it to work correctly?

Given this open MR, should we maybe wait a few days to merge this? Or merge it now (after including patches here to reproduce the NAMD changes that you're already submitting in that MR)?

@HanatoK
Member

HanatoK commented Dec 5, 2025

I have tried to solve the issue and revise the NAMD patch that I posted here in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486. Could you take a look at that NAMD MR to see if Colvars really needs it to work correctly?

Given this open MR, should we maybe wait a few days to merge this? Or merge it now (after including patches here to reproduce the NAMD changes that you're already submitting in that MR)?

This PR currently doesn't completely solve the crash issue. Even after this one and https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486 get merged, I still have no idea how to re-initialize the atoms_map in Colvars.

@giacomofiorin
Member

This PR currently doesn't completely solve the crash issue. Even after this one and https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486 get merged, I still have no idea how to re-initialize the atoms_map in Colvars.

You mean that there currently is no callback to Colvars to state that a new structure command has been loaded? If so, we should design a test that catches that condition better than the current 003_reinitatoms

(I can currently run that test successfully on my local machine even without the changes in your NAMD MR...)

@HanatoK
Member

HanatoK commented Dec 5, 2025

You mean that there currently is no callback to Colvars to state that a new structure command has been loaded? If so, we should design a test that catches that condition better than the current 003_reinitatoms

I'm not sure. Maybe you can make any one of add_config, engine_ready, and parse_module_config virtual and re-implement it to call init_atoms_map before calling the corresponding base-class function.
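
For instance, roughly like this (a minimal sketch under the assumption that one of those functions is made virtual; the classes and signatures below are made up for illustration and are not the actual Colvars interface):

#include <string>

// Illustrative sketch only: the names mirror the discussion, but this is
// not the real colvarproxy/colvarproxy_namd code.
class colvarproxy_base {
public:
  virtual ~colvarproxy_base() {}
  // Hypothetically made virtual so that backends can hook into it:
  virtual int parse_module_config(std::string const & /* conf */)
  {
    // ... parse the configuration and (re)create atom groups ...
    return 0;
  }
};

class colvarproxy_namd_sketch : public colvarproxy_base {
public:
  int parse_module_config(std::string const &conf) override
  {
    init_atoms_map();  // restore the map that `cv reset` cleared
    return colvarproxy_base::parse_module_config(conf);
  }
private:
  void init_atoms_map() { /* refill atoms_map with -1 */ }
};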

(I can currently run that test successfully on my local machine even without the changes in your NAMD MR...)

It does run with the default compilation flags with optimizations, but doesn't work if you use CXXOPTS = -g in your NAMD Make.config.

@giacomofiorin
Member

I'm not sure. Maybe you can make any one of add_config, engine_ready, and parse_module_config virtual and re-implement it to call init_atoms_map before calling the corresponding base-class function.

Yeah, that's definitely required, I'll do that.

The flip side of it is catching the use case when the system's topology changes while Colvars is still defined. That may need a change in NAMD.

@jhenin
Member Author

jhenin commented Dec 7, 2025

The fix looks good! I wish I had seen the bug... Do you think a variant of the reinitatoms test could trigger the issue reliably?
Otherwise I'm fine with merging, but we won't have a regression test for this.

@giacomofiorin
Member

The fix looks good! I wish I had seen the bug... Do you think a variant of the reinitatoms test could trigger the issue reliably? Otherwise I'm fine with merging, but we won't have a regression test for this.

I have no earthly idea! I couldn't see the bug locally either, I just took @HanatoK's word for it...

@HanatoK
Member

HanatoK commented Dec 7, 2025

The fix looks good! I wish I had seen the bug... Do you think a variant of the reinitatoms test could trigger the issue reliably? Otherwise I'm fine with merging, but we won't have a regression test for this.

I have no earthly idea! I couldn't see the bug locally either, I just took @HanatoK's word for it...

Let me try to push a commit to force the boundary check and call cvm::error.

@HanatoK
Member

HanatoK commented Dec 7, 2025

Hi @jhenin and @giacomofiorin! Could you try bc27da1 locally? The ARM64 build currently skips the 003_reinitatoms test, possibly due to the same crash.

@jhenin
Member Author

jhenin commented Dec 8, 2025

Indeed! This does reveal the crash, and therefore the fix. Now I wish I had an ARM64 test box to investigate - do you?

@HanatoK
Member

HanatoK commented Dec 8, 2025

Indeed! This does reveal the crash, and therefore the fix. Now I wish I had an ARM64 test box to investigate - do you?

I have one, but the issue now is that on ARM64, the total force is nan on the first step after the re-initialization (see https://github.com/Colvars/colvars/actions/runs/20033481534/job/57448548555#step:18:28), while it is -nan on x86-64, so the spiff comparison fails. I have no idea how to fix this, so I have to disable the ARM64 test again.
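
As a side note on the nan vs -nan mismatch: the two differ only in the sign bit, so a comparison tool could treat them as equal by testing both sides for NaN before comparing values. A minimal sketch (not part of spiff or Colvars):

#include <cmath>

// Treat "nan" and "-nan" as matching, otherwise compare within a tolerance.
bool values_match(double a, double b, double tol)
{
  if (std::isnan(a) && std::isnan(b)) {
    return true;
  }
  return std::fabs(a - b) <= tol;
}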

giacomofiorin and others added 22 commits December 12, 2025 14:38
Now that we are overriding the GlobalMaster class functions, we can pass protected members
through a const accessor
This is just to trigger the bug on CI. You can revert this commit later
after solving the issue.
This might solve the resetting and reloading crash in the NAMD
interface.
Currently spiff has no way to ignore the comparison of nan and -nan.
invalid

This commit simplifies the logic of resetting the total forces by
just checking if they are available before accumulating them in
collect_cvc_total_forces. This can ensure that ft_reported is not reset
after calc_colvar_properties. This commit also simplifies the GPU code
path by just reusing the check in collect_cvc_total_forces.
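
Roughly, the idea described in this commit message is the following (an illustrative sketch with made-up names, not the actual Colvars functions):

#include <vector>

struct rvec { double x, y, z; };

// Sketch of "check availability before accumulating": only add the atomic
// total forces to the CVC's total force when the engine has reported them
// for this step; otherwise leave the accumulator untouched, so that the
// "reported" flag is never set and never needs to be reset afterwards.
bool accumulate_total_forces(std::vector<rvec> const &atom_total_forces,
                             bool forces_available, rvec &ft)
{
  if (!forces_available) {
    return false;  // e.g. first step after a reset or topology change
  }
  for (rvec const &f : atom_total_forces) {
    ft.x += f.x;
    ft.y += f.y;
    ft.z += f.z;
  }
  return true;
}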
@HanatoK HanatoK force-pushed the remove_unused_member branch from 55a4ffe to 1e3d569 on December 12, 2025 20:48
HanatoK added a commit to HanatoK/colvars that referenced this pull request Dec 12, 2025
@HanatoK
Member

HanatoK commented Dec 12, 2025

Just rebased this PR against the master branch, since #881 has just been merged.

HanatoK added a commit to HanatoK/colvars that referenced this pull request Dec 12, 2025
void colvarproxy::set_total_forces_invalid()
{
std::fill(atoms_total_forces.begin(), atoms_total_forces.end(), cvm::rvector(0.0, 0.0, 0.0));
std::fill(atom_groups_total_forces.begin(), atom_groups_total_forces.end(),
Member

I'm not sure if it's necessary to clear the buffers. It looks fine even if the GPU buffers are not cleared.

Member Author

Do you expect a performance gain by not clearing them? Is there a risk that it would become necessary due to upstream changes?

Member

Either clearing or not is OK. I believe set_total_forces_invalid is not called very frequently, so performance is not an issue.

@HanatoK HanatoK merged commit 80bd6e3 into master Dec 17, 2025
16 checks passed

Labels

bugfix, NAMD

Development

Successfully merging this pull request may close these issues:

Computation of total forces should be skipped after the system's topology was redefined

4 participants