manuscript/manuscript.tex (+7 −9)
@@ -294,19 +294,17 @@ \subsection{Summary}
 \subsection{Modeling large systems}
 When estimating MSMs for large systems, challenges may arise that are mostly system dependent.
-A case in point is the curse of dimensionality:
-it is hard to discretize a high dimensional feature space, not only because it is computationally demanding.
-More important, Euclidean distances become less meaningful with increasing dimensionality~\cite{aggarwal_surprising_2001} and, thus, cluster assignment based on that norm are prone to yield poor discretizations.
-Especially for large systems, it is thus particularly important for first find a suitable set of features, and to further apply dimension reduction techniques (e.g. TICA, VAMP, if applicable) to obtain a low-dimensional representation of the slow dynamics.
-Hidden Markov models might further mitigate poor discretization to a certain extend~\cite{noe-proj-hid-msm}.
+A case in point is the curse of dimensionality:~it is difficult to discretize a high dimensional feature space. Not only is this computationally demanding; more importantly, Euclidean distances become less meaningful with increasing dimensionality~\cite{aggarwal_surprising_2001} and thus cluster assignments based on that norm may yield a poor discretization.
+Especially for large systems, it is thus particularly important to first find a suitable set of features, and to further apply dimensionality reduction techniques (e.g.~TICA, VAMP, if applicable) to obtain a low dimensional representation of the slow dynamics.
+Hidden Markov models (HMMs) might further mitigate poor discretization to a certain extent~\cite{noe-proj-hid-msm}.
 
-In addition, the slowest process in a system as identified by an MSM or HMM might not be the one a modeler is interested in.
+Furthermore, the slowest process in a system as identified by an MSM or HMM might not be the one a modeler is interested in~\cite{banushkina_nonparametric_2015}.
 For instance, the slowest process might correspond to a biologically irrelevant side chain flip that only occurred once in the data set.
-This problem can often be mitigated by choosing a more specific set of features.
+This problem may be mitigated by choosing a more specific set of features.
 
-The technical challenges with large systems are usually high demands in memory and computation time; we explain how to deal with those in the tutorials.
+Additional technical challenges for large systems include high demands on memory and computation time; we explain how to deal with those in the tutorials.
 
-More details on how to model complex systems with the techniques presented here are described e.g.by~\cite{plattner_protein_2015,plattner_complete_2017}.
+More details on how to model complex systems with the techniques presented here are described e.g.~by~\cite{plattner_protein_2015,plattner_complete_2017}.
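The distance-concentration effect that the rewritten paragraph cites from~\cite{aggarwal_surprising_2001} can be illustrated in a few lines. This is a toy sketch, not part of the manuscript; the point count, coordinate range, and function name are our own choices:

```python
import random

def spread_ratio(dim, n_points=200, seed=0):
    """Ratio of farthest to nearest distance-to-origin among random points.

    As dim grows, this ratio approaches 1: all points become nearly
    equidistant, so Euclidean cluster assignments lose contrast.
    """
    rng = random.Random(seed)
    dists = [sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(dim)) ** 0.5
             for _ in range(n_points)]
    return max(dists) / min(dists)
```

In low dimensions the ratio is large (nearest and farthest points are clearly distinct); in high dimensions it collapses towards 1, which is why reducing dimensionality (e.g. with TICA or VAMP) before clustering pays off.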
notebooks/02-dimension-reduction-and-discretization.ipynb (+1 −1)
@@ -764,7 +764,7 @@
 "\n",
 "The first goal is thus to map the data to a reasonable number of dimensions, e.g. with a smart choice of features and/or by using TICA. Large systems often require significant parts of the kinetic variance to be discarded in order to obtain a balance between capturing as much of the kinetic variance as possible and achieving a reasonable discretization.\n",
 "\n",
-"Another point about discretization algorithms is that one should bear in mind the distribution of density. The $k$-means algorithm conserves density, i.e. data sets that incorporate regions of extremely high density as well as poorly sampled regions might be problematic, especially in high dimensions. For those cases, a regular spatial clustering might be worth a try. \n",
+"Another point about discretization algorithms is that one should bear in mind the distribution of density. The $k$-means algorithm tends to conserve density, i.e. data sets that incorporate regions of extremely high density as well as poorly sampled regions might be problematic, especially in high dimensions. For those cases, a regular spatial clustering might be worth a try. \n",
 "\n",
 "More details on problematic data situations and how to cope with them are explained in [Notebook 08 📓](08-common-problems.ipynb).\n",
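The "regular spatial clustering" alternative mentioned in the changed cell can be sketched as a greedy center picker. This is a 1-D toy version of the idea (the function name and data are ours), not the notebook's actual clustering code:

```python
def regspace_centers(data, dmin):
    """Greedy regular-space clustering, 1-D toy version.

    A point becomes a new center only if it lies farther than dmin from
    every existing center, so centers tile the sampled region roughly
    uniformly instead of piling up in high-density basins the way
    k-means centers tend to do.
    """
    centers = []
    for x in data:
        if all(abs(x - c) > dmin for c in centers):
            centers.append(x)
    return centers
```

For example, `regspace_centers([0.0, 0.01, 0.02, 0.03, 1.0, 2.0], dmin=0.5)` keeps a single center for the dense region near zero and one center for each sparsely sampled outlier.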
notebooks/03-msm-estimation-and-validation.ipynb (+3 −1)
@@ -242,7 +242,9 @@
 "\n",
 "Before we continue with MSM estimation, let us discuss implied timescales convergence for large systems. Given sufficient sampling, the task is often to find a discretization that captures the process of interest well enough to obtain implied timescales that converge within the trajectory length. \n",
 "\n",
-"As we see in the above example with $k=20$ centers, increasing the lag time compensates for poor discretization to a certain extent. In a more realistic system, however, trajectories have a finite length that limits the choice of our lag time. Furthermore, our clustering might be worse than the one presented above, so convergence might not be reached at all. Thus, we aim to converge the implied timescales at a low lag time by fine-tuning not only the number of cluster centers, but also feature selection and dimension reduction measures. This additionally ensures that our model has the maximum achievable temporal resolution.\n",
+"As we see in the above example with $k=20$ cluster centers, increasing the MSM lag time compensates for poor discretization to a certain extent. In a more realistic system, however, trajectories have a finite length that limits the choice of our MSM lag time. Furthermore, our clustering might be worse than the one presented above, so convergence might not be reached at all. Thus, we aim to converge the implied timescales at a low lag time by fine-tuning not only the number of cluster centers, but also feature selection and dimension reduction measures. This additionally ensures that our model has the maximum achievable temporal resolution.\n",
+"\n",
+"Please note that choosing an appropriate MSM lag time variationally (e.g. using VAMP scoring) is, as far as we know, not possible.\n",
 "\n",
 "Further details on how to account for poor discretization can be found in our notebook about hidden Markov models [Notebook 07 📓](07-hidden-markov-state-models.ipynb). An example on how implied timescales behave in the limit of poor sampling is shown in [Notebook 08 📓](08-common-problems.ipynb).\n",
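The implied-timescale convergence discussed in this cell rests on $t_i(\tau) = -\tau / \ln \lambda_i(\tau)$. On a toy two-state chain, which is Markovian by construction, the implied timescale is exactly lag-independent; this sketch (transition probabilities are made up for illustration) verifies that:

```python
import math

def matpow2(T, k):
    """k-th power of a 2x2 matrix by repeated multiplication."""
    R = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(k):
        R = [[sum(R[i][m] * T[m][j] for m in range(2)) for j in range(2)]
             for i in range(2)]
    return R

def implied_timescale(T, lag):
    """t(tau) = -tau / ln(lambda_2) at the given lag.

    For a 2x2 row-stochastic matrix the non-unit eigenvalue
    is simply trace - 1.
    """
    Tk = matpow2(T, lag)          # transition matrix at the longer lag
    lam = Tk[0][0] + Tk[1][1] - 1.0
    return -lag / math.log(lam)
```

For a poorly discretized MSM of real data, by contrast, $t(\tau)$ only levels off as the lag grows, which is exactly what the implied-timescale plots in the notebook ask us to check.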
notebooks/08-common-problems.ipynb (+3 −3)
@@ -414,7 +414,7 @@
 "source": [
 "As we see, the requested timescales above 4 steps could not be computed because the underlying HMM is disconnected, i.e. the corresponding timescales are infinity. The implied timescales that could be computed are most likely the same process that we observed from the fine clustering before, i.e. jumps within one basin.\n",
 "\n",
-"In general, it is a non-trivial problem to show that processes were not sampled reversibly. In our experience, HMMs are a good choice here, even though situations can occur where they might not detect the problem as easily as here. \n",
+"In general, it is a non-trivial problem to show that processes were not sampled reversibly. In our experience, HMMs are a good choice here, even though situations can occur where they might not detect the problem as easily as in this example. \n",
 "\n",
 "<a id=\"poorly_sampled_dw\"></a>\n",
 "### poorly sampled double-well trajectories\n",
@@ -485,7 +485,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We note that the slowest process is clearly contained in the data chunks and is reversibly sampled (left panel, short trajectory pieces color coded and stacked). Due to very short trajectories, we find that this process can only be captured at a very low lag time (right panel). Above that interval, the slowest timescale diverges. Luckily, here we know that it is already converged at $\\tau = 1$, so we estimate an MSM:"
+"We note that the slowest process is clearly contained in the data chunks and is reversibly sampled (left panel, short trajectory pieces color coded and stacked). Due to very short trajectories, we find that this process can only be captured at a very short MSM lag time (right panel). Above that interval, the slowest timescale diverges. Luckily, here we know that it is already converged at $\\tau = 1$, so we estimate an MSM:"
 ]
 },
 {
@@ -519,7 +519,7 @@
 "source": [
 "As already discussed, we cannot expect new estimates above a certain lag time to agree with the model prediction due to too short trajectories. Indeed, we find that new estimates and model predictions diverge at very high lag times. This does not necessarily mean that the model at $\\tau=1$ is wrong and in this particular case, we can even explain the divergence and find that it fits to the implied timescales divergence. \n",
 "\n",
-"This example mirrors another incarnation of the sampling problem: Working with large systems, we often have comparably short trajectories with few rare events. Thus, implied timescales convergence can often be achieved only in a certain interval and CK-tests will not convergence up to arbitrary multiples of the lag time. It is the responsibility of the modeler to interpret these results and to ensure that a valid model can be obtained from the data.\n",
+"This example mirrors another incarnation of the sampling problem: Working with large systems, we often have comparably short trajectories with few rare events. Thus, implied timescales convergence can often be achieved only in a certain interval and CK-tests will not converge up to arbitrary multiples of the lag time. It is the responsibility of the modeler to interpret these results and to ensure that a valid model can be obtained from the data.\n",
 "\n",
 "Please note that this is only a special case of a failed CK test. More general information about CK tests and what it means if it fails are explained in [Notebook 03 📓](03-msm-estimation-and-validation.ipynb).\n",
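The CK test referenced here compares the model propagated to longer times, $T(\tau)^k$, against a fresh estimate $T(k\tau)$. A minimal sketch on synthetic two-state data (function names, parameters, and the test matrix are our own illustration, not the notebook's code):

```python
import random

def simulate(T, n_steps, seed=42):
    """Sample a discrete trajectory from a two-state Markov chain
    with row-stochastic transition matrix T."""
    rng = random.Random(seed)
    state, traj = 0, [0]
    for _ in range(n_steps):
        state = 0 if rng.random() < T[state][0] else 1
        traj.append(state)
    return traj

def estimate(traj, lag):
    """Maximum-likelihood (non-reversible) 2x2 transition matrix at `lag`."""
    C = [[0, 0], [0, 0]]
    for i, j in zip(traj[:-lag], traj[lag:]):
        C[i][j] += 1
    # assumes both states were visited, fine for this synthetic example
    return [[C[i][j] / sum(C[i]) for j in range(2)] for i in range(2)]

def matpow2(T, k):
    """k-th power of a 2x2 matrix by repeated multiplication."""
    R = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(k):
        R = [[sum(R[i][m] * T[m][j] for m in range(2)) for j in range(2)]
             for i in range(2)]
    return R

def ck_deviation(traj, k):
    """Max elementwise |T(1)^k - T(k)|: small when the chain is Markovian
    at lag 1 and the lag-k estimate is statistically converged."""
    pred = matpow2(estimate(traj, 1), k)
    est = estimate(traj, k)
    return max(abs(pred[i][j] - est[i][j]) for i in range(2) for j in range(2))
```

With long trajectories the deviation is dominated by statistical noise and stays small; with trajectories much shorter than the slowest timescale, the lag-$k$ estimate degrades and the test fails at large $k$, mirroring the failure mode discussed above.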