manuscript/manuscript.tex (+7 −9)
@@ -294,19 +294,17 @@ \subsection{Summary}
 \subsection{Modeling large systems}
 When estimating MSMs for large systems, challenges may arise that are mostly system dependent.
-A case in point is the curse of dimensionality:
-it is hard to discretize a high dimensional feature space, not only because it is computationally demanding.
-More important, Euclidean distances become less meaningful with increasing dimensionality~\cite{aggarwal_surprising_2001} and, thus, cluster assignment based on that norm are prone to yield poor discretizations.
-Especially for large systems, it is thus particularly important for first find a suitable set of features, and to further apply dimension reduction techniques (e.g. TICA, VAMP, if applicable) to obtain a low-dimensional representation of the slow dynamics.
-Hidden Markov models might further mitigate poor discretization to a certain extend~\cite{noe-proj-hid-msm}.
+A case in point is the curse of dimensionality:~it is difficult to discretize a high dimensional feature space. Not only is this computationally demanding; more importantly, Euclidean distances become less meaningful with increasing dimensionality~\cite{aggarwal_surprising_2001} and thus cluster assignments based on that norm may yield a poor discretization.
+Especially for large systems, it is thus particularly important to first find a suitable set of features, and to further apply dimensionality reduction techniques (e.g.~TICA, VAMP, if applicable) to obtain a low dimensional representation of the slow dynamics.
+Hidden Markov models (HMMs) might further mitigate poor discretization to a certain extent~\cite{noe-proj-hid-msm}.
 
-In addition, the slowest process in a system as identified by an MSM or HMM might not be the one a modeler is interested in.
+Furthermore, the slowest process in a system as identified by an MSM or HMM might not be the one a modeler is interested in~\cite{banushkina_nonparametric_2015}.
 For instance, the slowest process might correspond to a biologically irrelevant side chain flip that only occurred once in the data set.
-This problem can often be mitigated by choosing a more specific set of features.
+This problem may be mitigated by choosing a more specific set of features.
 
-The technical challenges with large systems are usually high demands in memory and computation time; we explain how to deal with those in the tutorials.
+Additional technical challenges for large systems include high demands on memory and computation time; we explain how to deal with those in the tutorials.
 
-More details on how to model complex systems with the techniques presented here are described e.g.by~\cite{plattner_protein_2015,plattner_complete_2017}.
+More details on how to model complex systems with the techniques presented here are described e.g.~by~\cite{plattner_protein_2015,plattner_complete_2017}.
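The distance-concentration effect that the rewritten paragraph cites from~\cite{aggarwal_surprising_2001} can be illustrated in a few lines. This is a toy sketch, not part of the manuscript; the point count, coordinate range, and function name are our own choices:

```python
import random

def spread_ratio(dim, n_points=200, seed=0):
    """Ratio of farthest to nearest distance-to-origin among random points.

    As dim grows, this ratio approaches 1: all points become nearly
    equidistant, so Euclidean cluster assignments lose contrast.
    """
    rng = random.Random(seed)
    dists = [sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(dim)) ** 0.5
             for _ in range(n_points)]
    return max(dists) / min(dists)
```

In low dimensions the ratio is large (nearest and farthest points are clearly distinct); in high dimensions it collapses towards 1, which is why reducing dimensionality (e.g. with TICA or VAMP) before clustering pays off.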
notebooks/02-dimension-reduction-and-discretization.ipynb (+1 −1)
@@ -764,7 +764,7 @@
 "\n",
 "The first goal is thus to map the data to a reasonable number of dimensions, e.g. with a smart choice of features and/or by using TICA. Large systems often require significant parts of the kinetic variance to be discarded in order to obtain a balance between capturing as much of the kinetic variance as possible and achieving a reasonable discretization.\n",
 "\n",
-"Another point about discretization algorithms is that one should bear in mind the distribution of density. The $k$-means algorithm conserves density, i.e. data sets that incorporate regions of extremely high density as well as poorly sampled regions might be problematic, especially in high dimensions. For those cases, a regular spatial clustering might be worth a try. \n",
+"Another point about discretization algorithms is that one should bear in mind the distribution of density. The $k$-means algorithm tends to conserve density, i.e. data sets that incorporate regions of extremely high density as well as poorly sampled regions might be problematic, especially in high dimensions. For those cases, a regular spatial clustering might be worth a try. \n",
 "\n",
 "More details on problematic data situations and how to cope with them are explained in [Notebook 08 📓](08-common-problems.ipynb).\n",
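The "regular spatial clustering" alternative mentioned in the changed cell can be sketched as a greedy center picker. This is a 1-D toy version of the idea (the function name and data are ours), not the notebook's actual clustering code:

```python
def regspace_centers(data, dmin):
    """Greedy regular-space clustering, 1-D toy version.

    A point becomes a new center only if it lies farther than dmin from
    every existing center, so centers tile the sampled region roughly
    uniformly instead of piling up in high-density basins the way
    k-means centers tend to do.
    """
    centers = []
    for x in data:
        if all(abs(x - c) > dmin for c in centers):
            centers.append(x)
    return centers
```

For example, `regspace_centers([0.0, 0.01, 0.02, 0.03, 1.0, 2.0], dmin=0.5)` keeps a single center for the dense region near zero and one center for each sparsely sampled outlier.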
notebooks/03-msm-estimation-and-validation.ipynb (+3 −1)
@@ -242,7 +242,9 @@
 "\n",
 "Before we continue with MSM estimation, let us discuss implied timescales convergence for large systems. Given sufficient sampling, the task is often to find a discretization that captures the process of interest well enough to obtain implied timescales that converge within the trajectory length. \n",
 "\n",
-"As we see in the above example with $k=20$ centers, increasing the lag time compensates for poor discretization to a certain extent. In a more realistic system, however, trajectories have a finite length that limits the choice of our lag time. Furthermore, our clustering might be worse than the one presented above, so convergence might not be reached at all. Thus, we aim to converge the implied timescales at a low lag time by fine-tuning not only the number of cluster centers, but also feature selection and dimension reduction measures. This additionally ensures that our model has the maximum achievable temporal resolution.\n",
+"As we see in the above example with $k=20$ cluster centers, increasing the MSM lag time compensates for poor discretization to a certain extent. In a more realistic system, however, trajectories have a finite length that limits the choice of our MSM lag time. Furthermore, our clustering might be worse than the one presented above, so convergence might not be reached at all. Thus, we aim to converge the implied timescales at a low lag time by fine-tuning not only the number of cluster centers, but also feature selection and dimension reduction measures. This additionally ensures that our model has the maximum achievable temporal resolution.\n",
+"\n",
+"Please note that choosing an appropriate MSM lag time variationally (e.g. using VAMP scoring) is, as far as we know, not possible.\n",
 "\n",
 "Further details on how to account for poor discretization can be found in our notebook about hidden Markov models [Notebook 07 📓](07-hidden-markov-state-models.ipynb). An example on how implied timescales behave in the limit of poor sampling is shown in [Notebook 08 📓](08-common-problems.ipynb).\n",
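The implied-timescale convergence discussed in this cell rests on $t_i(\tau) = -\tau / \ln \lambda_i(\tau)$. On a toy two-state chain, which is Markovian by construction, the implied timescale is exactly lag-independent; this sketch (transition probabilities are made up for illustration) verifies that:

```python
import math

def matpow2(T, k):
    """k-th power of a 2x2 matrix by repeated multiplication."""
    R = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(k):
        R = [[sum(R[i][m] * T[m][j] for m in range(2)) for j in range(2)]
             for i in range(2)]
    return R

def implied_timescale(T, lag):
    """t(tau) = -tau / ln(lambda_2) at the given lag.

    For a 2x2 row-stochastic matrix the non-unit eigenvalue
    is simply trace - 1.
    """
    Tk = matpow2(T, lag)          # transition matrix at the longer lag
    lam = Tk[0][0] + Tk[1][1] - 1.0
    return -lag / math.log(lam)
```

For a poorly discretized MSM of real data, by contrast, $t(\tau)$ only levels off as the lag grows, which is exactly what the implied-timescale plots in the notebook ask us to check.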
notebooks/08-common-problems.ipynb (+3 −3)
@@ -414,7 +414,7 @@
 "source": [
 "As we see, the requested timescales above 4 steps could not be computed because the underlying HMM is disconnected, i.e. the corresponding timescales are infinity. The implied timescales that could be computed are most likely the same process that we observed from the fine clustering before, i.e. jumps within one basin.\n",
 "\n",
-"In general, it is a non-trivial problem to show that processes were not sampled reversibly. In our experience, HMMs are a good choice here, even though situations can occur where they might not detect the problem as easily as here. \n",
+"In general, it is a non-trivial problem to show that processes were not sampled reversibly. In our experience, HMMs are a good choice here, even though situations can occur where they might not detect the problem as easily as in this example. \n",
 "\n",
 "<a id=\"poorly_sampled_dw\"></a>\n",
 "### poorly sampled double-well trajectories\n",
@@ -485,7 +485,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We note that the slowest process is clearly contained in the data chunks and is reversibly sampled (left panel, short trajectory pieces color coded and stacked). Due to very short trajectories, we find that this process can only be captured at a very low lag time (right panel). Above that interval, the slowest timescale diverges. Luckily, here we know that it is already converged at $\\tau = 1$, so we estimate an MSM:"
+"We note that the slowest process is clearly contained in the data chunks and is reversibly sampled (left panel, short trajectory pieces color coded and stacked). Due to very short trajectories, we find that this process can only be captured at a very short MSM lag time (right panel). Above that interval, the slowest timescale diverges. Luckily, here we know that it is already converged at $\\tau = 1$, so we estimate an MSM:"
 ]
 },
 {
@@ -519,7 +519,7 @@
 "source": [
 "As already discussed, we cannot expect new estimates above a certain lag time to agree with the model prediction due to too short trajectories. Indeed, we find that new estimates and model predictions diverge at very high lag times. This does not necessarily mean that the model at $\\tau=1$ is wrong and in this particular case, we can even explain the divergence and find that it fits to the implied timescales divergence. \n",
 "\n",
-"This example mirrors another incarnation of the sampling problem: Working with large systems, we often have comparably short trajectories with few rare events. Thus, implied timescales convergence can often be achieved only in a certain interval and CK-tests will not convergence up to arbitrary multiples of the lag time. It is the responsibility of the modeler to interpret these results and to ensure that a valid model can be obtained from the data.\n",
+"This example mirrors another incarnation of the sampling problem: Working with large systems, we often have comparably short trajectories with few rare events. Thus, implied timescales convergence can often be achieved only in a certain interval and CK-tests will not converge up to arbitrary multiples of the lag time. It is the responsibility of the modeler to interpret these results and to ensure that a valid model can be obtained from the data.\n",
 "\n",
 "Please note that this is only a special case of a failed CK test. More general information about CK tests and what it means if it fails are explained in [Notebook 03 📓](03-msm-estimation-and-validation.ipynb).\n",
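The CK test referenced here compares the model propagated to longer times, $T(\tau)^k$, against a fresh estimate $T(k\tau)$. A minimal sketch on synthetic two-state data (function names, parameters, and the test matrix are our own illustration, not the notebook's code):

```python
import random

def simulate(T, n_steps, seed=42):
    """Sample a discrete trajectory from a two-state Markov chain
    with row-stochastic transition matrix T."""
    rng = random.Random(seed)
    state, traj = 0, [0]
    for _ in range(n_steps):
        state = 0 if rng.random() < T[state][0] else 1
        traj.append(state)
    return traj

def estimate(traj, lag):
    """Maximum-likelihood (non-reversible) 2x2 transition matrix at `lag`."""
    C = [[0, 0], [0, 0]]
    for i, j in zip(traj[:-lag], traj[lag:]):
        C[i][j] += 1
    # assumes both states were visited, fine for this synthetic example
    return [[C[i][j] / sum(C[i]) for j in range(2)] for i in range(2)]

def matpow2(T, k):
    """k-th power of a 2x2 matrix by repeated multiplication."""
    R = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(k):
        R = [[sum(R[i][m] * T[m][j] for m in range(2)) for j in range(2)]
             for i in range(2)]
    return R

def ck_deviation(traj, k):
    """Max elementwise |T(1)^k - T(k)|: small when the chain is Markovian
    at lag 1 and the lag-k estimate is statistically converged."""
    pred = matpow2(estimate(traj, 1), k)
    est = estimate(traj, k)
    return max(abs(pred[i][j] - est[i][j]) for i in range(2) for j in range(2))
```

With long trajectories the deviation is dominated by statistical noise and stays small; with trajectories much shorter than the slowest timescale, the lag-$k$ estimate degrades and the test fails at large $k$, mirroring the failure mode discussed above.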