
Commit 2f87730

Merge pull request #152 from markovmodel/revision-th
Revision TH
2 parents fbd695f + 422276d commit 2f87730

6 files changed: +298 −19 lines changed

manuscript/literature.bib

+47
@@ -630,3 +630,50 @@ @article{pcca++
 URL = {https://doi.org/10.1007/s11634-013-0134-6},
 DOI = {10.1007/s11634-013-0134-6}
 }
+@article{plattner_protein_2015,
+title = {Protein conformational plasticity and complex ligand-binding kinetics explored by atomistic simulations and {Markov} models},
+volume = {6},
+url = {http://www.nature.com/ncomms/2015/150702/ncomms8653/full/ncomms8653.html},
+doi = {10.1038/ncomms8653},
+journal = {Nat. Commun.},
+author = {Plattner, Nuria and Noé, Frank},
+year = {2015},
+pages = {7653}
+}
+@article{plattner_complete_2017,
+title = {Complete protein–protein association kinetics in atomic detail revealed by molecular dynamics simulations and {Markov} modelling},
+volume = {9},
+issn = {1755-4349},
+url = {https://www.nature.com/articles/nchem.2785},
+doi = {10.1038/nchem.2785},
+number = {10},
+journal = {Nat. Chem.},
+author = {Plattner, Nuria and Doerr, Stefan and Fabritiis, Gianni De and Noé, Frank},
+month = oct,
+year = {2017},
+pages = {1005}
+}
+@inproceedings{aggarwal_surprising_2001,
+series = {Lecture {Notes} in {Computer} {Science}},
+title = {On the {Surprising} {Behavior} of {Distance} {Metrics} in {High} {Dimensional} {Space}},
+isbn = {978-3-540-44503-6},
+booktitle = {Database {Theory} — {ICDT} 2001},
+publisher = {Springer Berlin Heidelberg},
+author = {Aggarwal, Charu C. and Hinneburg, Alexander and Keim, Daniel A.},
+editor = {Van den Bussche, Jan and Vianu, Victor},
+year = {2001},
+pages = {420--434},
+}
+@article{banushkina_nonparametric_2015,
+title = {Nonparametric variational optimization of reaction coordinates},
+volume = {143},
+issn = {0021-9606},
+url = {https://aip.scitation.org/doi/10.1063/1.4935180},
+doi = {10.1063/1.4935180},
+number = {18},
+journal = {J. Chem. Phys.},
+author = {Banushkina, Polina V. and Krivov, Sergei V.},
+month = nov,
+year = {2015},
+pages = {184108}
+}

manuscript/manuscript.tex

+15
@@ -410,6 +410,21 @@ \subsection{Summary}
 For the full analysis, please refer to the first notebook (00).
 All notebooks as well as detailed installation instructions are available on \githubrepository{}.
 
+\subsection{Modeling large systems}
+When estimating MSMs for large systems, challenges may arise that are mostly system-dependent.
+
+A case in point is the curse of dimensionality:~discretizing a high-dimensional feature space is not only computationally demanding; more importantly, Euclidean distances become less meaningful with increasing dimensionality~\cite{aggarwal_surprising_2001}, and thus cluster assignments based on that norm may yield a poor discretization.
+For large systems it is thus particularly important to first find a suitable set of features and then to apply dimensionality reduction techniques (e.g.~TICA or VAMP, if applicable) to obtain a low-dimensional representation of the slow dynamics.
+Hidden Markov models (HMMs) might further mitigate a poor discretization to a certain extent~\cite{noe-proj-hid-msm}.
+
+Furthermore, the slowest process in a system as identified by an MSM or HMM might not be the one a modeler is interested in~\cite{banushkina_nonparametric_2015}.
+For instance, the slowest process might correspond to a biologically irrelevant side-chain flip that occurred only once in the data set.
+This problem may be mitigated by choosing a more specific set of features.
+
+Additional technical challenges for large systems include high demands on memory and computation time; we explain how to deal with those in the tutorials.
+
+More details on how to model complex systems with the techniques presented here are described, e.g., in~\cite{plattner_protein_2015,plattner_complete_2017}.
+
 \subsection{Advanced Methods}
 
 While the present tutorial is intended to cover Markov State Modeling 101, we encourage the user to explore other, more recent extensions of the methodology.
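
The pipeline sketched in the added subsection (feature selection, dimensionality reduction, discretization) can be illustrated with a minimal PyEMMA snippet. This is an editorial sketch, not part of the commit; the topology and trajectory file names are placeholders, and all parameter values are illustrative.

```python
import pyemma

# Placeholder input files -- substitute your own system.
pdb = 'topology.pdb'
files = ['traj-0.xtc', 'traj-1.xtc']

# 1) Pick a compact, informative feature set instead of raw Cartesian coordinates.
feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False)
data = pyemma.coordinates.load(files, features=feat)

# 2) Reduce dimensionality so that the slow dynamics live in a low-dimensional space
#    (TICA with PyEMMA's default 95% kinetic variance cutoff).
tica = pyemma.coordinates.tica(data, lag=10, var_cutoff=0.95)
tica_output = tica.get_output()

# 3) Discretize the projected data; the discrete trajectories feed the MSM estimation.
cluster = pyemma.coordinates.cluster_kmeans(tica_output, k=100, max_iter=50, stride=10)
dtrajs = cluster.dtrajs
```
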

notebooks/00-pentapeptide-showcase.ipynb

+1 −1
@@ -1636,7 +1636,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.5"
+"version": "3.6.3"
 },
 "toc": {
 "base_numbering": 1,

notebooks/02-dimension-reduction-and-discretization.ipynb

+111 −12
@@ -28,9 +28,7 @@
 "import matplotlib.pyplot as plt\n",
 "import numpy as np\n",
 "import mdshare\n",
-"import pyemma\n",
-"\n",
-"## Case 1: preprocessed, two-dimensional data (toy model)"
+"import pyemma"
 ]
 },
 {
@@ -298,7 +296,7 @@
 "outputs": [],
 "source": [
 "fig, ax = plt.subplots()\n",
-"i = ax.imshow(tica.feature_TIC_correlation)\n",
+"i = ax.imshow(tica.feature_TIC_correlation, cmap='bwr', vmin=-1, vmax=1)\n",
 "\n",
 "ax.set_xticks([0])\n",
 "ax.set_xlabel('IC')\n",
@@ -363,8 +361,8 @@
 "for ax, cls in zip(axes.flat, [cluster_kmeans, cluster_regspace]):\n",
 "    pyemma.plots.plot_density(*data_concatenated.T, ax=ax, cbar=False, alpha=0.1, logscale=True)\n",
 "    ax.scatter(*cls.clustercenters.T, s=15, c='C1')\n",
-"    ax.set_xlabel('$x$')\n",
-"    ax.set_ylabel('$y$')\n",
+"    ax.set_xlabel('$\\Phi$')\n",
+"    ax.set_ylabel('$\\Psi$')\n",
 "fig.tight_layout()"
 ]
 },
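
The comparison plotted in the hunk above relies on two clustering objects, `cluster_kmeans` and `cluster_regspace`, that are defined earlier in the notebook. A rough sketch of how such a pair could be obtained; `data` is assumed to be the backbone-torsion data already loaded in the notebook, and `k` and `dmin` are illustrative values, not the notebook's:

```python
# Sketch: k-means tends to conserve density, regular-space clustering covers
# the sampled region uniformly.
cluster_kmeans = pyemma.coordinates.cluster_kmeans(data, k=50, max_iter=50)
cluster_regspace = pyemma.coordinates.cluster_regspace(data, dmin=0.3)

print('k-means centers:', cluster_kmeans.clustercenters.shape)
print('regspace centers:', cluster_regspace.clustercenters.shape)
```
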
@@ -525,9 +523,104 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"TICA, by default, uses a lag time of $10$ steps and a kinetic variance cutoff of $95\\%$ to determine the number of ICs. We observe that this projection does resolve some metastability in both ICs.\n",
+"TICA, by default, uses a lag time of $10$ steps, kinetic mapping, and a kinetic variance cutoff of $95\\%$ to determine the number of ICs. We observe that this projection does resolve some metastability in both ICs. Whether these projections are suitable for building Markov state models, though, remains to be seen in later tests.\n",
+"\n",
+"As we discussed in the first example, the physical meaning of the TICA projection is not directly clear. We can analyze the feature TIC correlation as we did above:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"fig, ax = plt.subplots(figsize=(3, 8))\n",
+"i = ax.imshow(tica.feature_TIC_correlation, cmap='bwr')\n",
+"\n",
+"ax.set_xticks(range(tica.dimension()))\n",
+"ax.set_xlabel('IC')\n",
+"\n",
+"ax.set_yticks(range(feat.dimension()))\n",
+"ax.set_yticklabels(feat.describe())\n",
+"ax.set_ylabel('input feature')\n",
+"\n",
+"fig.colorbar(i);"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"This is not very helpful as it only shows that some of our $x, y, z$-coordinates correlate with the TICA components. Since we expect the slow processes to happen in backbone torsion space rather than in Cartesian coordinate space, this comes as no surprise.\n",
+"\n",
+"To understand what the TICs really mean, let us take a more systematic approach and scan through some angular features. We add some randomly chosen angles between heavy atoms and the backbone angles that we already know to be a good feature:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"feat_test = pyemma.coordinates.featurizer(pdb)\n",
+"feat_test.add_backbone_torsions(periodic=False)\n",
+"feat_test.add_angles(feat_test.select_Heavy()[:-1].reshape(3, 3), periodic=False)\n",
+"data_test = pyemma.coordinates.load(files, features=feat_test)\n",
+"data_test_concatenated = np.concatenate(data_test)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"For the sake of simplicity, we use scipy's implementation of Pearson's correlation coefficient, which we compute between our test features and the TICA-projected $x, y, z$-coordinates:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"from scipy.stats import pearsonr\n",
+"test_feature_TIC_correlation = np.zeros((feat_test.dimension(), tica.dimension()))\n",
+"\n",
+"for i in range(feat_test.dimension()):\n",
+"    for j in range(tica.dimension()):\n",
+"        test_feature_TIC_correlation[i, j] = pearsonr(data_test_concatenated[:, i],\n",
+"                                                      tica_concatenated[:, j])[0]"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"vm = abs(test_feature_TIC_correlation).max()\n",
+"\n",
+"fig, ax = plt.subplots()\n",
+"i = ax.imshow(test_feature_TIC_correlation, vmin=-vm, vmax=vm, cmap='bwr')\n",
 "\n",
-"Whether these projections are suitable for building Markov state models, though, remains to be seen in later tests.\n",
+"ax.set_xticks(range(tica.dimension()))\n",
+"ax.set_xlabel('IC')\n",
+"\n",
+"ax.set_yticks(range(feat_test.dimension()))\n",
+"ax.set_yticklabels(feat_test.describe())\n",
+"ax.set_ylabel('input feature')\n",
+"\n",
+"fig.colorbar(i);"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"From this simple analysis, we find that the features that correlate most with our TICA projection are indeed the backbone torsion angles used previously. We might thus expect the dynamics in TICA space to be similar to those in backbone torsion space. Please note that in general, we do not know which feature would be a good observable. Thus, a realistic scenario might require a much broader scan of a large set of different features.\n",
+"\n",
+"However, it should be mentioned that TICA projections do not necessarily have a simple physical interpretation. The above analysis might very well end with feature TIC correlations that show no significant contributor and rather hint at a complicated linear combination of input features.\n",
+"\n",
+"As an alternative to understanding the projection in detail at this stage, one might go one step further and extract representative structures e.g. from an MSM, as shown in [Notebook 05 📓](05-pcca-tpt.ipynb).\n",
 "\n",
 "#### Exercise 3: PCA parameters\n",
 "\n",
@@ -667,7 +760,15 @@
 "\n",
 "## Case 3: another molecular dynamics data set (pentapeptide)\n",
 "\n",
-"We fetch the pentapeptide data set, load several different input features into memory and perform a VAMP estimation/scoring of each. Since we want to evaluate the VAMP score on a disjoint test set, we split the available files into a train and test set."
+"Before we start to load and discretize the pentapeptide data set, let us discuss the difficulties that arise with larger protein systems. The goal of this notebook is to find a state space discretization for MSM estimation. This means that an algorithm such as $k$-means has to be able to find a meaningful state space partitioning. In general, this works better in lower dimensional spaces because Euclidean distances become less meaningful with increasing dimensionality <a id=\"ref-4\" href=\"#cite-aggarwal_surprising_2001\">aggarwal-01</a>. The modeler should be aware that a discretization of hundreds of dimensions will be computationally expensive and most likely yield unsatisfactory results.\n",
+"\n",
+"The first goal is thus to map the data to a reasonable number of dimensions, e.g. with a smart choice of features and/or by using TICA. Large systems often require a significant part of the kinetic variance to be discarded in order to balance capturing as much of it as possible against achieving a reasonable discretization.\n",
+"\n",
+"When choosing a discretization algorithm, one should also bear in mind the distribution of density. The $k$-means algorithm tends to conserve density, i.e. data sets that incorporate regions of extremely high density as well as poorly sampled regions might be problematic, especially in high dimensions. In those cases, a regular spatial clustering might be worth a try.\n",
+"\n",
+"More details on problematic data situations and how to cope with them are explained in [Notebook 08 📓](08-common-problems.ipynb).\n",
+"\n",
+"Now, we fetch the pentapeptide data set, load several different input features into memory and perform a VAMP estimation/scoring of each. Since we want to evaluate the VAMP score on a disjoint test set, we split the available files into a train and test set."
 ]
 },
 {
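
The train/test VAMP scoring described above could look roughly like the following sketch; `files` and the featurizer `feat` are assumed to be defined as in the notebook, and the 50/50 split, lag time and dimension are illustrative choices rather than the notebook's actual settings:

```python
import numpy as np

# Sketch: score a VAMP model on a disjoint test set of trajectory files.
files = np.asarray(files)
idx = np.random.permutation(len(files))
train_files = list(files[idx[:len(files) // 2]])
test_files = list(files[idx[len(files) // 2:]])

data_train = pyemma.coordinates.load(train_files, features=feat)
data_test = pyemma.coordinates.load(test_files, features=feat)

vamp = pyemma.coordinates.vamp(data_train, lag=10, dim=2)
print('VAMP2 score on test set:', vamp.score(data_test))
```
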
@@ -1036,9 +1137,7 @@
 "    ax.scatter(*cluster.clustercenters[:, [i, j]].T, s=15, c='C1')\n",
 "    ax.set_xlabel('IC {}'.format(i + 1))\n",
 "    ax.set_ylabel('IC {}'.format(j + 1))\n",
-"fig.tight_layout()\n",
-"\n",
-"## Wrapping up"
+"fig.tight_layout()"
 ]
 },
 {

notebooks/03-msm-estimation-and-validation.ipynb

+12 −2
@@ -140,7 +140,9 @@
 "source": [
 "We can see a perfect agreement between models estimated at higher lag times and predictions of the model at lag time $1$ step. Thus, we have estimated a valid MSM according to basic model validation.\n",
 "\n",
-"Should a CK test fail, it means that the dynamics in the space of metastable states is not Markovian. This can have multiple reasons since it is the result of the combination of all steps in the pipeline. In practice, one would attempt to find a better model by tuning hyper-parameters such as the number of metastable states, the MSM lag time or the number of cluster centers. Back-tracking the error by following the pipeline in an upstream direction is usually advised. A failing CK test might further hint at poor sampling.\n",
+"Should a CK test fail, it means that the dynamics in the space of metastable states is not Markovian. This can have multiple causes since it is the result of the combination of all steps in the pipeline. In practice, one would attempt to find a better model by tuning hyper-parameters such as the number of metastable states, the MSM lag time, or the number of cluster centers. Back-tracking the error by following the pipeline in an upstream direction, i.e. by starting with the number of metastable states, is usually advised.\n",
+"\n",
+"A failing CK test might further hint at poor sampling. This case is explained in more detail in [Notebook 08 📓](08-common-problems.ipynb#poorly_sampled_dw).\n",
 "\n",
 "## Case 2: low-dimensional molecular dynamics data (alanine dipeptide)\n",
 "We fetch the alanine dipeptide data set, load the backbone torsions into memory and directly discretize the full space using $k$-means clustering. In order to demonstrate how to adjust the MSM lag time, we will first set the number of cluster centers to $200$ and justify this choice later."
@@ -238,7 +240,15 @@
 "source": [
 "We can see from this analysis that the ITS curves indeed converge towards the $200$ centers case and we can continue with estimating/validating an MSM.\n",
 "\n",
-"We estimate an MSM at lag time $10$ ps and, given that we have three slow processes, perform a CK test for four metastable states. In general, the number of metastable states is a modeler's choice and will be explained in further notebooks."
+"Before we continue with MSM estimation, let us discuss implied timescale convergence for large systems. Given sufficient sampling, the task is often to find a discretization that captures the process of interest well enough to obtain implied timescales that converge within the trajectory length.\n",
+"\n",
+"As we see in the above example with $k=20$ cluster centers, increasing the MSM lag time compensates for poor discretization to a certain extent. In a more realistic system, however, trajectories have a finite length that limits the choice of our MSM lag time. Furthermore, our clustering might be worse than the one presented above, so convergence might not be reached at all. Thus, we aim to converge the implied timescales at a low lag time by fine-tuning not only the number of cluster centers but also feature selection and dimension reduction measures. This additionally ensures that our model has the maximum achievable temporal resolution.\n",
+"\n",
+"Please note that choosing an appropriate MSM lag time variationally (e.g. using VAMP scoring) is, as far as we know, not possible.\n",
+"\n",
+"Further details on how to account for poor discretization can be found in our notebook about hidden Markov models, [Notebook 07 📓](07-hidden-markov-state-models.ipynb). An example of how implied timescales behave in the limit of poor sampling is shown in [Notebook 08 📓](08-common-problems.ipynb).\n",
+"\n",
+"Now, let's continue with the alanine dipeptide system. We estimate an MSM at lag time $10$ ps and, given that we have three slow processes, perform a CK test for four metastable states. In general, the number of metastable states is a modeler's choice and will be explained in further notebooks."
 ]
 },
 {
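
The implied timescale analysis referred to above can be sketched as follows; `cluster.dtrajs` is assumed from the notebook, and the list of lag times and the Bayesian error bars are illustrative choices:

```python
# Sketch: implied timescales as a function of lag time, used to judge the
# discretization quality and to choose the MSM lag time.
its = pyemma.msm.its(cluster.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')
pyemma.plots.plot_implied_timescales(its, units='ps');
```
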
