|
28 | 28 | "import matplotlib.pyplot as plt\n",
|
29 | 29 | "import numpy as np\n",
|
30 | 30 | "import mdshare\n",
|
31 |    | - "import pyemma\n",
32 |    | - "\n",
33 |    | - "## Case 1: preprocessed, two-dimensional data (toy model)"
   | 31 | + "import pyemma"
34 | 32 | ]
|
35 | 33 | },
|
36 | 34 | {
|
|
298 | 296 | "outputs": [],
|
299 | 297 | "source": [
|
300 | 298 | "fig, ax = plt.subplots()\n",
|
301 |     | - "i = ax.imshow(tica.feature_TIC_correlation)\n",
    | 299 | + "i = ax.imshow(tica.feature_TIC_correlation, cmap='bwr', vmin=-1, vmax=1)\n",
302 | 300 | "\n",
|
303 | 301 | "ax.set_xticks([0])\n",
|
304 | 302 | "ax.set_xlabel('IC')\n",
|
|
363 | 361 | "for ax, cls in zip(axes.flat, [cluster_kmeans, cluster_regspace]):\n",
|
364 | 362 | " pyemma.plots.plot_density(*data_concatenated.T, ax=ax, cbar=False, alpha=0.1, logscale=True)\n",
|
365 | 363 | " ax.scatter(*cls.clustercenters.T, s=15, c='C1')\n",
|
366 |     | - " ax.set_xlabel('$x$')\n",
367 |     | - " ax.set_ylabel('$y$')\n",
    | 364 | + " ax.set_xlabel('$\\Phi$')\n",
    | 365 | + " ax.set_ylabel('$\\Psi$')\n",
368 | 366 | "fig.tight_layout()"
|
369 | 367 | ]
|
370 | 368 | },
|
|
525 | 523 | "cell_type": "markdown",
|
526 | 524 | "metadata": {},
|
527 | 525 | "source": [
|
528 |     | - "TICA, by default, uses a lag time of $10$ steps and a kinetic variance cutoff of $95\\%$ to determine the number of ICs. We observe that this projection does resolve some metastability in both ICs.\n",
    | 526 | + "TICA, by default, uses a lag time of $10$ steps, kinetic mapping, and a kinetic variance cutoff of $95\\%$ to determine the number of ICs. We observe that this projection does resolve some metastability in both ICs. Whether these projections are suitable for building Markov state models, though, remains to be seen in later tests.\n",
    | 527 | + "\n",
    | 528 | + "As we discussed in the first example, the physical meaning of the TICA projection is not immediately clear. We can analyze the feature TIC correlation as we did above:"
    | 529 | + ]
    | 530 | + },
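    |     | + {
    |     | + "cell_type": "markdown",
    |     | + "metadata": {},
    |     | + "source": [
    |     | + "As a minimal sketch (added here for clarity, not part of the original analysis), the defaults mentioned above can be spelled out explicitly; `lag`, `kinetic_map`, and `var_cutoff` are keyword arguments of `pyemma.coordinates.tica()`, and we assume the featurized trajectories are stored in `data` as in the preceding cells:"
    |     | + ]
    |     | + },
    |     | + {
    |     | + "cell_type": "code",
    |     | + "execution_count": null,
    |     | + "metadata": {},
    |     | + "outputs": [],
    |     | + "source": [
    |     | + "# equivalent to the default call; written out only to make the parameters explicit\n",
    |     | + "tica_explicit = pyemma.coordinates.tica(data, lag=10, kinetic_map=True, var_cutoff=0.95)\n",
    |     | + "print('ICs kept at 95% kinetic variance:', tica_explicit.dimension())"
    |     | + ]
    |     | + },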
    | 531 | + {
    | 532 | + "cell_type": "code",
    | 533 | + "execution_count": null,
    | 534 | + "metadata": {},
    | 535 | + "outputs": [],
    | 536 | + "source": [
    | 537 | + "fig, ax = plt.subplots(figsize=(3, 8))\n",
    | 538 | + "i = ax.imshow(tica.feature_TIC_correlation, cmap='bwr', vmin=-1, vmax=1)\n",
    | 539 | + "\n",
    | 540 | + "ax.set_xticks(range(tica.dimension()))\n",
    | 541 | + "ax.set_xlabel('IC')\n",
    | 542 | + "\n",
    | 543 | + "ax.set_yticks(range(feat.dimension()))\n",
    | 544 | + "ax.set_yticklabels(feat.describe())\n",
    | 545 | + "ax.set_ylabel('input feature')\n",
    | 546 | + "\n",
    | 547 | + "fig.colorbar(i);"
    | 548 | + ]
    | 549 | + },
    | 550 | + {
    | 551 | + "cell_type": "markdown",
    | 552 | + "metadata": {},
    | 553 | + "source": [
    | 554 | + "This is not very helpful as it only shows that some of our $x, y, z$-coordinates correlate with the TICA components. Since we rather expect the slow processes to happen in backbone torsion space, this comes as no surprise.\n",
    | 555 | + "\n",
    | 556 | + "To understand what the TICs really mean, let us take a more systematic approach and scan through some angular features. We add some randomly chosen angles between heavy atoms as well as the backbone torsions that we already know to be a good feature:"
    | 557 | + ]
    | 558 | + },
    | 559 | + {
    | 560 | + "cell_type": "code",
    | 561 | + "execution_count": null,
    | 562 | + "metadata": {},
    | 563 | + "outputs": [],
    | 564 | + "source": [
    | 565 | + "feat_test = pyemma.coordinates.featurizer(pdb)\n",
    | 566 | + "feat_test.add_backbone_torsions(periodic=False)\n",
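    |     | + "# select_Heavy() returns heavy-atom indices; all but the last are grouped into three triplets, each defining one angle\n",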
    | 567 | + "feat_test.add_angles(feat_test.select_Heavy()[:-1].reshape(3, 3), periodic=False)\n",
    | 568 | + "data_test = pyemma.coordinates.load(files, features=feat_test)\n",
    | 569 | + "data_test_concatenated = np.concatenate(data_test)"
    | 570 | + ]
    | 571 | + },
    | 572 | + {
    | 573 | + "cell_type": "markdown",
    | 574 | + "metadata": {},
    | 575 | + "source": [
    | 576 | + "For the sake of simplicity, we use scipy's implementation of Pearson's correlation coefficient, which we compute between our test features and the TICA projection of the $x, y, z$-coordinates:"
    | 577 | + ]
    | 578 | + },
    | 579 | + {
    | 580 | + "cell_type": "code",
    | 581 | + "execution_count": null,
    | 582 | + "metadata": {},
    | 583 | + "outputs": [],
    | 584 | + "source": [
    | 585 | + "from scipy.stats import pearsonr\n",
    | 586 | + "test_feature_TIC_correlation = np.zeros((feat_test.dimension(), tica.dimension()))\n",
    | 587 | + "\n",
    | 588 | + "for i in range(feat_test.dimension()):\n",
    | 589 | + "    for j in range(tica.dimension()):\n",
    | 590 | + "        test_feature_TIC_correlation[i, j] = pearsonr(data_test_concatenated[:, i],\n",
    | 591 | + "                                                      tica_concatenated[:, j])[0]"
    | 592 | + ]
    | 593 | + },
    | 594 | + {
    | 595 | + "cell_type": "code",
    | 596 | + "execution_count": null,
    | 597 | + "metadata": {},
    | 598 | + "outputs": [],
    | 599 | + "source": [
    | 600 | + "vm = abs(test_feature_TIC_correlation).max()\n",
    | 601 | + "\n",
    | 602 | + "fig, ax = plt.subplots()\n",
    | 603 | + "i = ax.imshow(test_feature_TIC_correlation, vmin=-vm, vmax=vm, cmap='bwr')\n",
529 | 604 | "\n",
|
530 |     | - "Whether these projections are suitable for building Markov state models, though, remains to be seen in later tests.\n",
    | 605 | + "ax.set_xticks(range(tica.dimension()))\n",
    | 606 | + "ax.set_xlabel('IC')\n",
    | 607 | + "\n",
    | 608 | + "ax.set_yticks(range(feat_test.dimension()))\n",
    | 609 | + "ax.set_yticklabels(feat_test.describe())\n",
    | 610 | + "ax.set_ylabel('input feature')\n",
    | 611 | + "\n",
    | 612 | + "fig.colorbar(i);"
    | 613 | + ]
    | 614 | + },
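    |     | + {
    |     | + "cell_type": "markdown",
    |     | + "metadata": {},
    |     | + "source": [
    |     | + "As a small convenience (a sketch added for this write-up, not part of the original tutorial), the dominant features can also be read off programmatically by ranking each test feature by its largest absolute correlation with any IC:"
    |     | + ]
    |     | + },
    |     | + {
    |     | + "cell_type": "code",
    |     | + "execution_count": null,
    |     | + "metadata": {},
    |     | + "outputs": [],
    |     | + "source": [
    |     | + "# strongest absolute correlation of each feature with any IC, reported in descending order\n",
    |     | + "strongest = np.abs(test_feature_TIC_correlation).max(axis=1)\n",
    |     | + "for k in np.argsort(strongest)[::-1][:5]:\n",
    |     | + "    print('{:<40s} {:.3f}'.format(feat_test.describe()[k], strongest[k]))"
    |     | + ]
    |     | + },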
    | 615 | + {
    | 616 | + "cell_type": "markdown",
    | 617 | + "metadata": {},
    | 618 | + "source": [
    | 619 | + "From this simple analysis, we find that the features that correlate most strongly with our TICA projection are indeed the backbone torsion angles used previously. We might thus expect the dynamics in TICA space to be similar to those in backbone torsion space. Please note that, in general, we do not know in advance which feature would be a good observable; a realistic scenario might therefore require a much broader scan over a large set of different features.\n",
    | 620 | + "\n",
    | 621 | + "However, it should be mentioned that TICA projections do not necessarily have a simple physical interpretation. The above analysis might very well end with feature TIC correlations that show no single dominant contributor and rather hint at a complicated linear combination of input features.\n",
    | 622 | + "\n",
    | 623 | + "As an alternative to understanding the projection in detail at this stage, one might go one step further and extract representative structures, e.g., from an MSM, as shown in [Notebook 05 📓](05-pcca-tpt.ipynb).\n",
531 | 624 | "\n",
|
532 | 625 | "#### Exercise 3: PCA parameters\n",
|
533 | 626 | "\n",
|
|
667 | 760 | "\n",
|
668 | 761 | "## Case 3: another molecular dynamics data set (pentapeptide)\n",
|
669 | 762 | "\n",
|
670 |     | - "We fetch the pentapeptide data set, load several different input features into memory and perform a VAMP estimation/scoring of each. Since we want to evaluate the VAMP score on a disjoint test set, we split the available files into a train and test set."
    | 763 | + "Before we start to load and discretize the pentapeptide data set, let us discuss the difficulties that arise with larger protein systems. The goal of this notebook is to find a state space discretization for MSM estimation. This means that an algorithm such as $k$-means has to be able to find a meaningful state space partitioning. In general, this works better in lower-dimensional spaces because Euclidean distances become less meaningful with increasing dimensionality <a id=\"ref-4\" href=\"#cite-aggarwal_surprising_2001\">aggarwal-01</a>. The modeler should be aware that a discretization of hundreds of dimensions will be computationally expensive and will most likely yield unsatisfactory results.\n",
    | 764 | + "\n",
    | 765 | + "The first goal is thus to map the data to a reasonable number of dimensions, e.g., with a smart choice of features and/or by using TICA. For large systems, one often has to discard a significant part of the kinetic variance in order to balance capturing as much of it as possible against achieving a reasonable discretization.\n",
    | 766 | + "\n",
    | 767 | + "Another point to bear in mind when choosing a discretization algorithm is the distribution of density. The $k$-means algorithm tends to conserve density, i.e., data sets that combine regions of extremely high density with poorly sampled regions might be problematic, especially in high dimensions. For those cases, regular space clustering might be worth a try; a sketch comparing both approaches follows below.\n",
    | 768 | + "\n",
    | 769 | + "More details on problematic data situations and how to cope with them are explained in [Notebook 08 📓](08-common-problems.ipynb).\n",
    | 770 | + "\n",
    | 771 | + "Now, we fetch the pentapeptide data set, load several different input features into memory and perform a VAMP estimation/scoring of each. Since we want to evaluate the VAMP score on a disjoint test set, we split the available files into a train and a test set."
671 | 772 | ]
|
672 | 773 | },
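    |     | + {
    |     | + "cell_type": "markdown",
    |     | + "metadata": {},
    |     | + "source": [
    |     | + "The following sketch (added for illustration, not part of the original tutorial) contrasts the two clustering strategies on the alanine dipeptide TICA projection from Case 2, which is still in memory at this point; the `k` and `dmin` values are ad hoc guesses that would need tuning for a real application:"
    |     | + ]
    |     | + },
    |     | + {
    |     | + "cell_type": "code",
    |     | + "execution_count": null,
    |     | + "metadata": {},
    |     | + "outputs": [],
    |     | + "source": [
    |     | + "# k-means places more centers in densely sampled regions ...\n",
    |     | + "cls_kmeans = pyemma.coordinates.cluster_kmeans(tica_concatenated, k=75, max_iter=50)\n",
    |     | + "# ... while regular space clustering distributes centers uniformly, at least dmin apart\n",
    |     | + "cls_regspace = pyemma.coordinates.cluster_regspace(tica_concatenated, dmin=0.5)\n",
    |     | + "print('k-means centers:  ', cls_kmeans.clustercenters.shape[0])\n",
    |     | + "print('regspace centers: ', cls_regspace.clustercenters.shape[0])"
    |     | + ]
    |     | + },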
|
673 | 774 | {
|
|
1036 | 1137 | " ax.scatter(*cluster.clustercenters[:, [i, j]].T, s=15, c='C1')\n",
|
1037 | 1138 | " ax.set_xlabel('IC {}'.format(i + 1))\n",
|
1038 | 1139 | " ax.set_ylabel('IC {}'.format(j + 1))\n",
|
1039 |      | - "fig.tight_layout()\n",
1040 |      | - "\n",
1041 |      | - "## Wrapping up"
     | 1140 | + "fig.tight_layout()"
1042 | 1141 | ]
|
1043 | 1142 | },
|
1044 | 1143 | {
|
|