diff --git a/notebooks/foundations/enhanced-catalog.ipynb b/notebooks/foundations/enhanced-catalog.ipynb index 7295a6d..dc3ad76 100644 --- a/notebooks/foundations/enhanced-catalog.ipynb +++ b/notebooks/foundations/enhanced-catalog.ipynb @@ -26,7 +26,7 @@ "metadata": {}, "source": [ "## Overview\n", - "This notebook compares the original [Intake-ESM](https://intake-esm.readthedocs.io/en/stable/) catalog with an enhanced catalog that includes additional attributes. Both catalogs are an inventory of the NCAR Community Earth System Model (CESM) Large Ensemble (LENS) data hosted on AWS S3 ([doi:10.26024/wt24-5j82](https://doi.org/10.26024/wt24-5j82))." + "This notebook compares one [Intake-ESM](https://intake-esm.readthedocs.io/en/stable/) catalog with an enhanced version that includes additional attributes. Both catalogs are an inventory of the NCAR Community Earth System Model (CESM) Large Ensemble (LENS) data hosted on AWS S3 ([doi:10.26024/wt24-5j82](https://doi.org/10.26024/wt24-5j82))." ] }, { @@ -63,15 +63,7 @@ "outputs": [], "source": [ "import intake\n", - "import pandas as pd\n", - "import pprint\n", - "\n", - "# Allow multiple lines per cell to be displayed without print (default is just last line)\n", - "from IPython.core.interactiveshell import InteractiveShell\n", - "InteractiveShell.ast_node_interactivity = \"all\"\n", - "\n", - "# Enable more explicit control of DataFrame display (e.g., to omit annoying line numbers)\n", - "from IPython.display import HTML" + "import pandas as pd" ] }, { @@ -85,7 +77,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Open the original collection description file:" + "At import time, the `intake-esm` plugin is available in `intake`’s registry as `esm_datastore` and can be accessed with the `intake.open_esm_datastore()` function. 
" ] }, { @@ -95,7 +87,15 @@ "outputs": [], "source": [ "cat_url_orig = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'\n", - "coll_orig = intake.open_esm_datastore(cat_url_orig)" + "coll_orig = intake.open_esm_datastore(cat_url_orig)\n", + "coll_orig" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's a summary representation:" ] }, { @@ -104,16 +104,14 @@ "metadata": {}, "outputs": [], "source": [ - "print(coll_orig.esmcol_data['description']) #Description of collection\n", - "print(\"Catalog file:\", coll_orig.esmcol_data['catalog_file'])\n", - "print(coll_orig) # Summary of collection structure" + "print(coll_orig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Show an expanded version of the collection structure with details:" + "In an Intake-ESM catalog object, the `esmcat` class provides many useful attributes and functions. For example, we can get the collection's description:" ] }, { @@ -122,15 +120,14 @@ "metadata": {}, "outputs": [], "source": [ - "uniques_orig = coll_orig.unique(columns=[\"component\", \"frequency\", \"experiment\", \"variable\"])\n", - "pprint.pprint(uniques_orig, compact=True, indent=1, width=80)" + "coll_orig.esmcat.description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Show the first few lines of the catalog. There are as many lines as there are paths. The order is the same as that of the CSV catalog file listed in the JSON description file." 
+ "We can also get the URL pointing to the catalog's underlying tabular representation:" ] }, { @@ -139,16 +136,61 @@ "metadata": {}, "outputs": [], "source": [ - "print(\"Catalog file:\", coll_orig.esmcol_data['catalog_file'])\n", - "df = coll_orig.df\n", - "HTML(df.head(10).to_html(index=False))" + "coll_orig.esmcat.catalog_file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Table**: First few lines of the original Intake-ESM catalog showing the model component, the temporal frequency, the experiment, the abbreviated variable name, and the AWS S3 path for each Zarr store." + "That's a CSV file ... let's take a peek." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_orig = pd.read_csv(coll_orig.esmcat.catalog_file)\n", + "df_orig" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, we can save a step, since an ESM catalog object provides a `df` attribute that returns the same DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_orig = coll_orig.df\n", + "df_orig" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Print out a sorted list of the unique values of selected columns:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for col in ['component', 'frequency', 'experiment', 'variable']:\n", + " unique_vals = coll_orig.unique()[col]\n", + " unique_vals.sort()\n", + " count = len(unique_vals)\n", + " print(col + ': ', unique_vals, \" count: \", count, '\\n')" ] }, { @@ -172,7 +214,7 @@ "outputs": [], "source": [ "df = coll_orig.search(variable='FLNS').df\n", - "HTML(df.to_html(index=False))" + "df" ] }, { @@ -189,37 +231,36 @@ "outputs": [], "source": [ "df = coll_orig.search(variable='FLNS', frequency='daily', experiment='RCP85').df\n", - "HTML(df.to_html(index=False))" + "df" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ - "## The Problem" + "
The problem: Do all potential users know that `FLNS` is a CESM-specific abbreviation for \"net longwave flux at surface\"? How would a novice user find out, other than by finding separate documentation, or by opening a Zarr store in the hopes that the long name might be recorded there? How do we address the fact that every climate model code seems to have a different, non-standard name for all the variables, thus making multi-source research needlessly difficult?
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Do all potential users know that `FLNS` is a CESM-specific abbreviation for “Net longwave flux at surface”? How would a novice user find out, other than by finding separate documentation, or by opening a Zarr store in the hopes that the long name might be recorded there? How do we address the fact that every climate model code seems to have a different, non-standard name for all the variables, thus making multi-source research needlessly difficult?" + "
The solution:
\n", + "

Enhanced Intake-ESM Catalog!

" ] }, { "cell_type": "markdown", - "metadata": { - "tags": [] - }, + "metadata": {}, "source": [ - "## Enhanced Intake-ESM Catalog" + "By adding additional columns to the Intake-ESM catalog, we should be able to improve semantic interoperability and provide potentially useful information to the users. Let's now open the *enhanced* collection description file:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "By adding additional columns to the Intake-ESM catalog, we should be able to improve semantic interoperability and provide potentially useful information to the users. Let's now open the enhanced collection description file:" + "
Note: The URL for the enhanced catalog differs from the original only in that it has -enhanced appended to aws-cesm1-le.
" ] }, { @@ -233,14 +274,21 @@ "coll" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we did for the first catalog, let's obtain the `description` and `catalog_file` attributes." + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "print(coll.esmcol_data['description']) # Description of collection\n", - "print(\"Catalog file:\", coll.esmcol_data['catalog_file'])\n", + "print(coll.esmcat.description) # Description of collection\n", + "print(\"Catalog file:\", coll.esmcat.catalog_file)\n", "print(coll) # Summary of collection structure" ] }, @@ -257,7 +305,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In the summary above, note the addition of additional elements: `long_name`, `start`, `end`, and `dim`. Here are the first few lines of the enhanced catalog:" + "In the catalog's representation above, note the additional elements: `long_name`, `start`, `end`, and `dim`. Here are the first and last few lines of the enhanced catalog:" ] }, { @@ -266,15 +314,8 @@ "metadata": {}, "outputs": [], "source": [ - "print(\"Catalog file:\", coll.esmcol_data['catalog_file'])\n", - "HTML(coll.df.head(10).to_html(index=False))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Table**: First few lines of the enhanced catalog, listing of the same information as the original catalog as well as the long name of each variable and an indication of whether each variable is 2D or 3D." + "df_enh = coll.df\n", + "df_enh" ] }, { @@ -283,8 +324,8 @@ "source": [ "
\n", "

Warning

\n", - " The long names are not CF Standard Names, but rather are those documented at \n", - "http://www.cgd.ucar.edu/ccr/strandwg/CESM-CAM5-BGC_LENS_fields.html. For interoperability, the long_name column should be replaced by a cf_name column and possibly an attribute column to disambiguate if needed.\n", + " The long_names are not CF Standard Names, but rather are those documented at \n", + "the NCAR LENS website. For interoperability, the long_name column should be replaced by a cf_name column and possibly an attribute column to disambiguate if needed.\n", "
" ] }, { @@ -301,9 +342,23 @@ "metadata": {}, "outputs": [], "source": [ - "uniques = coll.unique(columns=['long_name'])\n", - "nameList = sorted(uniques['long_name']['values'])\n", - "print(*nameList, sep='\\n') #note *list to unpack each item for print function" + "nameList = coll.unique()['long_name']\n", + "nameList.sort()\n", + "print(*nameList, sep='\\n')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Search capabilities" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use an `intake-esm` catalog object's `search` function in several ways: " ] }, { @@ -320,7 +375,25 @@ "outputs": [], "source": [ "myName = 'Salinity'\n", - "HTML(coll.search(long_name=myName).df.to_html(index=False))" + "df = coll.search(long_name=myName).df\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Search based on multiple criteria:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = coll.search(experiment=['20C','RCP85'], dim='3D', variable=['T','Q']).df\n", + "df" ] }, { @@ -334,7 +407,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The current version of the Intake-ESM `.search()` function requires an exact full-string case-sensitive match of `long_name`. (This has been reported as an issue at [https://github.com/NCAR/cesm-lens-aws/issues/48](https://github.com/NCAR/cesm-lens-aws/issues/48)). Demonstrate a work-around: find all variables with a particular substring in the long name" + "In some cases, you may not know the exact term to look for. For such cases, intake-esm supports searching for substring matches. Using wildcards and/or regular expressions, we can find all items with a particular substring in a given column. 
Let’s search for:\n", + "- entries from experiment = ‘20C’\n", + "- all entries whose variable long name contains wind" ] }, { @@ -343,18 +418,23 @@ "metadata": {}, "outputs": [], "source": [ - "myTerm = 'Wind'\n", - "myTerm = myTerm.lower() #search regardless of case\n", - "partials = [name for name in nameList if myTerm in name.lower()]\n", - "print(f\"All datasets with name containing {myTerm}:\")\n", - "print(*partials, sep='\\n')" + "coll_subset = coll.search(experiment=\"20C\", long_name=\"Wind*\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "coll_subset.df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Display full table for each match (could be lengthy if many matches):" + "If we wanted to search for Wind and wind, we can take advantage of [regular expression](https://docs.python.org/3/library/re.html) syntax to do so:" ] }, { @@ -363,22 +443,16 @@ "metadata": {}, "outputs": [], "source": [ - "for name in partials:\n", - " df = coll.search(long_name=name).df[['component', 'dim', 'experiment', 'variable', 'long_name']]\n", - " HTML(df.to_html(index=False))\n", - " ###df.head(1) #show only first entry in each group for compactness\n", - " # Note: It is also possible to hide column(s) instead of specifying desired columns\n", - " ###coll.search(long_name=name).df.drop(columns=['path'])" + "coll_subset = coll.search(experiment=\"20C\" , long_name=\"[Ww]ind*\")" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "
\n", - "

Warning

\n", - " The case-insensitive substring matching is not integrated into Intake-ESM, so it is not clear whether resulting search results can be passed directly to Xarray to read data.\n", - "
" + "coll_subset.df" ] }, { @@ -402,7 +476,7 @@ "outputs": [], "source": [ "df = coll.search(dim=\"3D\",component=\"ocn\").df\n", - "HTML(df.to_html(index=False))" + "df" ] }, { @@ -430,7 +504,7 @@ "outputs": [], "source": [ "df = coll.search(dim=\"3D\",component=\"ocn\", end='2100-12').df\n", - "HTML(df.to_html(index=False))" + "df" ] }, { @@ -456,7 +530,9 @@ "metadata": {}, "source": [ "## Resources and references\n", - "[Original notebook in the Pangeo Gallery](https://gallery.pangeo.io/repos/NCAR/cesm-lens-aws/notebooks/EnhancedIntakeCatalogDemo.html)" + "[Original notebook in the Pangeo Gallery](https://gallery.pangeo.io/repos/NCAR/cesm-lens-aws/notebooks/EnhancedIntakeCatalogDemo.html)\n", + "\n", + "[Intake-esm documentation](https://intake-esm.readthedocs.io)" ] } ], @@ -476,7 +552,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.13" + "version": "3.9.15" }, "nbdime-conflicts": { "local_diff": [