Introduction to intake-esm

This notebook demonstrates how to access Google Cloud CMIP6 data using intake-esm.

Intake-esm is a data cataloging utility built on top of intake, pandas, and xarray. Intake-esm aims to facilitate:

the discovery of Earth’s climate and weather datasets.
the ingestion of these datasets into xarray dataset containers.

Imports

Its basic usage is shown below. To begin, let’s import intake:

import intake

Load the Catalog

At import time, the intake-esm plugin is available in intake’s registry as esm_datastore and can be accessed with the intake.open_esm_datastore() function. Use the intake_esm.tutorial.get_url() function to access smaller, subsetted catalogs for tutorial purposes.

import intake_esm
url = intake_esm.tutorial.get_url('google_cmip6')
print(url)
cat = intake.open_esm_datastore(url)
cat

The summary above tells us that this catalog contains 261 data assets. We can get more information on the individual data assets contained in the catalog by looking at the underlying dataframe created when we load the catalog:

cat.df.head()

The first data asset listed in the catalog contains:

the Northward Wind (variable_id=‘va’) as a function of latitude, longitude, and time,
from the latest version of the IPSL climate model (source_id=‘IPSL-CM6A-LR’),
for a simulation of the recent past with all historical forcings (experiment_id=‘historical’),
developed by the Institut Pierre Simon Laplace (institution_id=‘IPSL’),
run as part of the Coupled Model Intercomparison Project (activity_id=‘CMIP’).

It is located in Google Cloud Storage at ‘gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/va/gr/v20180803/’.
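The store path itself encodes most of these attributes. As a rough illustration, the sketch below splits the path shown above into named fields. The field order is an assumption inferred from this single example path, not a documented layout, and the `fields` list is a hypothetical helper rather than part of intake-esm.

```python
# Split the store path of the first catalog row into its components.
# ASSUMPTION: the segment order below is inferred from this one example path.
path = "gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/va/gr/v20180803/"
prefix = "gs://cmip6/CMIP6/"

fields = [
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
]

# Drop the bucket prefix and the trailing slash, then split on "/".
parts = path[len(prefix):].rstrip("/").split("/")
asset = dict(zip(fields, parts))

for name, value in asset.items():
    print(f"{name:>15}: {value}")
```

Note that the path carries the version with a leading "v" (v20180803), whereas the catalog's version column lists plain numbers such as 20180803.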

Finding unique entries

To get unique values for given columns in the catalog, intake-esm provides the intake_esm.core.esm_datastore.unique() method.

Let’s query the data catalog to see what models (source_id), experiments (experiment_id), and temporal frequencies (table_id) are available.

unique = cat.unique()
unique

activity_id                                                       [CMIP]
institution_id                                             [IPSL, CCCma]
source_id                                        [IPSL-CM6A-LR, CanESM5]
experiment_id                                               [historical]
member_id              [r2i1p1f1, r8i1p1f1, r30i1p1f1, r29i1p1f1, r3i...
table_id                                                     [Amon, Oyr]
variable_id                                                 [va, ua, o2]
grid_label                                                      [gr, gn]
zstore                 [gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histo...
dcpp_init_year                                                        []
version                         [20180803, 20190429, 20190802, 20191204]
derived_variable_id                                                   []
dtype: object

unique['source_id']

['IPSL-CM6A-LR', 'CanESM5']

unique['experiment_id']

['historical']

unique['table_id']

['Amon', 'Oyr']
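Conceptually, this amounts to collecting the distinct values of each column. The pure-Python sketch below shows the idea on a hypothetical three-row table. `rows` and `unique_values` are illustrative names, and this is not how intake-esm implements it internally.

```python
# Hypothetical miniature catalog: each dict stands in for one row of cat.df.
rows = [
    {"source_id": "IPSL-CM6A-LR", "table_id": "Amon", "variable_id": "va"},
    {"source_id": "IPSL-CM6A-LR", "table_id": "Amon", "variable_id": "ua"},
    {"source_id": "CanESM5", "table_id": "Oyr", "variable_id": "o2"},
]


def unique_values(rows):
    """Return the distinct values of every column, in first-seen order."""
    result = {}
    for row in rows:
        for column, value in row.items():
            seen = result.setdefault(column, [])
            if value not in seen:
                seen.append(value)
    return result


unique_toy = unique_values(rows)
print(unique_toy["source_id"])
print(unique_toy["table_id"])
```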

Search for specific datasets

The intake_esm.core.esm_datastore.search() method allows the user to perform a query on a catalog using keyword arguments. The keyword argument names must match column names in the catalog. The search method returns a subset of the catalog with all the entries that match the provided query.

In the example below, we are going to search for the following:

variable_id: o2 which stands for mole_concentration_of_dissolved_molecular_oxygen_in_sea_water
experiment_id: [‘historical’, ‘ssp585’]:
- historical: all forcing of the recent past.
- ssp585: emission-driven RCP8.5 based on SSP5.
table_id: Oyr which stands for annual mean variables on the ocean grid.
grid_label: gn which stands for data reported on a model’s native grid.

For more details on the CMIP6 vocabulary, please check this website and the Core Controlled Vocabularies (CVs) for use in CMIP6 GitHub repository.

cat_subset = cat.search(
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)

cat_subset
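To make the matching rule concrete, here is a minimal pure-Python sketch of the query semantics as we understand them: a row is kept when every queried column matches, and a list of values means "any of these". The `rows` table and this `search` function are hypothetical stand-ins, not the intake-esm implementation.

```python
# Hypothetical miniature catalog standing in for cat.df.
rows = [
    {"experiment_id": "historical", "table_id": "Amon", "variable_id": "va", "grid_label": "gr"},
    {"experiment_id": "historical", "table_id": "Oyr", "variable_id": "o2", "grid_label": "gn"},
    {"experiment_id": "historical", "table_id": "Amon", "variable_id": "ua", "grid_label": "gn"},
]


def search(rows, **query):
    """Keep rows where every queried column matches one of the requested values."""
    matches = []
    for row in rows:
        keep = True
        for column, wanted in query.items():
            # A single value is treated like a one-element list.
            options = wanted if isinstance(wanted, (list, tuple)) else [wanted]
            if row.get(column) not in options:
                keep = False
                break
        if keep:
            matches.append(row)
    return matches


subset = search(
    rows,
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)
print(len(subset), subset[0]["variable_id"])
```

Requesting a value that is absent from the table (here ssp585) does not raise an error in this sketch; it simply contributes no matches. The tutorial catalog lists only historical under experiment_id, so the same applies to the query above.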

Load datasets using `to_dataset_dict()`

Intake-esm implements convenience utilities for loading the query results into higher-level xarray datasets. The logic for merging and concatenating the query results into these datasets is provided in the input JSON file and is available under the .esmcat.aggregation_control property of the catalog:

cat.esmcat.aggregation_control

AggregationControl(variable_column_name='variable_id', groupby_attrs=['activity_id', 'institution_id', 'source_id', 'experiment_id', 'table_id', 'grid_label'], aggregations=[Aggregation(type=<AggregationType.union: 'union'>, attribute_name='variable_id', options={}), Aggregation(type=<AggregationType.join_new: 'join_new'>, attribute_name='member_id', options={'coords': 'minimal', 'compat': 'override'}), Aggregation(type=<AggregationType.join_new: 'join_new'>, attribute_name='dcpp_init_year', options={'coords': 'minimal', 'compat': 'override'})])

To load data assets into xarray datasets, we use the intake_esm.core.esm_datastore.to_dataset_dict() method. As the name suggests, this method returns a dictionary of aggregated xarray datasets.

dset_dict = cat_subset.to_dataset_dict(
    xarray_open_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)

list(dset_dict)

['CMIP.CCCma.CanESM5.historical.Oyr.gn',
 'CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn']
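These keys appear to be the values of the groupby_attrs columns from the aggregation control shown earlier, joined together. The sketch below rebuilds one key by hand. The "." separator is an assumption read off the keys printed above, and `row` is a hypothetical catalog entry.

```python
# groupby_attrs as printed in the aggregation control above.
groupby_attrs = [
    "activity_id", "institution_id", "source_id",
    "experiment_id", "table_id", "grid_label",
]

# Hypothetical catalog row for one CanESM5 asset.
row = {
    "activity_id": "CMIP",
    "institution_id": "CCCma",
    "source_id": "CanESM5",
    "experiment_id": "historical",
    "member_id": "r10i1p1f1",
    "table_id": "Oyr",
    "variable_id": "o2",
    "grid_label": "gn",
}

# ASSUMPTION: "." is the separator, inferred from the displayed keys.
key = ".".join(row[attr] for attr in groupby_attrs)
print(key)
```

Columns that are not in groupby_attrs, such as member_id and variable_id, do not appear in the key; according to the aggregation control, assets that differ only in those columns are combined into the same dataset.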

We can access a particular dataset as follows:

ds = dset_dict["CMIP.CCCma.CanESM5.historical.Oyr.gn"]
ds

Let’s create a quick plot for a slice of the data:

ds.o2.isel(time=0,
           lev=0,
           member_id=range(1, 24, 4)
          ).plot(col="member_id", col_wrap=3, robust=True)

<xarray.plot.facetgrid.FacetGrid at 0x163dd9210>
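As a small aside on the selection used in this plot, the snippet below spells out which positions `range(1, 24, 4)` picks along member_id and how many rows of panels result from `col_wrap=3`. It only does the arithmetic; the names `member_indices` and `n_rows` are illustrative.

```python
# Positional indices selected along the member_id dimension.
member_indices = list(range(1, 24, 4))
print(member_indices)

# With col_wrap=3, the panels wrap onto a new row after every third panel.
col_wrap = 3
n_rows = -(-len(member_indices) // col_wrap)  # ceiling division
print(len(member_indices), "panels in", n_rows, "rows")
```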

Use custom preprocessing functions

When comparing many models, it is often necessary to preprocess the datasets (e.g. rename certain variables) before running an analysis step. The preprocess argument lets the user pass a function that is executed on each loaded asset before the datasets are combined.

cat_pp = cat.search(
    experiment_id=["historical"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
    source_id=["IPSL-CM6A-LR", "CanESM5"],
    member_id="r10i1p1f1",
)
cat_pp.df

dset_dict_raw = cat_pp.to_dataset_dict(xarray_open_kwargs={"consolidated": True})

for k, ds in dset_dict_raw.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")

Note that the two models follow different naming schemes. We can define a small helper function and pass it to .to_dataset_dict() to fix this. For demonstration purposes, we will focus on the vertical level dimension, which is called lev in CanESM5 and olevel in IPSL-CM6A-LR.

def helper_func(ds):
    """Rename `olevel` dim to `lev`"""
    ds = ds.copy()
    # a short example
    if "olevel" in ds.dims:
        ds = ds.rename({"olevel": "lev"})
    return ds

dset_dict_fixed = cat_pp.to_dataset_dict(xarray_open_kwargs={"consolidated": True}, preprocess=helper_func)

for k, ds in dset_dict_fixed.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")

This was just an example for one dimension.

Check out the xmip package for a full renaming function for all available CMIP6 models, along with some other utilities.

intake-esm vs. intake-esgf ESDS Presentation

Presentation Slides