
Introduction to Xarray


Overview

This notebook will introduce the basics of gridded, labeled data with Xarray. Since Xarray introduces additional abstractions on top of plain arrays of data, our goal is to show why these abstractions are useful and how they frequently lead to simpler, more robust code.

We’ll cover these topics:

  1. Create a DataArray, one of the core object types in Xarray

  2. Understand how to use named coordinates and metadata in a DataArray

  3. Combine individual DataArrays into a Dataset, the other core object type in Xarray

  4. Subset, slice, and interpolate the data using named coordinates

  5. Open netCDF data using Xarray

  6. Basic subsetting and aggregation of a Dataset

  7. Brief introduction to plotting with Xarray

Prerequisites

Concepts                   Importance   Notes
NumPy Basics               Necessary
Intermediate NumPy         Helpful      Familiarity with indexing and slicing arrays
NumPy Broadcasting         Helpful      Familiarity with array arithmetic and broadcasting
Introduction to Pandas     Helpful      Familiarity with labeled data
Datetime                   Helpful      Familiarity with time formats and the timedelta object
Understanding of NetCDF    Helpful      Familiarity with metadata structure

  • Time to learn: 30 minutes


Imports

Similar to NumPy (commonly imported as np) and pandas (imported as pd), xarray is usually imported under the shortened namespace xr.

from datetime import timedelta

import numpy as np
import pandas as pd
import xarray as xr

from bokeh.models.formatters import DatetimeTickFormatter
import hvplot.xarray
import holoviews as hv
hv.extension("bokeh")

Introducing the DataArray and Dataset

Xarray expands on the capabilities of NumPy arrays, providing a lot of streamlined data manipulation. It is similar in that respect to Pandas, but whereas Pandas excels at working with tabular data, Xarray is focused on N-dimensional arrays of data (i.e. grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-java’s Common Data Model (CDM).

Creation of a DataArray object

The DataArray is one of the basic building blocks of Xarray (see docs here). It provides a numpy.ndarray-like object that expands to provide two critical pieces of functionality:

  1. Coordinate names and values are stored with the data, making slicing and indexing much more powerful

  2. It has a built-in container for attributes

Here we’ll initialize a DataArray object by wrapping a plain NumPy array, and explore a few of its properties.

Generate a random numpy array

For our first example, we’ll just create a random array of “temperature” data in units of Kelvin:

data = 283 + 5 * np.random.randn(5, 3, 4)
data
array([[[293.32608995, 287.75520854, 278.24368034, 289.93349129],
        [279.54926809, 292.58999623, 289.09722696, 295.32393659],
        [286.39552919, 292.62472063, 285.59258638, 279.5746119 ]],

       [[275.22541477, 282.256551  , 278.40045095, 284.60149592],
        [278.20326077, 280.14655679, 279.93812011, 288.63902218],
        [283.54043822, 288.61230464, 287.60904578, 280.82710424]],

       [[280.64538027, 286.82463884, 279.02115827, 279.35974271],
        [271.30562762, 278.36757438, 280.80482198, 281.46167131],
        [287.28502386, 286.33635438, 280.75888695, 285.57650857]],

       [[277.2173685 , 290.53339717, 269.012194  , 285.18244165],
        [291.20111636, 274.80961831, 287.57426188, 285.57278336],
        [282.01410077, 290.57880715, 273.57407499, 288.25025788]],

       [[279.36387344, 289.33578832, 276.95374706, 284.28057509],
        [279.0713138 , 282.48567402, 287.18152603, 285.8043493 ],
        [289.46101083, 287.92493614, 294.59037036, 282.37363415]]])

Wrap the array: first attempt

Now we create a basic DataArray just by passing our plain data as input:

temp = xr.DataArray(data)
temp
<xarray.DataArray (dim_0: 5, dim_1: 3, dim_2: 4)>
array([[[293.32608995, 287.75520854, 278.24368034, 289.93349129],
        [279.54926809, 292.58999623, 289.09722696, 295.32393659],
        [286.39552919, 292.62472063, 285.59258638, 279.5746119 ]],

       [[275.22541477, 282.256551  , 278.40045095, 284.60149592],
        [278.20326077, 280.14655679, 279.93812011, 288.63902218],
        [283.54043822, 288.61230464, 287.60904578, 280.82710424]],

       [[280.64538027, 286.82463884, 279.02115827, 279.35974271],
        [271.30562762, 278.36757438, 280.80482198, 281.46167131],
        [287.28502386, 286.33635438, 280.75888695, 285.57650857]],

       [[277.2173685 , 290.53339717, 269.012194  , 285.18244165],
        [291.20111636, 274.80961831, 287.57426188, 285.57278336],
        [282.01410077, 290.57880715, 273.57407499, 288.25025788]],

       [[279.36387344, 289.33578832, 276.95374706, 284.28057509],
        [279.0713138 , 282.48567402, 287.18152603, 285.8043493 ],
        [289.46101083, 287.92493614, 294.59037036, 282.37363415]]])
Dimensions without coordinates: dim_0, dim_1, dim_2

Note two things:

  1. Xarray generates some basic dimension names for us (dim_0, dim_1, dim_2). We’ll improve this with better names in the next example.

  2. Wrapping the numpy array in a DataArray gives us a rich display in the notebook! (Try clicking the array symbol to expand or collapse the view)
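As a minimal sketch of exploring a few DataArray properties, the block below wraps a fresh random array and prints some of them. The variable name `sample` and the seeded generator are illustrative choices, not part of the notebook.

```python
import numpy as np
import xarray as xr

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
sample = xr.DataArray(283 + 5 * rng.standard_normal((5, 3, 4)))

print(sample.dims)          # default dimension names generated by Xarray
print(sample.shape)         # same shape as the wrapped NumPy array
print(sample.dtype)         # element type of the underlying data
print(type(sample.values))  # .values returns the underlying numpy.ndarray
```

The wrapped data stays available as a plain NumPy array through `.values`, so existing NumPy code can still be applied to it.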

Assign dimension names

Much of the power of Xarray comes from making use of named dimensions. So let’s add some more useful names! We can do that by passing an ordered list of names using the keyword argument dims:

temp = xr.DataArray(data, dims=['time', 'lat', 'lon'])
temp
<xarray.DataArray (time: 5, lat: 3, lon: 4)>
array([[[293.32608995, 287.75520854, 278.24368034, 289.93349129],
        [279.54926809, 292.58999623, 289.09722696, 295.32393659],
        [286.39552919, 292.62472063, 285.59258638, 279.5746119 ]],

       [[275.22541477, 282.256551  , 278.40045095, 284.60149592],
        [278.20326077, 280.14655679, 279.93812011, 288.63902218],
        [283.54043822, 288.61230464, 287.60904578, 280.82710424]],

       [[280.64538027, 286.82463884, 279.02115827, 279.35974271],
        [271.30562762, 278.36757438, 280.80482198, 281.46167131],
        [287.28502386, 286.33635438, 280.75888695, 285.57650857]],

       [[277.2173685 , 290.53339717, 269.012194  , 285.18244165],
        [291.20111636, 274.80961831, 287.57426188, 285.57278336],
        [282.01410077, 290.57880715, 273.57407499, 288.25025788]],

       [[279.36387344, 289.33578832, 276.95374706, 284.28057509],
        [279.0713138 , 282.48567402, 287.18152603, 285.8043493 ],
        [289.46101083, 287.92493614, 294.59037036, 282.37363415]]])
Dimensions without coordinates: time, lat, lon

This is already an improvement over a plain NumPy array, because we have names for each of the dimensions (or axes in NumPy parlance). Even better, we can take arrays representing the values for the coordinates for each of these dimensions and associate them with the data when we create the DataArray. We’ll see this in the next example.

Create a DataArray with named Coordinates

Make time and space coordinates

Here we will use Pandas to create an array of datetime data, which we will then use to create a DataArray with a named coordinate time.

times = pd.date_range('2018-01-01', periods=5)
times
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05'],
              dtype='datetime64[ns]', freq='D')

We’ll also create arrays to represent sample longitude and latitude:

lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)

Initialize the DataArray with complete coordinate info

When we create the DataArray instance, we pass in the arrays we just created:

temp = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])
temp
<xarray.DataArray (time: 5, lat: 3, lon: 4)>
array([[[293.32608995, 287.75520854, 278.24368034, 289.93349129],
        [279.54926809, 292.58999623, 289.09722696, 295.32393659],
        [286.39552919, 292.62472063, 285.59258638, 279.5746119 ]],

       [[275.22541477, 282.256551  , 278.40045095, 284.60149592],
        [278.20326077, 280.14655679, 279.93812011, 288.63902218],
        [283.54043822, 288.61230464, 287.60904578, 280.82710424]],

       [[280.64538027, 286.82463884, 279.02115827, 279.35974271],
        [271.30562762, 278.36757438, 280.80482198, 281.46167131],
        [287.28502386, 286.33635438, 280.75888695, 285.57650857]],

       [[277.2173685 , 290.53339717, 269.012194  , 285.18244165],
        [291.20111636, 274.80961831, 287.57426188, 285.57278336],
        [282.01410077, 290.57880715, 273.57407499, 288.25025788]],

       [[279.36387344, 289.33578832, 276.95374706, 284.28057509],
        [279.0713138 , 282.48567402, 287.18152603, 285.8043493 ],
        [289.46101083, 287.92493614, 294.59037036, 282.37363415]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0

Set useful attributes

…and while we’re at it, we can also set some attribute metadata:

temp.attrs['units'] = 'kelvin'
temp.attrs['standard_name'] = 'air_temperature'

temp
<xarray.DataArray (time: 5, lat: 3, lon: 4)>
array([[[293.32608995, 287.75520854, 278.24368034, 289.93349129],
        [279.54926809, 292.58999623, 289.09722696, 295.32393659],
        [286.39552919, 292.62472063, 285.59258638, 279.5746119 ]],

       [[275.22541477, 282.256551  , 278.40045095, 284.60149592],
        [278.20326077, 280.14655679, 279.93812011, 288.63902218],
        [283.54043822, 288.61230464, 287.60904578, 280.82710424]],

       [[280.64538027, 286.82463884, 279.02115827, 279.35974271],
        [271.30562762, 278.36757438, 280.80482198, 281.46167131],
        [287.28502386, 286.33635438, 280.75888695, 285.57650857]],

       [[277.2173685 , 290.53339717, 269.012194  , 285.18244165],
        [291.20111636, 274.80961831, 287.57426188, 285.57278336],
        [282.01410077, 290.57880715, 273.57407499, 288.25025788]],

       [[279.36387344, 289.33578832, 276.95374706, 284.28057509],
        [279.0713138 , 282.48567402, 287.18152603, 285.8043493 ],
        [289.46101083, 287.92493614, 294.59037036, 282.37363415]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          kelvin
    standard_name:  air_temperature
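Coordinates and attributes can also be read back programmatically. The sketch below rebuilds a small array with the same labels (the name `demo` and the zero-filled data are placeholders) and pulls out individual pieces.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
demo = xr.DataArray(
    np.zeros((5, 3, 4)),  # placeholder values; only the labels matter here
    coords=[times, lats, lons],
    dims=['time', 'lat', 'lon'],
    attrs={'units': 'kelvin', 'standard_name': 'air_temperature'},
)

print(demo.coords['lat'].values)  # coordinate values as a NumPy array
print(demo['lon'].values)         # dictionary-style access also reaches coordinates
print(demo.attrs['units'])        # attrs behaves like a plain dictionary
```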

Attributes are not preserved by default!

Notice what happens if we perform a mathematical operation with the DataArray: the coordinate values persist, but the attributes are lost. This is done because it is very challenging to know if the attribute metadata is still correct or appropriate after arbitrary arithmetic operations.

To illustrate this, we’ll do a simple unit conversion from Kelvin to Celsius:

temp_in_celsius = temp - 273.15
temp_in_celsius
<xarray.DataArray (time: 5, lat: 3, lon: 4)>
array([[[20.17608995, 14.60520854,  5.09368034, 16.78349129],
        [ 6.39926809, 19.43999623, 15.94722696, 22.17393659],
        [13.24552919, 19.47472063, 12.44258638,  6.4246119 ]],

       [[ 2.07541477,  9.106551  ,  5.25045095, 11.45149592],
        [ 5.05326077,  6.99655679,  6.78812011, 15.48902218],
        [10.39043822, 15.46230464, 14.45904578,  7.67710424]],

       [[ 7.49538027, 13.67463884,  5.87115827,  6.20974271],
        [-1.84437238,  5.21757438,  7.65482198,  8.31167131],
        [14.13502386, 13.18635438,  7.60888695, 12.42650857]],

       [[ 4.0673685 , 17.38339717, -4.137806  , 12.03244165],
        [18.05111636,  1.65961831, 14.42426188, 12.42278336],
        [ 8.86410077, 17.42880715,  0.42407499, 15.10025788]],

       [[ 6.21387344, 16.18578832,  3.80374706, 11.13057509],
        [ 5.9213138 ,  9.33567402, 14.03152603, 12.6543493 ],
        [16.31101083, 14.77493614, 21.44037036,  9.22363415]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0

For an in-depth discussion of how Xarray handles metadata, start in the Xarray docs here.
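When the metadata should travel with a result, one possible approach is to reattach it explicitly. The sketch below shows two options on a small illustrative array: `assign_attrs`, and the `keep_attrs` option of `xr.set_options`. The array `kelvin` and the unit string `'degC'` are assumptions made for this example, and the exact behavior of `keep_attrs` may vary between Xarray versions.

```python
import numpy as np
import xarray as xr

kelvin = xr.DataArray(
    np.array([273.15, 283.15, 293.15]),
    dims=['time'],
    attrs={'units': 'kelvin', 'standard_name': 'air_temperature'},
)

# Option 1: attach the metadata explicitly, updating what the arithmetic changed
celsius = (kelvin - 273.15).assign_attrs(
    units='degC', standard_name=kelvin.attrs['standard_name']
)
print(celsius.attrs)

# Option 2: ask Xarray to carry attributes through operations inside this block.
# Any copied 'units' value would be stale, so it is corrected afterwards.
with xr.set_options(keep_attrs=True):
    celsius_kept = kelvin - 273.15
celsius_kept.attrs['units'] = 'degC'
print(celsius_kept.attrs)
```

Either way, the units entry has to be corrected by hand, which illustrates why Xarray does not try to guess.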

The Dataset: a container for DataArrays with shared coordinates

Along with DataArray, the other key object type in Xarray is the Dataset: a dictionary-like container that holds one or more DataArrays, which can also optionally share coordinates (see docs here).

The most common way to create a Dataset object is to load data from a file (see below). Here, instead, we will create another DataArray and combine it with our temp data.

This will illustrate how the information about common coordinate axes is used.

Create a pressure DataArray using the same coordinates

This code mirrors how we created the temp object above.

pressure_data = 1000.0 + 5 * np.random.randn(5, 3, 4)
pressure = xr.DataArray(
    pressure_data, coords=[times, lats, lons], dims=['time', 'lat', 'lon']
)
pressure.attrs['units'] = 'hPa'
pressure.attrs['standard_name'] = 'air_pressure'

pressure
<xarray.DataArray (time: 5, lat: 3, lon: 4)>
array([[[1000.33791633, 1001.19218009,  998.89609449, 1000.42871059],
        [ 992.33385634, 1002.6960473 ,  989.34909384, 1007.5970938 ],
        [1000.75082377,  995.66007486, 1005.61665466, 1009.8584029 ]],

       [[ 996.11455997, 1007.63039344, 1006.95079257,  992.93506395],
        [ 998.85633855, 1009.43995918, 1004.4432504 ,  997.31938066],
        [1003.15206382,  995.66211731,  999.2432868 ,  998.24985149]],

       [[ 996.21818093,  998.38164719, 1002.49431886,  996.99196776],
        [ 990.60396219,  995.00025116, 1001.77752724,  998.09192565],
        [ 994.2928739 ,  998.10118042, 1000.87587336, 1008.17409797]],

       [[ 996.79875484,  995.49890777,  985.8944147 , 1007.75366839],
        [1000.66408566, 1004.08278271, 1007.04711552,  992.63289495],
        [1004.49742441,  992.04848754,  995.94999194, 1002.87426163]],

       [[1002.69133772, 1000.7453685 ,  999.60860392,  995.83223686],
        [ 994.28071322,  998.61217492, 1001.08441948,  998.06362308],
        [1000.08138087,  993.78511237, 1001.69660615,  992.0442204 ]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          hPa
    standard_name:  air_pressure

Create a Dataset object

Each DataArray in our Dataset needs a name!

The most straightforward way to create a Dataset with our temp and pressure arrays is to pass a dictionary using the keyword argument data_vars:

ds = xr.Dataset(data_vars={'Temperature': temp, 'Pressure': pressure})
ds
<xarray.Dataset>
Dimensions:      (time: 5, lat: 3, lon: 4)
Coordinates:
  * time         (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
  * lat          (lat) float64 25.0 40.0 55.0
  * lon          (lon) float64 -120.0 -100.0 -80.0 -60.0
Data variables:
    Temperature  (time, lat, lon) float64 293.3 287.8 278.2 ... 294.6 282.4
    Pressure     (time, lat, lon) float64 1e+03 1.001e+03 ... 1.002e+03 992.0

Notice that the Dataset object ds is aware that both data arrays sit on the same coordinate axes.
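The shared coordinate axes can be checked directly. This is a small sketch with constant placeholder data; `demo_ds` is a hypothetical stand-in for the Dataset built above.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
coords = {'time': times, 'lat': lats, 'lon': lons}
dims = ['time', 'lat', 'lon']

demo_ds = xr.Dataset(
    data_vars={
        'Temperature': xr.DataArray(np.full((5, 3, 4), 283.0), coords=coords, dims=dims),
        'Pressure': xr.DataArray(np.full((5, 3, 4), 1000.0), coords=coords, dims=dims),
    }
)

print(list(demo_ds.data_vars))  # names of the variables held by the Dataset
print(list(demo_ds.coords))     # one set of coordinates, shared by both variables
print(dict(demo_ds.sizes))      # dimension name -> length
```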

Access Data variables and Coordinates in a Dataset

We can pull out any of the individual DataArray objects in a few different ways.

Using the “dot” notation:

ds.Pressure
<xarray.DataArray 'Pressure' (time: 5, lat: 3, lon: 4)>
array([[[1000.33791633, 1001.19218009,  998.89609449, 1000.42871059],
        [ 992.33385634, 1002.6960473 ,  989.34909384, 1007.5970938 ],
        [1000.75082377,  995.66007486, 1005.61665466, 1009.8584029 ]],

       [[ 996.11455997, 1007.63039344, 1006.95079257,  992.93506395],
        [ 998.85633855, 1009.43995918, 1004.4432504 ,  997.31938066],
        [1003.15206382,  995.66211731,  999.2432868 ,  998.24985149]],

       [[ 996.21818093,  998.38164719, 1002.49431886,  996.99196776],
        [ 990.60396219,  995.00025116, 1001.77752724,  998.09192565],
        [ 994.2928739 ,  998.10118042, 1000.87587336, 1008.17409797]],

       [[ 996.79875484,  995.49890777,  985.8944147 , 1007.75366839],
        [1000.66408566, 1004.08278271, 1007.04711552,  992.63289495],
        [1004.49742441,  992.04848754,  995.94999194, 1002.87426163]],

       [[1002.69133772, 1000.7453685 ,  999.60860392,  995.83223686],
        [ 994.28071322,  998.61217492, 1001.08441948,  998.06362308],
        [1000.08138087,  993.78511237, 1001.69660615,  992.0442204 ]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          hPa
    standard_name:  air_pressure

… or using dictionary access like this:

ds['Pressure']
<xarray.DataArray 'Pressure' (time: 5, lat: 3, lon: 4)>
array([[[1000.33791633, 1001.19218009,  998.89609449, 1000.42871059],
        [ 992.33385634, 1002.6960473 ,  989.34909384, 1007.5970938 ],
        [1000.75082377,  995.66007486, 1005.61665466, 1009.8584029 ]],

       [[ 996.11455997, 1007.63039344, 1006.95079257,  992.93506395],
        [ 998.85633855, 1009.43995918, 1004.4432504 ,  997.31938066],
        [1003.15206382,  995.66211731,  999.2432868 ,  998.24985149]],

       [[ 996.21818093,  998.38164719, 1002.49431886,  996.99196776],
        [ 990.60396219,  995.00025116, 1001.77752724,  998.09192565],
        [ 994.2928739 ,  998.10118042, 1000.87587336, 1008.17409797]],

       [[ 996.79875484,  995.49890777,  985.8944147 , 1007.75366839],
        [1000.66408566, 1004.08278271, 1007.04711552,  992.63289495],
        [1004.49742441,  992.04848754,  995.94999194, 1002.87426163]],

       [[1002.69133772, 1000.7453685 ,  999.60860392,  995.83223686],
        [ 994.28071322,  998.61217492, 1001.08441948,  998.06362308],
        [1000.08138087,  993.78511237, 1001.69660615,  992.0442204 ]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          hPa
    standard_name:  air_pressure

We’ll return to the Dataset object when we start loading data from files.

Subsetting and selection by coordinate values

Much of the power of labeled coordinates comes from the ability to select data based on coordinate names and values, rather than array indices. We’ll explore this briefly here.

NumPy-like selection

Suppose we want to extract all the spatial data for one single date: January 2, 2018. It’s possible to achieve that with NumPy-like index selection:

indexed_selection = temp[1, :, :]  # Index 1 along axis 0 is the time slice we want...
indexed_selection
<xarray.DataArray (lat: 3, lon: 4)>
array([[275.22541477, 282.256551  , 278.40045095, 284.60149592],
       [278.20326077, 280.14655679, 279.93812011, 288.63902218],
       [283.54043822, 288.61230464, 287.60904578, 280.82710424]])
Coordinates:
    time     datetime64[ns] 2018-01-02
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          kelvin
    standard_name:  air_temperature

HOWEVER, notice that this requires us (the user / programmer) to have detailed knowledge of the order of the axes and the meaning of the indices along those axes!

Named coordinates free us from this burden…

Selecting with .sel()

We can instead select data based on coordinate values using the .sel() method, which takes one or more named coordinates as keyword arguments:

named_selection = temp.sel(time='2018-01-02')
named_selection
<xarray.DataArray (lat: 3, lon: 4)>
array([[275.22541477, 282.256551  , 278.40045095, 284.60149592],
       [278.20326077, 280.14655679, 279.93812011, 288.63902218],
       [283.54043822, 288.61230464, 287.60904578, 280.82710424]])
Coordinates:
    time     datetime64[ns] 2018-01-02
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          kelvin
    standard_name:  air_temperature

We got the same result, but

  • we didn’t have to know anything about how the array was created or stored

  • our code is agnostic about how many dimensions we are dealing with

  • the intended meaning of our code is much clearer!
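To make the first two points concrete, here is a sketch in which the same data is stored with its axes in a different order. The names `demo` and `reordered` and the `arange` data are stand-ins for illustration. The `.sel()` call stays the same, while the positional index has to change.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
values = np.arange(60, dtype=float).reshape(5, 3, 4)  # stand-in data
demo = xr.DataArray(values, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])

# Same array, but stored with the axes in a different order
reordered = demo.transpose('lon', 'lat', 'time')

by_label = demo.sel(time='2018-01-02')
by_label_reordered = reordered.sel(time='2018-01-02')  # identical call, no axis bookkeeping

# Positional indexing has to follow the layout: time is now the last axis
by_position_reordered = reordered[:, :, 1]

print(by_label_reordered.dims)
print(by_label_reordered.equals(by_position_reordered))
print(by_label.equals(by_label_reordered.transpose('lat', 'lon')))
```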

Approximate selection and interpolation

With time and space data, we frequently want to sample “near” the coordinate points in our dataset. Here are a few simple ways to achieve that.

Nearest-neighbor sampling

Suppose we want to sample the nearest datapoint within 2 days of date 2018-01-07. Since the last day on our time axis is 2018-01-05, this is well-posed.

.sel has the flexibility to perform nearest neighbor sampling, taking an optional tolerance:

temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))
<xarray.DataArray (lat: 3, lon: 4)>
array([[279.36387344, 289.33578832, 276.95374706, 284.28057509],
       [279.0713138 , 282.48567402, 287.18152603, 285.8043493 ],
       [289.46101083, 287.92493614, 294.59037036, 282.37363415]])
Coordinates:
    time     datetime64[ns] 2018-01-05
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          kelvin
    standard_name:  air_temperature

where we see that .sel indeed pulled out the data for date 2018-01-05.
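The tolerance also acts as a guard. In the sketch below (a one-dimensional illustrative series named `series`), a request three days past the end of the axis falls outside the two-day tolerance, and the lookup fails with a KeyError instead of silently returning a distant point.

```python
from datetime import timedelta

import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2018-01-01', periods=5)
series = xr.DataArray(np.arange(5.0), coords=[times], dims=['time'])

# 2018-01-07 is exactly 2 days from the last point, so the match succeeds
hit = series.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))
print(hit.time.values)

# 2018-01-08 is 3 days away, outside the tolerance, so no point qualifies
try:
    series.sel(time='2018-01-08', method='nearest', tolerance=timedelta(days=2))
    outside_tolerance = False
except KeyError:
    outside_tolerance = True
print(outside_tolerance)
```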

Interpolation

Suppose we want to extract a timeseries for Boulder (40°N, 105°W). Since lon=-105 is not a point on our longitude axis, this requires interpolation between data points.

The .interp() method (see the docs here) works similarly to .sel(). Using .interp(), we can interpolate to any latitude/longitude location:

temp.interp(lon=-105, lat=40)
<xarray.DataArray (time: 5)>
array([289.32981419, 279.66073278, 276.60208769, 278.90749282,
       281.63208396])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
    lon      int64 -105
    lat      int64 40
Attributes:
    units:          kelvin
    standard_name:  air_temperature

Info

Xarray’s interpolation functionality requires the SciPy package!
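The first interpolated value can be checked by hand. Since lat=40 is already a grid point, only the longitude needs interpolating, and .interp() uses linear interpolation by default. The sketch below uses NumPy only and reuses the lat=40 temperatures for 2018-01-01 from the printed array earlier in this notebook; the variable names are illustrative.

```python
import numpy as np

lons = np.linspace(-120, -60, 4)  # -120, -100, -80, -60
# Temperatures at lat=40 on 2018-01-01, copied from the printed temp array
row = np.array([279.54926809, 292.58999623, 289.09722696, 295.32393659])

# lon=-105 lies 3/4 of the way from -120 to -100
weight = (-105 - lons[0]) / (lons[1] - lons[0])
by_hand = (1 - weight) * row[0] + weight * row[1]

# np.interp performs the same 1-D linear interpolation
with_numpy = np.interp(-105, lons, row)

print(weight, by_hand, with_numpy)
```

Both values agree with the first entry of the .interp() output, approximately 289.3298.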

Slicing along coordinates

Frequently we want to select a range (or slice) along one or more coordinate(s). We can achieve this by passing a Python slice object to .sel(), as follows:

temp.sel(
    time=slice('2018-01-01', '2018-01-03'), lon=slice(-110, -70), lat=slice(25, 45)
)
<xarray.DataArray (time: 3, lat: 2, lon: 2)>
array([[[287.75520854, 278.24368034],
        [292.58999623, 289.09722696]],

       [[282.256551  , 278.40045095],
        [280.14655679, 279.93812011]],

       [[286.82463884, 279.02115827],
        [278.36757438, 280.80482198]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 2018-01-03
  * lat      (lat) float64 25.0 40.0
  * lon      (lon) float64 -100.0 -80.0
Attributes:
    units:          kelvin
    standard_name:  air_temperature

Info

The calling sequence for slice always looks like slice(start, stop[, step]), where step is optional.

Notice how the length of each coordinate axis has changed due to our slicing.
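One detail worth knowing is that label-based slices include both endpoints, unlike positional slices, which exclude the stop index. A minimal sketch on an illustrative one-dimensional array named `demo`:

```python
import numpy as np
import xarray as xr

lats = np.linspace(25, 55, 3)  # 25, 40, 55
demo = xr.DataArray(np.arange(3.0), coords=[lats], dims=['lat'])

by_label = demo.sel(lat=slice(25, 40))  # label-based: 40 is included
by_position = demo[0:1]                 # positional: index 1 is excluded

print(by_label.lat.values)
print(by_position.lat.values)
```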

One more selection method: .loc

All of these operations can also be done within square brackets on the .loc attribute of the DataArray:

temp.loc['2018-01-02']
<xarray.DataArray (lat: 3, lon: 4)>
array([[275.22541477, 282.256551  , 278.40045095, 284.60149592],
       [278.20326077, 280.14655679, 279.93812011, 288.63902218],
       [283.54043822, 288.61230464, 287.60904578, 280.82710424]])
Coordinates:
    time     datetime64[ns] 2018-01-02
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          kelvin
    standard_name:  air_temperature

This is sort of in between the NumPy-style selection

temp[1,:,:]

and the fully label-based selection using .sel().

With .loc, we make use of the coordinate values, but lose the ability to specify the names of the various dimensions. Instead, the slicing must be done in the correct order:

temp.loc['2018-01-01':'2018-01-03', 25:45, -110:-70]
<xarray.DataArray (time: 3, lat: 2, lon: 2)>
array([[[287.75520854, 278.24368034],
        [292.58999623, 289.09722696]],

       [[282.256551  , 278.40045095],
        [280.14655679, 279.93812011]],

       [[286.82463884, 279.02115827],
        [278.36757438, 280.80482198]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 2018-01-03
  * lat      (lat) float64 25.0 40.0
  * lon      (lon) float64 -100.0 -80.0
Attributes:
    units:          kelvin
    standard_name:  air_temperature

One advantage of using .loc is that we can use NumPy-style slice notation like 25:45, rather than the more verbose slice(25,45). But of course that also works:

temp.loc['2018-01-01':'2018-01-03', slice(25, 45), -110:-70]
<xarray.DataArray (time: 3, lat: 2, lon: 2)>
array([[[287.75520854, 278.24368034],
        [292.58999623, 289.09722696]],

       [[282.256551  , 278.40045095],
        [280.14655679, 279.93812011]],

       [[286.82463884, 279.02115827],
        [278.36757438, 280.80482198]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 2018-01-03
  * lat      (lat) float64 25.0 40.0
  * lon      (lon) float64 -100.0 -80.0
Attributes:
    units:          kelvin
    standard_name:  air_temperature

What doesn’t work is passing the slices in a different order:

# This will generate an error
# temp.loc[-110:-70, 25:45,'2018-01-01':'2018-01-03']

Opening netCDF data

With its close ties to the netCDF data model, Xarray also supports netCDF as a first-class file format. This means it has easy support for opening netCDF datasets, so long as they conform to some of Xarray’s limitations (such as 1-dimensional coordinates).

Access netCDF data with xr.open_dataset

Once we have a valid path to a data file that Xarray knows how to read, we can open it like this:

ds = xr.open_dataset("../data/sample_kazr_data.nc")
ds
<xarray.Dataset>
Dimensions:                (time: 6945, range: 600)
Coordinates:
  * time                   (time) datetime64[ns] 2022-03-14T00:00:01.38023799...
  * range                  (range) float32 100.7 130.7 ... 1.803e+04 1.806e+04
    azimuth                (time) float32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    elevation              (time) float32 90.0 90.0 90.0 90.0 ... 90.0 90.0 90.0
Data variables:
    temp                   (time, range) float32 ...
    reflectivity           (time, range) float32 ...
    mean_doppler_velocity  (time, range) float32 ...
Attributes: (12/28)
    command_line:             idl -R -n kazrcfrcor -s guc -f M1 -b 20220314
    Conventions:              ARM-1.2 CF/Radial-1.4 instrument_parameters rad...
    process_version:          vap-kazrcfrcor-1.4-0.el7
    dod_version:              kazrcfrcorge-c0-1.4
    input_datastreams:        guckazrcfrgeM1.a1 : 1.3 : 20220313.230007-20220...
    site_id:                  guc
    ...                       ...
    range_offset_ch1:           -1.4 m
    range_offset_ch2:           70.7 m
    software_version:         1.7.6 (Wed Mar 23 17:10:35 UTC 2016 leachman
    title:                    ARM KAZR Corrected Moments
    doi:                      10.5439/1560129
    history:                  created by user dsmgr on machine flint at 2022-...

Subsetting the Dataset

Our call to xr.open_dataset() above returned a Dataset object that we’ve decided to call ds. We can then pull out individual fields:

ds.reflectivity
<xarray.DataArray 'reflectivity' (time: 6945, range: 600)>
[4167000 values with dtype=float32]
Coordinates:
  * time       (time) datetime64[ns] 2022-03-14T00:00:01.380237999 ... 2022-0...
  * range      (range) float32 100.7 130.7 160.6 ... 1.8e+04 1.803e+04 1.806e+04
    azimuth    (time) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
    elevation  (time) float32 90.0 90.0 90.0 90.0 90.0 ... 90.0 90.0 90.0 90.0
Attributes:
    long_name:            Equivalent reflectivity factor
    units:                dBZ
    ancillary_variables:  qc_reflectivity significant_detection_mask
    resolution:           0.001
    standard_name:        equivalent_reflectivity_factor
    comment:              To unpack field, multiply values by the scale_facto...

(recall that we can also use dictionary syntax like ds['reflectivity'] to do the same thing)

Datasets also support many of the same subsetting operations as DataArray, but will perform the operation on every data variable that has the selected dimension:

first_hour = ds.sel(time=slice("2022-03-14T00:00:00", "2022-03-14T05:00:00"))
first_hour
<xarray.Dataset>
Dimensions:                (time: 6945, range: 600)
Coordinates:
  * time                   (time) datetime64[ns] 2022-03-14T00:00:01.38023799...
  * range                  (range) float32 100.7 130.7 ... 1.803e+04 1.806e+04
    azimuth                (time) float32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    elevation              (time) float32 90.0 90.0 90.0 90.0 ... 90.0 90.0 90.0
Data variables:
    temp                   (time, range) float32 ...
    reflectivity           (time, range) float32 ...
    mean_doppler_velocity  (time, range) float32 ...
Attributes: (12/28)
    command_line:             idl -R -n kazrcfrcor -s guc -f M1 -b 20220314
    Conventions:              ARM-1.2 CF/Radial-1.4 instrument_parameters rad...
    process_version:          vap-kazrcfrcor-1.4-0.el7
    dod_version:              kazrcfrcorge-c0-1.4
    input_datastreams:        guckazrcfrgeM1.a1 : 1.3 : 20220313.230007-20220...
    site_id:                  guc
    ...                       ...
    range_offset_ch1:           -1.4 m
    range_offset_ch2:           70.7 m
    software_version:         1.7.6 (Wed Mar 23 17:10:35 UTC 2016 leachman
    title:                    ARM KAZR Corrected Moments
    doi:                      10.5439/1560129
    history:                  created by user dsmgr on machine flint at 2022-...

And further subsetting to a single DataArray:

first_hour.reflectivity
<xarray.DataArray 'reflectivity' (time: 6945, range: 600)>
[4167000 values with dtype=float32]
Coordinates:
  * time       (time) datetime64[ns] 2022-03-14T00:00:01.380237999 ... 2022-0...
  * range      (range) float32 100.7 130.7 160.6 ... 1.8e+04 1.803e+04 1.806e+04
    azimuth    (time) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
    elevation  (time) float32 90.0 90.0 90.0 90.0 90.0 ... 90.0 90.0 90.0 90.0
Attributes:
    long_name:            Equivalent reflectivity factor
    units:                dBZ
    ancillary_variables:  qc_reflectivity significant_detection_mask
    resolution:           0.001
    standard_name:        equivalent_reflectivity_factor
    comment:              To unpack field, multiply values by the scale_facto...
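To see a Dataset-wide selection act on every variable at once, here is a small sketch with a synthetic dataset. The names `demo_ds` and `subset`, the hourly time axis, and the three range gates are all hypothetical stand-ins for the radar file.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2022-03-14', periods=6, freq='h')  # hypothetical hourly samples
ranges = np.array([100.0, 200.0, 300.0])
demo_ds = xr.Dataset(
    data_vars={
        'temp': (('time', 'range'), np.zeros((6, 3))),
        'reflectivity': (('time', 'range'), np.ones((6, 3))),
    },
    coords={'time': times, 'range': ranges},
)

# One call trims the time axis of both variables (endpoints are inclusive)
subset = demo_ds.sel(time=slice('2022-03-14T00:00:00', '2022-03-14T02:00:00'))

print(dict(subset.sizes))
print({name: var.shape for name, var in subset.data_vars.items()})
```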

Aggregation operations

Not only can you use the named dimensions for manual slicing and indexing of data, but you can also use them to control aggregation operations, like std (standard deviation):

reflectivity = ds['reflectivity']
reflectivity.std(dim=['range'])
<xarray.DataArray 'reflectivity' (time: 6945)>
array([10.319601, 10.394982, 10.233433, ..., 11.007545, 10.772666,
       10.766087], dtype=float32)
Coordinates:
  * time       (time) datetime64[ns] 2022-03-14T00:00:01.380237999 ... 2022-0...
    azimuth    (time) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
    elevation  (time) float32 90.0 90.0 90.0 90.0 90.0 ... 90.0 90.0 90.0 90.0

Info

Aggregation methods for Xarray objects operate over the named coordinate dimension(s) specified by keyword argument dim. Compare to NumPy, where aggregations operate over specified numbered axes.
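The comparison with NumPy can be sketched as follows, using an illustrative random array (`raw` and `labeled` are names invented for this example). Both calls compute the same numbers, but the Xarray version names the dimension being reduced.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
raw = rng.standard_normal((5, 3, 4))
labeled = xr.DataArray(raw, dims=['time', 'lat', 'lon'])

numpy_mean = raw.mean(axis=0)           # must remember that axis 0 is time
xarray_mean = labeled.mean(dim='time')  # states the intent directly

print(xarray_mean.dims)
print(np.allclose(numpy_mean, xarray_mean.values))
```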

Using the sample dataset, we can calculate the time-mean temperature profile over the lowest 5000 m!

temps = ds.temp
temps_lowest_5000m = temps.sel(range=slice(0., 5000))
prof = temps_lowest_5000m.mean(dim="time")
prof
<xarray.DataArray 'temp' (range: 164)>
array([ -0.51454675,  -0.9053056 ,  -1.0829703 ,  -1.4716767 ,
        -1.6830121 ,  -2.054755  ,  -2.2459652 ,  -2.6187873 ,
        -2.8358402 ,  -3.2145638 ,  -3.4060874 ,  -3.805035  ,
        -3.988154  ,  -4.3917727 ,  -4.574227  ,  -4.971466  ,
        -5.1385016 ,  -5.5681195 ,  -5.5681195 ,  -6.0736923 ,
        -6.449356  ,  -6.449356  ,  -6.989866  ,  -6.989866  ,
        -7.4487042 ,  -7.958132  ,  -7.958132  ,  -8.46593   ,
        -8.46593   ,  -8.938426  ,  -9.372833  ,  -9.372833  ,
        -9.825161  ,  -9.825161  , -10.312861  , -10.835717  ,
       -10.835717  , -11.321515  , -11.321515  , -11.299675  ,
       -11.667103  , -11.667103  , -11.930359  , -11.930359  ,
       -12.142204  , -12.498228  , -12.498228  , -12.820213  ,
       -12.820213  , -13.252464  , -13.695308  , -13.695308  ,
       -14.17931   , -14.17931   , -14.523217  , -14.839883  ,
       -14.839883  , -15.140017  , -15.140017  , -15.493524  ,
       -15.834648  , -15.834648  , -16.178988  , -16.178988  ,
       -16.54406   , -16.919527  , -16.919527  , -17.297638  ,
       -17.297638  , -17.297638  , -18.052618  , -18.052618  ,
       -18.052618  , -18.681488  , -18.681488  , -18.681488  ,
       -19.483578  , -19.483578  , -19.483578  , -19.483578  ,
...
       -21.316685  , -21.316685  , -21.73009   , -21.73009   ,
       -21.73009   , -22.307793  , -22.307793  , -22.307793  ,
       -23.056702  , -23.056702  , -23.056702  , -23.056702  ,
       -23.69746   , -23.69746   , -23.69746   , -24.580833  ,
       -24.580833  , -24.580833  , -25.405659  , -25.405659  ,
       -25.405659  , -25.405659  , -25.99787   , -25.99787   ,
       -25.99787   , -26.740292  , -26.740292  , -26.740292  ,
       -27.575071  , -27.575071  , -27.575071  , -27.575071  ,
       -28.335821  , -28.335821  , -28.335821  , -29.146078  ,
       -29.146078  , -29.146078  , -29.926023  , -29.926023  ,
       -29.926023  , -29.926023  , -30.69776   , -30.69776   ,
       -30.69776   , -31.382189  , -31.382189  , -31.382189  ,
       -31.382189  , -31.382189  , -32.53377   , -32.53377   ,
       -32.53377   , -32.53377   , -32.53377   , -32.53377   ,
       -32.53377   , -33.788513  , -33.788513  , -33.788513  ,
       -33.788513  , -33.788513  , -33.788513  , -35.15842   ,
       -35.15842   , -35.15842   , -35.15842   , -35.15842   ,
       -35.15842   , -35.15842   , -36.772747  , -36.772747  ,
       -36.772747  , -36.772747  , -36.772747  , -36.772747  ],
      dtype=float32)
Coordinates:
  * range    (range) float32 100.7 130.7 160.6 ... 4.927e+03 4.957e+03 4.987e+03
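
The same select-then-aggregate pattern can be tried without the radar file. The sketch below uses a hypothetical temperature field (made-up heights and a simple linear lapse) purely to show how a label-based slice and a mean over time combine.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical coordinates: four hourly time steps and four heights in metres.
times = pd.date_range("2022-03-14 00:00", "2022-03-14 03:00", periods=4)
heights = np.array([100.0, 2500.0, 5000.0, 7500.0])

# Made-up field: cools with height, warms by 1 degree per time step.
values = -0.005 * heights[np.newaxis, :] + np.arange(4)[:, np.newaxis]
temp = xr.DataArray(
    values,
    dims=("time", "range"),
    coords={"time": times, "range": heights},
    name="temp",
)

# Label-based slices include both end points, so 5000.0 is kept.
lowest = temp.sel(range=slice(0.0, 5000.0))

# Collapse the time dimension; the range coordinate is carried along.
profile = lowest.mean(dim="time")
print(profile)
```

Note that the result still has its range coordinate attached, which is what lets the plotting calls later in this notebook label the axes automatically.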

Plotting with Xarray

Another major benefit of using labeled data structures is that they enable automated plotting with sensible axis labels.

Simple visualization with .plot()

Much like we saw in Pandas, Xarray includes an interface to Matplotlib that we can access through the .plot() method of every DataArray.

For quick and easy data exploration, we can just call .plot() without any modifiers:

prof.plot();
../../_images/xarray-intro_80_0.png

Here Xarray has generated a line plot of the temperature data against the coordinate variable range. The metadata are also used to auto-generate axis labels and units.

Customizing the plot

As in Pandas, the .plot() method is mostly just a wrapper to Matplotlib, so we can customize our plot in familiar ways.

In this temperature profile example, we would like to make one change:

  • swap the axes so that range (distance from the vertically pointing radar) is on the y (vertical) axis of the figure, so that up is up

A single keyword argument to our .plot() call will take care of this:

prof.plot(y="range")
[<matplotlib.lines.Line2D at 0x2930010c0>]
../../_images/xarray-intro_83_1.png
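
Because .plot() hands extra keyword arguments through to Matplotlib and accepts an existing axes object, further styling works the usual way. The following is a minimal sketch on a synthetic profile; the data, attribute values, and title are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

# Synthetic profile with metadata, standing in for the real temperature data.
heights = np.linspace(100.0, 5000.0, 50)
profile = xr.DataArray(
    -0.0065 * heights,
    dims="range",
    coords={"range": ("range", heights, {"long_name": "Height above radar", "units": "m"})},
    name="temp",
    attrs={"long_name": "Temperature", "units": "degC"},
)

fig, ax = plt.subplots(figsize=(4, 6))

# y="range" puts the coordinate on the vertical axis; color and linestyle
# are ordinary Matplotlib line properties passed straight through.
lines = profile.plot(y="range", ax=ax, color="tab:red", linestyle="--")

# Anything else is plain Matplotlib on the axes we supplied.
ax.set_title("Synthetic temperature profile")
ax.grid(True)
```

The axis labels are built from the long_name and units attributes, so setting good metadata once pays off in every figure.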

Plotting 2D data

In the example above, the .plot() method produced a line plot.

What if we call .plot() on a 2D array?

temps.sel(range=slice(0, 5000)).plot(y='range', cmap='Spectral_r');
../../_images/xarray-intro_85_0.png

We can also make this interactive!

temps.sel(range=slice(0, 5000)).hvplot(x='time', y='range', cmap='Spectral_r', rasterize=True)
ds.reflectivity.sel(range=slice(0, 5000)).plot(y='range', cmap='Spectral_r');
../../_images/xarray-intro_88_0.png
ds.reflectivity.sel(range=slice(0, 5000)).hvplot(x='time', y='range', cmap='Spectral_r', rasterize=True)

Customize our Interactive Plots

Our time axis doesn’t tell us much… we can change that! Also note that we add additional parameters to customize our view of the field.

formatter = DatetimeTickFormatter(hours="%d %b %Y \n %H:%M UTC")
reflectivity_plot = ds.reflectivity.sel(range=slice(0, 5000)).hvplot(x='time', y='range', cmap='Spectral_r', xformatter=formatter, clim=(-20, 40), rasterize=True)
reflectivity_plot

And the same for velocity…

velocity_plot = ds.mean_doppler_velocity.sel(range=slice(0, 5000)).hvplot(x='time', y='range', cmap='seismic', xformatter=formatter, clim=(-5, 5), rasterize=True)
velocity_plot

Combine our Plots

Now that we have our interactive plots, we can place them side by side using the + operator:

reflectivity_plot + velocity_plot

Or stacked on top of each other…

(reflectivity_plot + velocity_plot).cols(1)

For the static 2D figures above, Xarray recognized that the DataArray object calling the plot method has two coordinate variables, and generated a 2D plot using the pcolormesh method from Matplotlib.

In this case, the two coordinates are time and range, so each figure is a time-height cross section of the field above the radar.
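
The 2D dispatch of .plot() can be checked on synthetic data. This is a minimal sketch with a random field on hypothetical time and range coordinates; it only demonstrates that a two-dimensional DataArray yields a Matplotlib pcolormesh object.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.collections import QuadMesh
import numpy as np
import xarray as xr

# Hypothetical coordinates: six hourly steps and twenty range gates.
times = np.arange("2022-03-14T00", "2022-03-14T06", dtype="datetime64[h]").astype("datetime64[ns]")
heights = np.linspace(100.0, 5000.0, 20)

# Random values stand in for a real radar field.
field = xr.DataArray(
    np.random.default_rng(0).normal(size=(times.size, heights.size)),
    dims=("time", "range"),
    coords={"time": times, "range": heights},
    name="synthetic_field",
)

fig, ax = plt.subplots()

# With two dimensions, .plot() falls through to pcolormesh and adds a colorbar.
mesh = field.plot(y="range", cmap="Spectral_r", ax=ax)
print(type(mesh).__name__)
```

Calling field.plot.pcolormesh(...) explicitly gives the same result and can make the intent clearer in longer scripts.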


Summary

Xarray brings the joy of Pandas-style labeled data operations to N-dimensional data. As such, it has become a central workhorse in the geoscience community for the analysis of gridded datasets. Xarray allows us to open self-describing NetCDF files and make full use of the coordinate axes, labels, units, and other metadata. By making use of labeled coordinates, our code is often easier to write, easier to read, and more robust.

We also covered some interactive plots using Xarray and hvPlot!

What’s next?

Additional notebooks to appear in this section will go into more detail about

  • arithmetic and broadcasting with Xarray data structures

  • using “group by” operations

  • remote data access with OpenDAP

  • more advanced visualization including map integration with Cartopy

Resources and references

This notebook was adapted from material in Unidata’s Python Training.

The best resource for Xarray is the Xarray documentation.

Another excellent resource is this Xarray Tutorial collection.