
Chicago Fall Colors in Context: Using GHCNd Data!

Tonight, Ben Blaiszik (@BenBlaiszik) shared the following:

[Image: Ben's fall picture]

So, I sat down to see if I could answer this question, using the Global Historical Climatology Network daily (GHCNd) dataset.

The Data

This dataset is commonly used for calculating climatological normals, with long records and worldwide spatial coverage. It is provided by NOAA, with a copy hosted in the cloud on Amazon Web Services (AWS).

Imports

We use calendar for month names, core components of the scientific Python stack (NumPy and pandas), hvPlot for visualization, Dask for parallel and distributed computing, and Cartopy for mapping.

import calendar

import numpy as np
import pandas as pd
import holoviews as hv
import hvplot.pandas
from distributed import LocalCluster
import cartopy.crs as ccrs
hv.extension('bokeh')

Spin up a Cluster

This spins up a local Dask cluster to use for the analysis.

cluster = LocalCluster()
cluster
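
A quick aside: to actually schedule work on this cluster, you would typically also connect a Client. The pandas-based steps below do not strictly require it, but a minimal sketch looks like this:

from distributed import Client

# Connecting a client registers the local cluster as the default Dask scheduler
client = Client(cluster)
client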

Read in the GHCN Data

Let’s get to accessing the data. The main documentation can be found here:

https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily

We are interested in:

  • the daily maximum temperature (TMAX, reported in tenths of degrees Celsius)
  • the daily minimum temperature (TMIN, reported in tenths of degrees Celsius)
  • the daily precipitation accumulation (PRCP, reported in tenths of millimeters)

The Stations Table

The first step in retrieving this dataset is determining which station to use. There is a station table for this, but it requires some custom fixed-width column specifications to load into a dataframe.

This code was provided by Ned Haughton, via his GitLab snippet.

"""
File: read_ghcn.py
Author: Ned Haughton
Description: Code for reading GHCN dly file data
https://gitlab.com/-/snippets/1838910
"""

import pandas as pd


# Metadata specs #

metadata_col_specs = [
    (0,  12),
    (12, 21),
    (21, 31),
    (31, 38),
    (38, 41),
    (41, 72),
    (72, 76),
    (76, 80),
    (80, 86)
]

metadata_names = [
    "ID",
    "LATITUDE",
    "LONGITUDE",
    "ELEVATION",
    "STATE",
    "NAME",
    "GSN FLAG",
    "HCN/CRN FLAG",
    "WMO ID"]

metadata_dtype = {
    "ID": str,
    "STATE": str,
    "NAME": str,
    "GSN FLAG": str,
    "HCN/CRN FLAG": str,
    "WMO ID": str
    }


# Data specs #

data_header_names = [
    "ID",
    "YEAR",
    "MONTH",
    "ELEMENT"]

data_header_col_specs = [
    (0,  11),
    (11, 15),
    (15, 17),
    (17, 21)]

data_header_dtypes = {
    "ID": str,
    "YEAR": int,
    "MONTH": int,
    "ELEMENT": str}

data_col_names = [[
    "VALUE" + str(i + 1),
    "MFLAG" + str(i + 1),
    "QFLAG" + str(i + 1),
    "SFLAG" + str(i + 1)]
    for i in range(31)]
# Join sub-lists
data_col_names = sum(data_col_names, [])

data_replacement_col_names = [[
    ("VALUE", i + 1),
    ("MFLAG", i + 1),
    ("QFLAG", i + 1),
    ("SFLAG", i + 1)]
    for i in range(31)]
# Join sub-lists
data_replacement_col_names = sum(data_replacement_col_names, [])
data_replacement_col_names = pd.MultiIndex.from_tuples(
    data_replacement_col_names,
    names=['VAR_TYPE', 'DAY'])

data_col_specs = [[
    (21 + i * 8, 26 + i * 8),
    (26 + i * 8, 27 + i * 8),
    (27 + i * 8, 28 + i * 8),
    (28 + i * 8, 29 + i * 8)]
    for i in range(31)]
data_col_specs = sum(data_col_specs, [])

data_col_dtypes = [{
    "VALUE" + str(i + 1): int,
    "MFLAG" + str(i + 1): str,
    "QFLAG" + str(i + 1): str,
    "SFLAG" + str(i + 1): str}
    for i in range(31)]
data_header_dtypes.update({k: v for d in data_col_dtypes for k, v in d.items()})


# Reading functions #

def read_station_metadata(filename):
    """Reads in station metadata

    :filename: ghcnd station metadata file.
    :returns: station metadata as a pandas Dataframe

    """
    df = pd.read_fwf(filename, colspecs=metadata_col_specs, names=metadata_names,
                     index_col='ID', dtype=metadata_dtype)

    return df
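
The snippet above also defines column specs for the raw .dly data files themselves. We will not need them here, since we read Parquet from AWS below, but for completeness, here is a minimal sketch (adapted from the same snippet) of how they fit together:

def read_ghcn_data_file(filename):
    """Sketch: parse one GHCN .dly file into a table indexed by
    (ID, YEAR, MONTH, ELEMENT), with a (VAR_TYPE, DAY) column
    for each of the 31 value/flag groups.
    """
    df = pd.read_fwf(filename,
                     colspecs=data_header_col_specs + data_col_specs,
                     names=data_header_names + data_col_names,
                     index_col=data_header_names,
                     dtype=data_header_dtypes)

    # Swap the flat VALUE1/MFLAG1/... names for the (VAR_TYPE, DAY) MultiIndex
    df.columns = data_replacement_col_names
    return df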

Load the station data using the function

We can remotely read in the dataset using the following - no download required!

station_data = read_station_metadata("https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt")
station_data

Visualize the GHCNd Dataset

Let’s visualize the spatial extent. We can use a spatial density map, setting rasterize=True, and adding our different features for reference.

station_data.hvplot.points(x='LONGITUDE',
                           y='LATITUDE',
                           rasterize=True,
                           geo=True,
                           features=['ocean', 'land', 'borders', 'lakes'],
                           projection=ccrs.PlateCarree(),
                           title='GHCNd Station Locations')

Zoom in Around Chicago

Let’s zoom in around Chicago! This allows us to investigate specific stations around this area.

lat_min = 41
lat_max = 43
lon_min = -89
lon_max = -87

stations_around_chicago = station_data[(station_data.LATITUDE > lat_min) & 
                                       (station_data.LATITUDE < lat_max) & 
                                       (station_data.LONGITUDE < lon_max) &
                                       (station_data.LONGITUDE > lon_min)]
stations_around_chicago
stations_around_chicago.hvplot.points(x='LONGITUDE',
                                      y='LATITUDE',
                                      geo=True,
                                      features=['ocean', 'land', 'borders', 'lakes'],
                                      projection=ccrs.PlateCarree(),
                                      hover_cols=['ID', 'NAME'],
                                      title='GHCNd Stations Around Chicago')

We will use the O’Hare International Airport site here, USW00094846, which is the second-to-last row in the table.
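
As a quick sanity check, we can pull that station's metadata directly, since ID is the index of our stations dataframe:

# Confirm the metadata for Chicago O'Hare by its GHCN station ID
station_data.loc['USW00094846']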


Access Station Data from AWS

Now that we have chosen our station, we can load the data from AWS.

Here is the landing page for the dataset’s documentation:

https://registry.opendata.aws/noaa-ghcn/

Finding our Data

The directories in the bucket are configured as follows:

s3://noaa-ghcn-pds/parquet/by_station/STATION={INSERT_STATION_ID}/

for example,

s3://noaa-ghcn-pds/parquet/by_station/STATION=USW00094846/

is the path to the file for Chicago O’Hare.
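
If you would rather discover these paths programmatically, you can list the bucket with s3fs, the same library pandas uses under the hood for s3:// URLs. A quick sketch:

import s3fs

# Anonymous (unauthenticated) access to the public NOAA bucket
fs = s3fs.S3FileSystem(anon=True)

# Peek at the first few per-station directories
fs.ls('noaa-ghcn-pds/parquet/by_station/')[:5]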

Read our Data using Pandas

Now that we have found our data, we can access it with pandas, setting our station ID to USW00094846.

station = 'USW00094846'
ohare = pd.read_parquet(f's3://noaa-ghcn-pds/parquet/by_station/STATION={station}/',
                        storage_options={'anon':True})
ohare
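
Before cleaning it up, it is worth a quick look at which element codes this station actually reports; the table is in long format, with one row per station, date, and element:

# Count the number of observations for each element code
ohare.ELEMENT.value_counts().head(10)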

Clean up the Data

Now, we can clean the data. We

  • Convert the date column to a datetime column
  • Subset for our variables and assign each to its own dataframe
  • Rename the columns and set the index to the date
  • Convert the temperatures from tenths of degrees Celsius to degrees Celsius, and the precipitation from tenths of millimeters to millimeters (divide by 10)

# Clean up the dates
ohare['DATE'] = pd.to_datetime(ohare.DATE)

# Subset for tmax, tmin, and precipitation
tmax = ohare.loc[ohare.ELEMENT == 'TMAX']
tmin = ohare.loc[ohare.ELEMENT == 'TMIN']
precipitation = ohare.loc[ohare.ELEMENT == 'PRCP']

# Rename the columns, set the index to the date
tmax_df = tmax[['DATE', 'DATA_VALUE']].rename(columns={'DATA_VALUE': 'TMAX'}).set_index('DATE')
tmin_df = tmin[['DATE', 'DATA_VALUE']].rename(columns={'DATA_VALUE': 'TMIN'}).set_index('DATE')
precip_df = precipitation[['DATE', 'DATA_VALUE']].rename(columns={'DATA_VALUE': 'PRCP'}).set_index('DATE')

# Convert the temperatures from tenths of degrees Celsius to degrees Celsius
tmax_df['TMAX'] = tmax_df.TMAX/10
tmin_df['TMIN'] = tmin_df.TMIN/10

# Convert the precipitation from tenths of millimeters to millimeters
precip_df['PRCP'] = precip_df.PRCP/10
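
One step we skip here for brevity: every GHCN observation carries a quality flag, and values that failed a QC check can be dropped before averaging. A minimal sketch, assuming the Parquet schema exposes this as a Q_FLAG column that is blank or null when an observation passed QC:

# Keep only observations whose quality flag is blank/null (passed all QC checks)
ohare_qc = ohare[ohare['Q_FLAG'].isna() | (ohare['Q_FLAG'] == '')]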

Calculate and Visualize the Temperature and Precipitation Fields

Now that we have a clean dataset, we set up the following function, which calculates the normal (using 1990-2020 as the default reference period) and compares a given year to it.

This gives us a frame of reference for how “normal” the climatological field is.

def plot_monthly_stats(df, year=2022, ylabel=None, first_year=1990, last_year=2020):
    """
    Helper function to calculate the averages and plot our data
    """
    
    if first_year is None:
        first_year = df.index.year.min()
        
    if last_year is None:
        last_year = df.index.year.max()
    
    # Grab the month names from the calendar module
    months = list(calendar.month_name)[1:]
    
    # Subset for the averaging period
    df_normal = df[(df.index.year >= first_year) & (df.index.year <= last_year)]
    
    # Calculate the average monthly values over the reference period
    monthly_value = df_normal.groupby(df_normal.index.month).mean()
    monthly_value.index = [months[m - 1] for m in monthly_value.index]
    
    # Calculate the monthly averages for the given year
    df_year = df[df.index.year == year]
    monthly_average = df_year.groupby(df_year.index.month).mean()
    # Map month numbers to month names (robust to a partial year)
    monthly_average.index = [months[m - 1] for m in monthly_average.index]
    
    if 'TMAX' in df.columns:
        color = 'red'
        ylabel = 'Daily Maximum Temperature \n (degC)'

    elif 'TMIN' in df.columns:
        color = 'blue'
        ylabel = 'Daily Minimum Temperature \n (degC)'
    
    elif 'PRCP' in df.columns:
        color = 'green'
        ylabel = 'Daily Accumulated Precipitation \n (mm)'
        
    else:
        color='grey'
    
    return monthly_value.hvplot.line(xlabel='Month',
                                     ylabel=ylabel,
                                     label=f'Monthly Average {first_year} - {last_year}',
                                     color='k') * monthly_average.hvplot.line(label=f'Monthly Average {year}',
                                                                              color=color,
                                                                              line_width=3)

Plot the 2022 Comparison

Let’s start with 2022. Notice how relatively dry the last few months have been, with near-normal daily maximum temperatures and above-normal daily minimum temperatures.

Warm and dry are two ingredients for nice fall colors in the Midwest!

(plot_monthly_stats(precip_df) + plot_monthly_stats(tmax_df) + plot_monthly_stats(tmin_df)).cols(1)

Plot 2015-2022 for Comparison

For comparison, here are the last eight years!

from IPython.display import display  # explicit import so this also runs outside a notebook

for year in np.arange(2015, 2023):
    display((plot_monthly_stats(precip_df, year) + plot_monthly_stats(tmax_df, year) + plot_monthly_stats(tmin_df, year)).cols(1))

Conclusions

It has been an absolutely beautiful fall in the Chicago area (and the Upper Midwest in general). Here is one of my pictures from mid-October, taken at Matthiessen State Park in Oglesby, Illinois.

[Photo: fall colors at Matthiessen State Park]

The availability of these cloud-hosted datasets is incredible: their analysis-ready nature lowers the barrier to science. Instead of parsing through thousands of text files, or merging tough-to-align CSVs, people can access these cloud-optimized datasets (stored in Parquet in this case) and jump right into their analysis.

Thank you to NOAA and AWS for making these datasets available, to Ned Haughton for the station-table parsing code, and to Ben for the motivation.