Compare data from the Disdrometer and Rain Gauge

import xarray as xr
import hvplot.xarray
import hvplot.pandas
import pandas as pd
import holoviews as hv
from distributed import Client, LocalCluster
hv.extension('bokeh')

Spin up a Dask Cluster

cluster = LocalCluster()
client = Client(cluster)
client
distributed.diskutils - INFO - Found stale lock file and directory '/Users/mgrover/git_repos/SAILS-radar-analysis/notebooks/precipitation-exploration/dask-worker-space/worker-ir4zk1dc', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/mgrover/git_repos/SAILS-radar-analysis/notebooks/precipitation-exploration/dask-worker-space/worker-xti6v0kt', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/mgrover/git_repos/SAILS-radar-analysis/notebooks/precipitation-exploration/dask-worker-space/worker-z1_b971u', purging
distributed.diskutils - INFO - Found stale lock file and directory '/Users/mgrover/git_repos/SAILS-radar-analysis/notebooks/precipitation-exploration/dask-worker-space/worker-9gganotb', purging

Client

Client-c0f4f414-b03e-11ec-975c-acde48001122

Connection method: Cluster object Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster Info

Read in the data

Here, we use data from:

  • a disdrometer (datastream: gucldM1.b1)

  • a rain gauge (datastream: gucwbpluvio2M1.a1)

We use all the data available, you can query from this website: https://adc.arm.gov/discovery/#/

disdrometer_ds = xr.open_mfdataset('../../data/disdrometer/*')
gauge_ds = xr.open_mfdataset('../../data/rain-gauge/*')

Visualize the Data

We use hvplot here to plot an interactive view of our data, utilizing rasterize to dynamically view our large datasets

from bokeh.models import DatetimeTickFormatter

def apply_formatter(plot, element):
    plot.handles['xaxis'].formatter = DatetimeTickFormatter(hours='%m/%d/%Y \n %H:%M',
                                                            minutes='%m/%d/%Y \n %H:%M',
                                                            hourmin='%m/%d/%Y \n %H:%M',
                                                            days='%m/%d/%Y \n %H:%M',
                                                            months='%m/%d/%Y \n %H:%M')
disdrometer_precip_rate = disdrometer_ds.where(disdrometer_ds.qc_precip_rate == 0).precip_rate.compute()
gauge_precip_rate = gauge_ds.where(gauge_ds.pluvio_status == 0).intensity_rtnrt.compute()
gauge_precip_rate.attrs['long_name'] = disdrometer_precip_rate.attrs['long_name']
gauge_precip_rate['name'] = disdrometer_precip_rate.name
max_precip = 50
disdrometer_rate_plot = disdrometer_precip_rate.hvplot.line(ylim=(0, max_precip),
                                                            label='Disdrometer',
                                                            rasterize=True,
                                                            xlabel='Date/Time',
                                                            title='Disdrometer (gucldM1.b1)').opts(hooks=[apply_formatter])
bucket_rate_plot = gauge_precip_rate.hvplot.line(ylim=(0, max_precip),
                                                 rasterize=True,
                                                 label='Rain Gauge',
                                                 xlabel='Date/Time',
                                                 title='Rain Gauge (gucwbpluvio2M1.a1)').opts(hooks=[apply_formatter])

View our Final Visualization

We can combine these into a single column of plots, matching the views/colorbars

(disdrometer_rate_plot + bucket_rate_plot).cols(1)

Look for Periods of Precip

We use this example of looking for periods above a threshold in pandas - https://stackoverflow.com/questions/55014867/pandas-threshold-data-sequence-based-on-the-length-of-pattern

The main steps here are:

  • convert our data to a dataframe

  • mask of records above baseline (precipitation rate > 0)

  • label groups of sequenced positive values

  • check the size of each sequence group

  • filter values that are positive and the sequence size is lower than n time (minutes)

    • determine an “event threshold” value (in minutes)

  • determine an event as one of these occurences

Convert to a pandas.DataFrame

Convert our Rain Gauge Data to a pandas.DataFrame For this exercise, we will use a pandas.DataFrame instead of an xarray.Dataset

df = pd.DataFrame({'gauge_precip':gauge_precip_rate.values},
                  index=gauge_precip_rate.time.values)

Filter for Positive Precipitation Values

masked_gauge = df > 0
masked_gauge
gauge_precip
2021-04-23 17:55:00 False
2021-04-23 17:56:00 False
2021-04-23 17:57:00 False
2021-04-23 17:58:00 False
2021-04-23 17:59:00 False
... ...
2022-03-29 20:55:00 False
2022-03-29 20:56:00 False
2022-03-29 20:57:00 False
2022-03-29 20:58:00 False
2022-03-29 20:59:00 False

428505 rows × 1 columns

Separate our data based on consecutive, non-missing values

sequence_groups = abs(masked_gauge.astype(int).diff(1).fillna(0)).cumsum().squeeze()
sequence_groups
2021-04-23 17:55:00       0.0
2021-04-23 17:56:00       0.0
2021-04-23 17:57:00       0.0
2021-04-23 17:58:00       0.0
2021-04-23 17:59:00       0.0
                        ...  
2022-03-29 20:55:00    9978.0
2022-03-29 20:56:00    9978.0
2022-03-29 20:57:00    9978.0
2022-03-29 20:58:00    9978.0
2022-03-29 20:59:00    9978.0
Name: gauge_precip, Length: 428505, dtype: float64

Use Groupby to bin our groups, and transform back to the dataframe

sequence_size = masked_gauge.groupby(sequence_groups).transform(len)
sequence_size
gauge_precip
2021-04-23 17:55:00 20032
2021-04-23 17:56:00 20032
2021-04-23 17:57:00 20032
2021-04-23 17:58:00 20032
2021-04-23 17:59:00 20032
... ...
2022-03-29 20:55:00 261
2022-03-29 20:56:00 261
2022-03-29 20:57:00 261
2022-03-29 20:58:00 261
2022-03-29 20:59:00 261

428505 rows × 1 columns

Merge our filters into a single dataframe

Currently, these groups/checks are in separate dataframes - let’s bring these together!

df_extended = pd.concat([df, masked_gauge, sequence_groups, sequence_size], axis=1)
df_extended.columns = ['value', 'is_positive', 'sequence_group', 'sequence_size']
df_extended
value is_positive sequence_group sequence_size
2021-04-23 17:55:00 0.0 False 0.0 20032
2021-04-23 17:56:00 0.0 False 0.0 20032
2021-04-23 17:57:00 0.0 False 0.0 20032
2021-04-23 17:58:00 0.0 False 0.0 20032
2021-04-23 17:59:00 0.0 False 0.0 20032
... ... ... ... ...
2022-03-29 20:55:00 0.0 False 9978.0 261
2022-03-29 20:56:00 0.0 False 9978.0 261
2022-03-29 20:57:00 0.0 False 9978.0 261
2022-03-29 20:58:00 0.0 False 9978.0 261
2022-03-29 20:59:00 0.0 False 9978.0 261

428505 rows × 4 columns

Determine our Event Duration Thresholds

As mentioned before, we are interested in finding consecutive periods of precipitation. The question is - how long of a period are we looking for? 5 minutes? 30 minutes? Let’s try out some thresholds, ranging from 5 to 30 minutes, investigating how increasing the threshold by one minute impacts how many groups we end up with.

group_sizes = np.arange(5, 31, 1)
df_list = []
for group_size in group_sizes:
    good_groups = (df_extended.sequence_size >= group_size) & (df_extended.is_positive)
    
    good_groups_df = pd.DataFrame({'time_threshold':str(group_size),
                                   'group_size':len(df_extended[good_groups].sequence_group.unique())},
                                  index=[0])
    df_list.append(good_groups_df)

group_size_df = pd.concat(df_list, ignore_index=True)

Visualize our Threshold Value Options

It looks like there is a steep drop off (a lot fewer events) after 11 minutes - 11 minutes looks like the sweet spot before we filter out too many events!

group_size_df.plot.bar(x='time_threshold',
                       y='group_size',
                       xlabel='# of Consecutive Minutes with Precip',
                       ylabel='Count',
                       title=f'Number of Precip Periods Between \n {str(masked_gauge.index.min())} and {str(masked_gauge.index.max())}',
                       legend=False);
../_images/disdrometer-gauge-comparison_28_0.png

Apply our filter to the dataset

Now that we have our threshold, let’s apply this to our dataset. We filter for:

  • Events greater than 11 minutes in duration

  • Events with positive precipitation

precip_events = df_extended[(df_extended.sequence_size >= 11) & (df_extended.value >= 0)].sequence_group.unique()
precip_events.astype(int)
array([   0,    2,    8, ..., 9962, 9966, 9978])

Now that we have our “precip_events”, let’s go back to an xarray.Dataset

We use a helper function to convert back to our xarray.Dataset

def create_dataset(df,
                   group,
                   disdrometer_precip_rate=disdrometer_precip_rate,
                   gauge_precip_rate=gauge_precip_rate):
    """
    Creates an xarray.Dataset from a dataframe with identified groups
    
    Parameters
    ----------
    df = pd.Dataframe
      Dataframe with the identified groups (sequence group) - see previous cells

    group = float
      Group number, identified by previous cells

    disdrometer_precip_rate: xarray.DataArray
      Disdrometer precipitation rate (mm/hr)

    guage_precip_rate: xarray.DataArray
      Rain gauge precipitation rate (mm/hr)
      
    Returns
    -------
    ds_out: xarray.Dataset
       Dataset for given group, subset with corresponding time and data values
    """
    # Identify the group
    df_event = df_extended.loc[df_extended.sequence_group == group]
    
    # Grab the start and end time
    start_time = df_event.index.min()
    end_time = df_event.index.max()
    
    # Grab the disdrometer values
    disdrometer_values = disdrometer_precip_rate.sel(time=slice(start_time, end_time))
    gauge_values = gauge_precip_rate.sel(time=slice(start_time, end_time))
    
    ds_out = xr.Dataset({'disdrometer':disdrometer_values,
                         'rain_gauge':gauge_values})
    
    ds_out['start_time'] = ds_out.time.data.min()
    ds_out['end_time'] = ds_out.time.data.min()
    
    ds_out = ds_out.set_coords(['start_time', 'end_time'])
    
    ds_out['event_start_time'] = ds_out.time.data.min()
    
    return ds_out

Loop through the different groups and create datasets

We loop through and do this for each of precipitation events, and merge into a single dataset

ds_list = []
for event in precip_events:
    ds_list.append(create_dataset(df_extended, event))
all_events = xr.concat(ds_list, dim='event_start_time')
all_events
<xarray.Dataset>
Dimensions:           (time: 411032, event_start_time: 1106)
Coordinates:
  * time              (time) datetime64[ns] 2021-04-23T17:55:00 ... 2022-03-2...
    start_time        (event_start_time) datetime64[ns] 2021-04-23T17:55:00 ....
    end_time          (event_start_time) datetime64[ns] 2021-04-23T17:55:00 ....
  * event_start_time  (event_start_time) datetime64[ns] 2021-04-23T17:55:00 ....
Data variables:
    disdrometer       (event_start_time, time) float64 nan nan nan ... nan nan
    rain_gauge        (event_start_time, time) float32 0.0 0.0 0.0 ... 0.0 0.0

Compare Disdrometer and Rain Gauge Precipitation Values for Each Event

Let’s take a look at the last 100 events - how do precipitation rates for the:

  • Disdrometer

  • Rain Gauge

Compare with one another?

g = all_events.isel(event_start_time=range(-101,-1)).plot.scatter(x='rain_gauge',
                                                                 y='disdrometer',
                                                                 col='event_start_time',
                                                                 col_wrap=10)

g.set_xlabels('Rain Gauge Precip Intensity \n (mm/hr)')
g.set_ylabels('Disdrometer Precip Intensity \n (mm/hr)')

g.set_titles('Event Start: {value}', maxchar=50)

plt.ylim(0,10)
plt.xlim(0,10)
plt.show()
plt.savefig('last_100_disdrometer_gauge_comparisons.png', facecolor='white', transparent=False);
../_images/disdrometer-gauge-comparison_37_0.png