uscrn.get_nrt_data

uscrn.get_nrt_data(period, which='hourly', *, n_jobs=None, cat=False)

Get USCRN near-real-time data.

These are the “update” files sent out through the GTS weather wire system.

Unlike the archive files (get_data()), where sites are in separate files, these files contain data from all sites in one file, reducing the amount of data that needs to be downloaded and processed to get the most recent data.

2020-10-06 20 UTC is the first available hourly file, while 2020-10-06 is the first available daily file.

Note

Variable and dataset metadata are included in the .attrs dict. These can be preserved if you have pandas v2.1+ and save the dataframe to Parquet format with the PyArrow engine.

df.to_parquet('crn.parquet', engine='pyarrow')
Parameters:
  • period (Any | tuple[Any, Any]) – Single element or 2-tuple expressing the (inclusive) time bounds of the period of interest (UTC). Elements can be integers (used to slice the list of available files) or something coercible to a pandas.Timestamp (e.g. str, datetime.datetime, pandas.Timestamp). Timestamps are treated as inclusive bounds, while integers follow normal Python slicing rules (upper bound is exclusive). Timestamps correspond to the file name time (see examples and notes below). Use None to indicate an open-ended bound; period=None means to load all available files. Timestamps without timezone are assumed to be in UTC.

  • which (Literal['hourly', 'daily']) – Which dataset. Only hourly and daily are available.

  • n_jobs (int | None) – Number of parallel joblib jobs to use for loading the individual files. The default is to use min(joblib.cpu_count() - 1, num_files).

  • cat (bool) – Convert some columns to pandas categorical type.

Return type:

DataFrame

Examples

>>> import uscrn

Latest available hourly data:

>>> df = uscrn.get_nrt_data(-1, "hourly")

Last 12 hourly files:

>>> df = uscrn.get_nrt_data((-12, None), "hourly")

Get the 2023-08-31 17 UTC hourly file (majority of the data is for the 16 UTC hour):

>>> df = uscrn.get_nrt_data("2023-08-31 17", "hourly")

All the files from that day:

>>> df = uscrn.get_nrt_data(("2023-08-31 00", "2023-08-31 23"), "hourly")

Latest available daily data:

>>> df = uscrn.get_nrt_data(-1, "daily")
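
The two kinds of period bounds can be illustrated without downloading anything. The sketch below is only an illustration of the documented semantics, not how uscrn implements them: select_files and the list of file times are hypothetical, and it uses the standard library's datetime rather than pandas.Timestamp. Mixing an integer bound with a timestamp bound is not handled here.

```python
from datetime import datetime, timedelta

# Hypothetical hourly file name times for one day (UTC).
file_times = [datetime(2023, 8, 31, 0) + timedelta(hours=h) for h in range(24)]


def select_files(times, period):
    """Illustrative only: choose file times according to `period`."""
    # period=None means all available files.
    if period is None:
        return list(times)
    # A single element selects one file.
    if not isinstance(period, tuple):
        if isinstance(period, int):
            return [times[period]]
        return [t for t in times if t == period]
    a, b = period
    # Integer bounds follow normal Python slicing (upper bound exclusive).
    if all(x is None or isinstance(x, int) for x in (a, b)):
        return list(times[a:b])
    # Timestamp bounds are inclusive on both ends; None is open-ended.
    return [t for t in times if (a is None or t >= a) and (b is None or t <= b)]


last_12 = select_files(file_times, (-12, None))
first_3 = select_files(file_times, (0, 3))  # hours 00, 01, 02; index 3 is excluded
whole_day = select_files(
    file_times, (datetime(2023, 8, 31, 0), datetime(2023, 8, 31, 23))
)  # the 23 UTC file is included
```

Here the integer pair (0, 3) yields three files, while the timestamp pair covering 00 to 23 UTC yields all 24, because both timestamp bounds are included.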

Notes

In the NRT files, the time in the file name is

  • for daily, the end of the day (23:59). The time in the file left-labels the period (i.e., 00:00 of the same day).

  • for hourly, the next hour (e.g., 19:00). The time in the file right-labels the hourly periods. In our example with 19:00 from the file name, the time in the file should be mostly 18:00, with 17:00 or 16:00 (two or three hours behind the file name time) also possible for some sites, depending on the file. For example, see 2024020919, which has mostly 18:00 data, but also some 17:00 and 16:00 data.

That is, the hourly files contain data received during the previous hour.
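
The relationship between the hourly file name time and the times found inside the file can be sketched as follows. This is an illustration, not part of uscrn: expected_main_hour is a hypothetical helper, and reading a stamp such as 2024020919 as YYYYMMDDHH is an assumption based on the example in these notes.

```python
from datetime import datetime, timedelta


def expected_main_hour(file_stamp: str) -> datetime:
    """Hypothetical helper: the in-file time most rows of an hourly file should carry."""
    # Assumed stamp layout: YYYYMMDDHH (UTC), e.g. "2024020919".
    file_time = datetime.strptime(file_stamp, "%Y%m%d%H")
    # Per the notes, most data are one hour behind the file name time.
    return file_time - timedelta(hours=1)


main = expected_main_hour("2024020919")
print(main)  # 2024-02-09 18:00:00

# Some sites may be one or two hours further behind (17:00 and 16:00 here).
possible = [main - timedelta(hours=k) for k in range(3)]
```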

Some info from Howard Diamond (somewhat paraphrased):

The stations transmit data once an hour, but because the times when they transmit rotate around the hour, some stations transmit so early in the hour that they are current only through the previous hour. Therefore, some of these update files will have data from previous hours, in addition to the most recent hour, as a matter of normal course. Stations that are two hours behind may have missed a transmission, which puts them an additional hour behind. Stations transmit blocks of two hours at a time.

See also

NRT data

Notebook example demonstrating using this function to get recent data.

get_data()

To load a year or more of the archive data instead.