1 ESS Data Formats

Twenty years ago, there were a large number of disparate formats for storing large data sets, and the transfer of data from one computing system to another was a big deal, often relying on frequent playing of the “find-the-continents” game. Now, however, there are two general styles of dataset storage: 1) machine-independent, self-documenting gridded datasets, and 2) relational databases (e.g. MS Access, MySQL), with common Excel formats (.csv, .xlsx) as special cases of database tables. Gridded data sets are chiefly represented by the netCDF and HDF5 formats, which include self-documenting “attribute” data. An older “binary” data format (GRIB2) is still in widespread use for exchanging real-time weather-forecasting data, and there is growing “interoperability” among these formats. Databases are in widespread use for storing data that may be heterogeneous in the sense of not being gridded, and possibly having multiple tables.

1.1 netCDF

netCDF (or Network Common Data Form) is the format most frequently used for storing climate-model output as well as some observational data. There is a well-documented convention (CF, for Climate and Forecast) for arranging and internally documenting netCDF datasets, which further simplifies the transfer of data from one system to another.

There are several R packages for handling netCDF data. Of these, the ncdf4 and the related ncdf4.helpers packages are the most useful in practice.

Several other packages exist for working with netCDF data sets (this is not an exhaustive list):

  • RCMIP5: tools for reading and summarizing “CMIP5” data
  • easyNCDF: a set of functions for reading and writing netCDF data sets from and to R arrays
  • cmsaf: tools for reading EUMETSAT energy- and water-balance variables
  • efts: functions for reading ensemble forecast data
  • RNetCDF and ncdf.tools: functions for working with older netCDF 3 files
  • ncdump: reads netCDF attribute data, and organizes it into dataframes

In addition, the rgdal and raster packages support the reading and writing of netCDF files.

1.2 HDF

HDF (or Hierarchical Data Format), like netCDF, is a machine-independent, self-documenting gridded dataset format that is in common use for storing satellite and remote-sensing imagery data. It currently exists in several formats (HDF4, HDF5, HDF-EOS), which can generally be converted to one another. (netCDF4 in fact uses the HDF5 format for storing data.)
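The relationship between netCDF4 and HDF5 can be seen in the first few bytes of a file. The following is a minimal sketch, not part of any of the packages named here: the function names `sniff_format` and `sniff_file` are hypothetical, and the sketch assumes the format signature sits at the very start of the file (HDF5 also allows the signature at later offsets, which is not handled).

```python
# Leading bytes ("magic numbers") that identify each container format.
NETCDF_CLASSIC = (b"CDF\x01", b"CDF\x02", b"CDF\x05")  # classic, 64-bit offset, CDF-5
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"
HDF4_SIGNATURE = b"\x0e\x03\x13\x01"


def sniff_format(header: bytes) -> str:
    """Guess the on-disk format from the first bytes of a file.

    Assumption: the signature is at offset 0.
    """
    if header.startswith(HDF5_SIGNATURE):
        # netCDF4 files are HDF5 files underneath, so they land here too.
        return "HDF5 (possibly netCDF4)"
    if header[:4] in NETCDF_CLASSIC:
        return "netCDF classic (netCDF 3)"
    if header.startswith(HDF4_SIGNATURE):
        return "HDF4"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(8))
```

A netCDF4 file and a plain HDF5 file are indistinguishable at this level; telling them apart requires opening the file with a netCDF or HDF5 library and inspecting its contents.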

Reading and writing HDF5 files in R is supported by the rhdf5 package in the Bioconductor collection:

  • rhdf5: a Bioconductor package for reading and writing HDF5 files

The National Snow and Ice Data Center (NSIDC) has a nice tutorial on HDF-EOS: https://nsidc.org/data/hdfeos/intro.html

NOAA’s Coral Reef Watch (http://www.coralreefwatch.noaa.gov/satellite/hdf/index.php) serves up a variety of HDF (and netCDF) files for coral-bleaching monitoring.

1.3 GRIB

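GRIB (or GRIdded Binary / General Regularly-distributed Information in Binary form, https://en.wikipedia.org/wiki/GRIB) is an older binary format that is still in widespread use for exchanging real-time weather-forecasting data.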
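Unlike netCDF and HDF, a GRIB file is a sequence of self-contained binary messages, each opening with a short “indicator section”. As a rough illustration, the sketch below decodes that section for edition 1 and edition 2 messages using only the Python standard library. The function name `parse_grib_indicator` is hypothetical, and the sketch assumes the buffer begins exactly at the start of a message (real files may carry extra header bytes in front of it).

```python
import struct


def parse_grib_indicator(buf: bytes) -> dict:
    """Decode the indicator section at the start of one GRIB message.

    Assumption: buf starts exactly at the 'GRIB' marker.
    """
    if len(buf) < 8 or buf[:4] != b"GRIB":
        raise ValueError("buffer does not start with a GRIB message")

    edition = buf[7]  # octet 8 holds the edition number in both editions
    if edition == 1:
        # GRIB1: octets 5-7 hold the total message length (big-endian).
        return {"edition": 1, "length": int.from_bytes(buf[4:7], "big")}
    if edition == 2:
        if len(buf) < 16:
            raise ValueError("GRIB2 indicator section needs 16 bytes")
        # GRIB2: octet 7 is the discipline, octets 9-16 the total length.
        (length,) = struct.unpack(">Q", buf[8:16])
        return {"edition": 2, "discipline": buf[6], "length": length}
    raise ValueError(f"unsupported GRIB edition: {edition}")


# Synthetic 16-byte GRIB2 header, built by hand for illustration only.
demo = b"GRIB" + b"\x00\x00" + bytes([0, 2]) + (1234).to_bytes(8, "big")
info = parse_grib_indicator(demo)
```

Decoding the actual fields requires the remaining sections and the WMO code tables, which is why dedicated tools or libraries are normally used for GRIB data.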

1.4 Relational Databases

Relational databases can be thought of as a set of linked tables that efficiently store data that, if stored as a single rectangular table, would be inefficiently large and difficult to extract data from. There are multiple ESS data sets, usually collections of site-specific data (where the sites are irregularly distributed), that can be efficiently stored and queried in this way.
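The idea of linked tables can be sketched with Python's built-in sqlite3 module. Everything in this example is made up for illustration: the table names (`site`, `sample`), the column names, and the values are hypothetical and do not come from any real ESS database. Site metadata are stored once per site, and each sample row refers back to its site by `site_id`, rather than repeating the site name and coordinates on every row.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE site (
    site_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    lat     REAL,
    lon     REAL
);
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    site_id   INTEGER NOT NULL REFERENCES site(site_id),
    depth_cm  REAL,
    value     REAL
);
""")

# Hypothetical, irregularly distributed sites and their samples.
conn.executemany(
    "INSERT INTO site VALUES (?, ?, ?, ?)",
    [(1, "Lake A", 44.0, -121.5), (2, "Bog B", 45.2, -123.1)],
)
conn.executemany(
    "INSERT INTO sample (site_id, depth_cm, value) VALUES (?, ?, ?)",
    [(1, 10.0, 0.42), (1, 20.0, 0.37), (2, 5.0, 0.91)],
)

# Join the two tables to count the samples available at each site.
rows = conn.execute("""
    SELECT site.name, COUNT(sample.sample_id) AS n_samples
    FROM site
    JOIN sample ON sample.site_id = site.site_id
    GROUP BY site.site_id
    ORDER BY site.name
""").fetchall()

for name, n_samples in rows:
    print(name, n_samples)
```

Flattened into one rectangular table, the same data would repeat each site's name, latitude, and longitude on every sample row.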

An example ESS dataset stored as a relational database:

The gridded dataset formats described above are sometimes referred to as NoSQL databases, for the simple reason that they are not SQL databases. (SQL, pronounced “sequel”, stands for Structured Query Language.) The distinction between the two is described here:

2 Getting and Displaying ESS Datasets

There are several approaches for getting or transferring the usually large data sets that are employed in doing Earth-system science.

2.1 SFTP and Globus

FTP (for File Transfer Protocol) is a traditional, and by now fairly old, way of moving data around. In its standard form, it’s not very secure (and hence IT services hate it), but it’s still quite functional. A more secure variant is SFTP (SSH File Transfer Protocol, also known as “Secure FTP”). The most widely used “client” for interacting with FTP sites is likely FileZilla https://filezilla-project.org/. An SFTP “site” has been created for this course; for directions on its use, as well as instructions on using FileZilla, see File Transfer on the Tasks tab. Another, newer approach for transferring files is Globus, which provides a browser-based application for transferring files among “endpoints”.

2.2 THREDDS and OPeNDAP

THREDDS (Thematic Real-time Environmental Distributed Data Services) Data Servers (TDS) provide a mechanism for making remote datasets (generally netCDF datasets) “visible” to local applications (like Panoply). TDS display “catalogs” of multiple files that can, for example, be browsed by Panoply, and individual data sets can then be opened. OPeNDAP provides a further mechanism for subsetting data sets, i.e. selecting an individual slice from a 3d or 4d data set. Both THREDDS and OPeNDAP thus provide a way of avoiding downloading and storing data locally. (Once data are downloaded to a local machine, they are in a sense “frozen”. By not storing datasets locally, and instead accessing them remotely, any updates are automatically included.)
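OPeNDAP subsetting is requested through the URL itself: the variable name is followed by one bracketed [start:stride:stop] index range per dimension, with zero-based, inclusive indices. The sketch below only builds such a URL and does not contact any server. The helper name `opendap_subset_url`, the server address, and the variable name `air` are hypothetical, and the dimension order (time, lat, lon) is an assumption about the example dataset.

```python
def opendap_subset_url(base_url, variable, index_ranges, response="ascii"):
    """Build an OPeNDAP (DAP2) URL that requests a slice of one variable.

    index_ranges holds one (start, stop) or (start, stride, stop) tuple
    per dimension; indices are zero-based and the stop index is inclusive.
    """
    hyperslab = ""
    for rng in index_ranges:
        if len(rng) == 2:
            start, stop = rng
            stride = 1
        else:
            start, stride, stop = rng
        hyperslab += f"[{start}:{stride}:{stop}]"
    # The suffix selects the response type, e.g. "ascii", "dods", "dds".
    return f"{base_url}.{response}?{variable}{hyperslab}"


# Hypothetical dataset: first time step, a small latitude-longitude window.
url = opendap_subset_url(
    "https://example.org/thredds/dodsC/demo/air.nc",
    "air",
    [(0, 0), (10, 20), (30, 40)],
)
print(url)
```

Only the requested slice is sent back by the server, which is what makes remote access to very large 3d or 4d data sets practical.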

Here is a local example of THREDDS-served data:

Here is the THREDDS data server at Unidata (aka “motherlode”)

2.3 Panoply

Panoply is a cross-platform application that can read and display netCDF, HDF and GRIB datasets. See the Tasks tab on this page for directions for installing Panoply. In addition to being able to read and display files on the local file system, Panoply can also open remote catalogs and individual data sets. UNIDATA’s Integrated Data Viewer (IDV) provides some additional viewing options.

3 Some ESS Data Sources

3.1 Climate-model output

CMIP5 is the climate-modeling component of the IPCC AR5 assessment, and nearly all of the data are available online through the ESG (Earth System Grid). Additional climate-simulation data are available from the National Center for Atmospheric Research (NCAR).

3.2 NOAA ESRL PSD

The NOAA Earth System Research Laboratory Physical Sciences Division (ESRL PSD, formerly the Climate Diagnostics Center, CDC) provides an array of gridded data sets, both historical and observational, including the historical “reanalysis” data sets.

3.3 UNIDATA Meteorological case studies

UNIDATA provides a number of case-study data sets of “unique atmospheric phenomena” that can be accessed via their THREDDS server. Here is a [link] to the information page, and to the [Case Studies Library].

3.4 Paleoclimatic datasets

There are two main repositories of paleoclimatic data, Pangaea (European) and NOAA Paleoclimatology:

Other sources of paleoclimatic data include:

3.5 General global-change data

3.6 Data.gov

3.7 Oregon lidar

3.8 Some “Big Data” initiatives