CGA develops RINX: An Open-Source Solution for Information Extraction from Big Raster Datasets

March 27, 2023

By Devika Kakkar and Jeff Blossom
 

Processing Earth observation data stored as time series of rasters is critical to solving some of the most complex problems in geospatial science, ranging from climate change to public health. Researchers increasingly work with large raster datasets that are often terabytes in size. At this scale, traditional GIS methods may fail, and new approaches are needed to analyze these datasets. The objective of this work is to develop methods to interactively analyze big raster datasets, with the goal of efficiently extracting vector data over specific time periods from any set of raster data.

Methodology

RINX (Raster INformation eXtraction) is an end-to-end solution developed by the authors for automatic extraction of information from large raster datasets. RINX relies heavily on open source geospatial techniques and complements these traditional approaches with state-of-the-art high-performance computing techniques. The input to RINX is a set of rasters and a set of data point locations for which information needs to be extracted. The output is a structured representation, in CSV text format, of the information extracted from the rasters for each data point. Loading and pre-processing of the input datasets is automated using a combination of Bash and SQL scripts. The pre-processed input is then fed into the open source spatial database PostGIS, which extracts the required information using multiple spatial techniques. Finally, the extracted output is post-processed to deduplicate and standardize it for research use. RINX is designed to be easy to deploy and scale on any local, cloud, or cluster computing platform. RINX was created to aid the study of environmental conditions and how they affect the health of people over their lifespans for Project Viva, which is described in detail in the following sections. The architecture diagram of RINX is included below:

Project Viva flowchart
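
To make the load-and-extract steps described above more concrete, here is a minimal Python sketch that only builds the command and query strings such a pipeline might use. It is not taken from the RINX scripts. The database name (rinx_demo), table names, file name, tile size, and SRID are all hypothetical placeholders, and the sketch assumes a raster table already exists for raster2pgsql to append to. It relies only on the standard raster2pgsql loader flags and the PostGIS functions ST_Value and ST_Intersects.

```python
import shlex


def load_command(raster_path, table, srid=4269, tile="100x100"):
    """Build a shell pipeline that appends one raster file to a PostGIS table.

    Flags: -a append to an existing table, -F add a 'filename' column,
    -s assign an SRID, -t cut the raster into tiles.
    The SRID, tile size, and database name are assumptions for illustration.
    """
    loader = ["raster2pgsql", "-a", "-F", "-s", str(srid), "-t", tile,
              raster_path, table]
    return " ".join(shlex.quote(part) for part in loader) + " | psql -d rinx_demo"


def extraction_sql(raster_table, point_table):
    """Build a query returning one raster value per (point, raster tile) pair.

    Assumes the point table has 'id' and 'geom' columns (hypothetical schema)
    and that the raster table was loaded with -F so 'filename' is available.
    """
    return (
        "SELECT p.id, r.filename, ST_Value(r.rast, p.geom) AS value\n"
        f"FROM {point_table} AS p\n"
        f"JOIN {raster_table} AS r\n"
        "  ON ST_Intersects(r.rast, p.geom);"
    )


if __name__ == "__main__":
    # Hypothetical file and table names, used only to show the output shape.
    print(load_command("tmean_19810101.bil", "climate_tmean"))
    print(extraction_sql("climate_tmean", "cohort_points"))
```

Storing the source file name with each tile is one way to recover the date and variable of every extracted value later; the actual RINX scripts may track this differently.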

Use Case: Project Viva

The Environmental influences on Child Health Outcomes (ECHO) program (National Institutes of Health, n.d.) is a nationwide program in the United States funded by the National Institutes of Health. ECHO includes over 60 cohorts of children and their mothers, and aims to improve understanding of the effects of environmental exposures on child health and development. One of the ECHO cohorts is Project Viva, a Boston-based longitudinal study of some 2,000 mothers and children. The goal of Project Viva is to find ways to improve the health of mothers and their children by looking at the effects of the mother's diet, as well as other factors, during pregnancy and after birth. A key part of the analysis is calculating various social and environmental exposures at the Viva cohort members' address locations over their lifespans.

RINX was created to aid the study of environmental conditions and how they affect the health of people over their lifespans. This involves calculating exposures such as air pollution, humidity, precipitation, and temperature at cohort member address locations over time. For initial work with one cohort, daily precipitation, temperature, and humidity estimates were needed for 4,796 cohort address locations over a 19-year period, 1999–2017.

The 800-meter resolution PRISM Spatial Climate Dataset for the Conterminous United States was used as the input for this data extraction. PRISM stands for Parameter-elevation Relationships on Independent Slopes Model, created by the PRISM Climate Group at Oregon State University. The PRISM dataset is published in .BIL raster format, with one raster representing one climate variable for one day, covering the period 1981–2020. The total size of the dataset is around 8 TB, comprising over 100,000 rasters of about 85 MB each. The figure below shows a mean temperature map for January 1, 1981 using PRISM 800m climate data.

Heat map with title "Mean temperature on January 1, 1981"
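
As a rough sanity check on the dataset size quoted above, the short Python sketch below counts the daily rasters in a period and multiplies by the per-raster size. It assumes 7 daily variables (the number extracted in this project; the text above does not state how many variables the full archive holds) and decimal units (1 TB = 1,000,000 MB).

```python
from datetime import date


def daily_raster_count(first_year, last_year, variables):
    """Number of rasters when there is one raster per variable per day."""
    days = (date(last_year, 12, 31) - date(first_year, 1, 1)).days + 1
    return days * variables


N_VARIABLES = 7      # assumption: the 7 daily variables named in this article
MB_PER_RASTER = 85   # per-raster size quoted in the text

full_archive = daily_raster_count(1981, 2020, N_VARIABLES)
total_tb = full_archive * MB_PER_RASTER / 1_000_000  # decimal terabytes

print(full_archive, round(total_tb, 1))  # prints: 102270 8.7
```

Under these assumptions the count agrees with "over 100,000 rasters" and a total in the neighborhood of 8 TB. The same function gives 48,580 rasters for 1999–2017, in line with the roughly 48,500 rasters reported in the Results section.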

 

Results

For work on the initial cohort, RINX enabled the extraction of 7 key climate variables: precipitation, temperature (maximum, minimum, mean), dew point temperature (mean), and vapor pressure deficit (minimum, maximum) for 19 years of data from 48,500 800-meter resolution rasters for 4,796 data points. This resulted in 10.3 million "patient-day" calculations, each covering the 7 variables, for a total of 72.1 million observations. Additionally, absolute and relative humidity were calculated from the existing mean temperature and dew point variables, so RINX provided a unified set of 9 climate variables for all persons and days in the dataset. It was deployed and scaled on multiple servers on a high-performance computing cluster. Our initial results indicate that it processes large raster datasets quickly and efficiently: it took 1 day to load and 4 days to process and extract 7 climate variables from 48,500 rasters for the 72.1 million observations at 4,796 locations. RINX enabled the researchers to analyze this big climate dataset at a fine-grained address level. Once the scripts were written, tested, and fine-tuned, processing time was reduced from months to days compared to traditional methods. The results are discussed in detail in the following publication:

Devika Kakkar, Jeffrey Blossom, and Wendy Guan. 8/5/2022. “RINX: A Solution for Information Extraction from Big Raster Datasets.” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences.

The value of RINX is well explained in the words of Dr. Nicholas Nassikas, MD, Beth Israel Deaconess Medical Center; Instructor, Harvard Medical School: “The PRISM climate data extracted by the CGA allowed us to study associations of precipitation, relative humidity and temperature with lung function in children. The climate data will also allow us to study similar associations in adults. There is a need to determine if short term exposure to these weather conditions affects the respiratory health of children and adults, especially in the context of a changing climate.”
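
The Results mention that absolute and relative humidity were derived from mean temperature and mean dew point. The article does not say which formulas were used, so the Python sketch below is only one common way to do it: a Magnus-type approximation for saturation vapor pressure (Bolton 1980 coefficients) and the ideal gas law for water vapor. It assumes both inputs are in degrees Celsius; the function names are illustrative.

```python
import math

R_V = 461.5  # specific gas constant of water vapor, J/(kg K)


def saturation_vapor_pressure_hpa(temp_c):
    """Saturation vapor pressure over water in hPa (Magnus-type approximation)."""
    return 6.112 * math.exp(17.67 * temp_c / (temp_c + 243.5))


def relative_humidity_pct(tmean_c, tdmean_c):
    """Relative humidity in percent from air temperature and dew point."""
    actual = saturation_vapor_pressure_hpa(tdmean_c)     # vapor pressure at dew point
    saturated = saturation_vapor_pressure_hpa(tmean_c)   # vapor pressure at saturation
    return 100.0 * actual / saturated


def absolute_humidity_g_m3(tmean_c, tdmean_c):
    """Water vapor density in g/m^3 from the ideal gas law."""
    vapor_pressure_pa = saturation_vapor_pressure_hpa(tdmean_c) * 100.0
    kg_per_m3 = vapor_pressure_pa / (R_V * (tmean_c + 273.15))
    return kg_per_m3 * 1000.0


if __name__ == "__main__":
    # Example day: mean temperature 20 C, mean dew point 10 C.
    print(round(relative_humidity_pct(20.0, 10.0), 1))
    print(round(absolute_humidity_g_m3(20.0, 10.0), 1))
```

When the dew point equals the air temperature, the relative humidity from this sketch is 100 percent, which is a quick way to check an implementation.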

Conclusions

Our solution is based on open source technology, using PostGIS, and can be deployed on local or cluster computing environments. It provides an efficient way to solve geospatial big data problems, particularly those involving large temporal raster datasets where point location data extraction is desired. Big data is changing the ways data is managed and analyzed, and next-generation GIS tools can help researchers process big data at scale. RINX is an end-to-end data extraction and processing solution for large raster datasets. It is open source and is shared on the CGA GitHub. It can be easily deployed and scaled on any local, cloud, or cluster computing environment. We used RINX to process a large number of PRISM climate datasets; however, our solution could be applied to any temporal raster data, such as NDVI, night lights, and more.

Acknowledgements

This work is sponsored by Dr. Diane Gold of the Harvard T.H. Chan School of Public Health (HSPH) within the NIH-ECHO program, grant UH3OD023286. Data analysis assistance with the Viva cohort was provided by Heike Gibson of HSPH. This work is also partially sponsored by NSF Award #1841403. We would also like to acknowledge Dr. Chris Daly and Dr. Dylan Keon of the PRISM Climate Group at Oregon State University, who provided helpful guidance on using the PRISM data.

Questions/Comments

Any questions or comments on this project can be sent to Devika Kakkar and Jeff Blossom.