Project Aims to Efficiently Reduce Massive Scientific Data

Scientific user facilities, such as those at national laboratories and other government organizations, can produce several billion gigabytes of data per year. This flood of data has begun to outpace researchers' ability to parse it in pursuit of their scientific goals, a supersized problem when it comes to achieving new scientific advances.

To develop new mathematical and computational techniques to reduce the size of these data sets, the U.S. Department of Energy (DOE) awarded $13.7 million to nine projects as part of the Advanced Scientific Computing Research (ASCR) program in September 2021. A team led by Byung-Jun Yoon, associate professor in the Department of Electrical and Computer Engineering at Texas A&M University, has received $2.4 million to address the challenges of moving, storing and processing the massive data sets produced by scientific workflows.

The project's overarching principle is to focus on the scientific objectives behind each data set and to retain the quantities of interest (QoI) that those objectives depend on. By optimizing how the data is represented while keeping the scientific goals in view, Yoon's team can preserve the information that matters most for scientific breakthroughs even after a significant reduction in data size.
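To make that principle concrete, here is a minimal sketch of the idea. The synthetic data, the truncated-SVD reduction, and the simple channel-mean QoI are all illustrative assumptions, not the team's actual pipeline; the point is only that a reduced representation is judged by QoI error rather than by raw reconstruction error.

```python
# Illustrative sketch (assumed data, reduction, and QoI, not the
# project's method): judge a reduction by how well a downstream
# quantity of interest (QoI) survives, not by reconstruction error.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a smooth rank-1 signal buried in correlated noise.
signal = np.outer(np.sin(np.linspace(0, 6, 500)), np.ones(200))
noise = rng.normal(size=(500, 200)) @ rng.normal(size=(200, 200)) * 0.01
data = signal + noise

# Hypothetical QoI: the per-observation mean across detector channels,
# standing in for a physical quantity the experiment aims to measure.
def qoi(x):
    return x.mean(axis=1)

# Reduce via truncated SVD and compare the two error criteria.
U, s, Vt = np.linalg.svd(data, full_matrices=False)
for rank in (1, 5, 20):
    reduced = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    recon_err = np.linalg.norm(data - reduced) / np.linalg.norm(data)
    qoi_err = np.linalg.norm(qoi(data) - qoi(reduced)) / np.linalg.norm(qoi(data))
    print(f"rank={rank:2d}  reconstruction error={recon_err:.3f}  QoI error={qoi_err:.3f}")

# Even an aggressive (rank-1) reduction can leave the QoI nearly intact
# while the generic reconstruction error is still substantial.
```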

“Our idea is to not only significantly reduce the amount of data but to ultimately preserve the goals for which the data is intended to serve,” Yoon said. “That’s why we call it the objective-based data reduction for scientific workflows. We want to reduce the amount of data but not sacrifice the quantities or qualities of interest.”

One of the first steps Yoon's team will take toward this goal is to use an information-theoretic approach to find a compact representation of the data by exploiting its semantics and invariances. The team will also quantify how data reduction affects the final scientific goals, and will use that assessment to jointly optimize the models that make up general scientific workflows.
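One hedged illustration of how an invariance can buy compression: if a quantity of interest is unchanged by in-plane rotation, as a radially averaged scattering profile is, a full 2D detector frame can be collapsed into a 1D radial profile. The frame size and binning below are assumptions for illustration only, not the project's code.

```python
# Sketch of invariance-based reduction (assumed frame size and binning):
# when the QoI is rotation-invariant about the image center, storing a
# radial profile loses nothing that the QoI depends on.
import numpy as np

def radial_profile(image):
    """Average pixel intensities in annular bins around the image center."""
    ny, nx = image.shape
    y, x = np.indices((ny, nx))
    r = np.hypot(x - nx / 2, y - ny / 2).astype(int)
    sums = np.bincount(r.ravel(), weights=image.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)

rng = np.random.default_rng(1)
image = rng.poisson(5.0, size=(1024, 1024)).astype(float)  # mock detector frame

profile = radial_profile(image)
print(f"original values: {image.size:,}  reduced values: {profile.size:,}")
# Over 1,000x fewer numbers, yet any rotation-invariant QoI computed
# from the radial profile is preserved by construction.
```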

An example of how data can become unmanageably large is cryogenic electron microscopy (cryo-EM), a method widely used for molecular structure analysis. A typical cryo-EM dataset consists of thousands of micrographs containing projection images of molecules in various orientations, and can total several terabytes. Another example comes from X-ray scattering experiments, which are routinely performed to analyze material structure. When run in mapping mode, where X-ray exposures are taken across a sample's cross-section, a single scattering map is a four-dimensional dataset that may contain around 10 billion values.
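For a sense of scale, the figures above can be reproduced with simple arithmetic. The detector dimensions, frame counts, and bit depths in this sketch are plausible assumptions chosen to land in the stated ranges, not specifications from any particular facility.

```python
# Back-of-the-envelope arithmetic for the dataset sizes quoted above.
# All dimensions and bit depths are illustrative assumptions.

GB = 1024**3

# Cryo-EM: thousands of micrographs, typically recorded as multi-frame
# movies on a detector of roughly 4k x 4k pixels.
micrographs = 3_000
frames = 40
pixels = 4096 * 4096
bytes_per_pixel = 2  # e.g., 16-bit counts
cryoem_tb = micrographs * frames * pixels * bytes_per_pixel / GB / 1024
print(f"cryo-EM dataset: ~{cryoem_tb:.1f} TB")

# X-ray scattering in mapping mode: a 2D grid of exposures, each a 2D
# detector image, forming a 4D dataset. A 100 x 100 scan with a
# 1000 x 1000 detector already holds 10 billion values.
values = 100 * 100 * 1000 * 1000
print(f"scattering map: {values:.1e} values, ~{values * 4 / GB:.0f} GB at 32 bits")
```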

“The thing that I’m most excited about is probably for the first time we are looking into this data reduction problem from an objective-based perspective, which I believe may not have been done by others,” Yoon said. “We are proposing a metric that can be used for objective-based quantification of the impact of data reduction, and then optimizing the data reduction pipeline by using this metric so that we can preserve the usability of the data to support the final goal. The ultimate performance that we can bring by applying this idea to our data reduction is also very exciting.”

The mission of the ASCR program is to discover, develop and deploy computational and networking capability to analyze, model, simulate and predict complex phenomena important to the DOE and the advancement of science.

In addition to Yoon, co-principal investigators are Edward Dougherty and Xiaoning Qian from the electrical and computer engineering department at Texas A&M. This project also involves collaborators at Brookhaven National Lab and the University of Illinois at Urbana-Champaign.

