Alumni Project
The Scientific Data Management Center
PI: Arie Shoshani (LBNL)
Coordinating PIs: Bill Gropp (ANL), Arie Shoshani (LBNL), Terence Critchlow
(LLNL), Thomas Potok (ORNL), Calton Pu (Georgia Tech), Mladen Vouk
(NCSU), Alok Choudhary (NWU), Reagan Moore (SDSC)
Area Leaders: Rob Ross (ANL), Doron Rotem (LBNL),
Terence Critchlow (LLNL), Nagiza Samatova (ORNL)
(http://sdmcenter.lbl.gov)
Summary
The Scientific Data Management Center focuses on the application of known and emerging data management technologies to scientific applications. The Center’s goal is to deploy software-based solutions for the efficient and effective management of the large volumes of data generated by scientific simulations and their analysis. Our purpose is not only to achieve efficient storage of and access to the data using specialized indexing, compression, and parallel technology, but also to make more effective use of the scientist’s time by eliminating unproductive simulations, by providing specialized data mining techniques, and by automating time-consuming tasks. Our approach is to work closely with application scientists in various domains on specific problems that will enhance their ability to achieve new scientific insights.
Increases in computational power have created the opportunity for new,
more precise and complex scientific simulations leading to new scientific
insights. However, the improved simulations have increased the amount
and complexity of the data they, and subsequent analyses, generate.
Our Center applies newly emerging data management technologies to these
problems, targeting the needs of specific SciDAC-related projects in the
domains of Astrophysics, Molecular Biology, High Energy Physics, Climate
Modeling, and Combustion. The collaborations are focused and productive.
We highlight some of the results to date.
The work in the Astrophysics domain is based on a collaboration with the SciDAC Terascale Supernova Initiative (TSI), specifically to deal with their large volume of data. One approach was to apply new block-based Principal Component Analysis techniques, which achieved 30-fold compression of 3D data with 99% accuracy (total variability retained). Subsequent analyses deal only with the reduced dataset. A second approach developed a "run and render" system that applies various analysis techniques to partial simulation results and visualizes them. As a result, if the simulation seems to be heading in the wrong direction, it can be corrected early, saving a tremendous amount of unnecessary computation. We are also applying agent technology to partition a large dataset and deploy data agents to handle each chunk of the dataset. The data agents use compression and reduction techniques to respond to requests for data at a selectable level of detail.
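The block-based compression idea can be illustrated with a minimal sketch. It is not the Center's implementation: it assumes NumPy, assumes that "block-based" means cutting the 3D array into non-overlapping cubes and running a single PCA across them, and uses hypothetical names (`block_pca_compress`, `block_pca_decompress`, `stored_values`) and a synthetic field. The number of components is chosen so that the requested fraction of total variability is retained.

```python
import numpy as np


def block_pca_compress(volume, block=4, variance_kept=0.99):
    """Compress a 3D array by PCA over its non-overlapping block^3 cubes."""
    nx, ny, nz = volume.shape
    if nx % block or ny % block or nz % block:
        raise ValueError("each dimension must be a multiple of the block size")
    # One row per cube, one column per position inside the cube.
    cubes = (volume.reshape(nx // block, block, ny // block, block, nz // block, block)
             .transpose(0, 2, 4, 1, 3, 5)
             .reshape(-1, block ** 3))
    mean = cubes.mean(axis=0)
    centered = cubes - mean
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Keep the smallest number of components reaching the variance target.
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(explained, variance_kept)) + 1, len(s))
    components = vt[:k]
    return {"shape": volume.shape, "block": block, "mean": mean,
            "components": components, "scores": centered @ components.T}


def block_pca_decompress(packed):
    """Rebuild the (approximate) 3D array from the stored scores and components."""
    nx, ny, nz = packed["shape"]
    b = packed["block"]
    cubes = packed["scores"] @ packed["components"] + packed["mean"]
    return (cubes.reshape(nx // b, ny // b, nz // b, b, b, b)
            .transpose(0, 3, 1, 4, 2, 5)
            .reshape(nx, ny, nz))


def stored_values(packed):
    """Count the floating-point values the compressed form has to keep."""
    return packed["scores"].size + packed["components"].size + packed["mean"].size


if __name__ == "__main__":
    # Synthetic smooth field with a little noise (illustrative data only).
    rng = np.random.default_rng(0)
    axis = np.linspace(0.0, 1.0, 32)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    field = np.sin(4 * x) * np.cos(3 * y) + z ** 2 + 0.01 * rng.normal(size=x.shape)
    packed = block_pca_compress(field)
    print("components kept:", packed["components"].shape[0])
    print("compression ratio:", field.size / stored_values(packed))
```

The ratio obtained this way depends entirely on how smooth the data are and on the block size, so the sketch should not be expected to reproduce the 30-fold figure reported above.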
The NetCDF data format is popular in the Astrophysics and Climate Modeling domains. Prior to our Center’s work, NetCDF data were written in a single stream, choking the processing of large parallel simulations. Our Center is developing a parallelized version of NetCDF by building a software layer over the well-known MPI-IO interface. This approach achieved over 100 megabytes/sec, a 10-fold improvement. Even more remarkable, the efficiency of reading subsets of the data from these files was greatly improved, and shown to be generally insensitive to the data layout. A related activity is the automatic analysis of access patterns and extraction of hints to be used by the parallel I/O system.
In the Climate Modeling domain, as in many other scientific domains, the problem of interpreting the data is paramount. One activity of the Center is applying application-specific knowledge to enhance known data mining techniques. Specifically, we used Principal Component Analysis and Independent Component Analysis as two ways of reducing the dimensionality of climate data sets. These techniques were used to isolate the effects of El Niño signals from global temperatures, contributing to a better understanding of observed climate phenomena.
In the Molecular Biology arena, we have concentrated on the process of analyzing microarray data. This process currently requires numerous interactions with web-based systems that match the microarray sequences to known sequences. This task is so tedious that it takes several months to analyze each microarray. We have applied two technologies to this problem: web-wrapping and workflow management. By doing so we have automated the process and allowed the scientist to discover and report early findings.
The work with High Energy Physics data is based on a collaboration with the Particle Physics Data Grid (PPDG). In their analyses, several billion objects must be searched to identify a desired subset. We have developed and applied a specialized indexing method (called bitmap indexing) that can find the points of interest 10-100 times faster than a simple search. We also developed efficient cache management policies to optimize file sharing. These techniques allow a new dynamic analysis methodology that accelerates the discovery process.
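To make the bitmap indexing idea concrete, here is a minimal sketch, not the Center's implementation. It assumes a binned index with one bitmap per value range, stored as a plain Python integer; production bitmap indexes typically compress the bitmaps, which is not shown. The class name `BinnedBitmapIndex`, the helper `set_rows`, and the sample attributes are hypothetical.

```python
from bisect import bisect_right


def set_rows(bitmap):
    """Yield the row numbers whose bit is set in an integer bitmap."""
    row = 0
    while bitmap:
        if bitmap & 1:
            yield row
        bitmap >>= 1
        row += 1


class BinnedBitmapIndex:
    """One bitmap per bin; bin i covers edges[i] <= value < edges[i + 1]."""

    def __init__(self, values, edges):
        self.values = list(values)
        self.edges = list(edges)  # must be sorted in ascending order
        self.bitmaps = [0] * (len(self.edges) - 1)
        for row, value in enumerate(self.values):
            if not self.edges[0] <= value < self.edges[-1]:
                raise ValueError(f"value {value} lies outside the bin edges")
            self.bitmaps[bisect_right(self.edges, value) - 1] |= 1 << row

    def range_query(self, lo, hi):
        """Return a bitmap of the rows with lo <= value < hi."""
        hits = 0
        for i, bitmap in enumerate(self.bitmaps):
            bin_lo, bin_hi = self.edges[i], self.edges[i + 1]
            if bin_hi <= lo or bin_lo >= hi:
                continue  # bin is disjoint from the query range
            if lo <= bin_lo and bin_hi <= hi:
                hits |= bitmap  # bin fully covered: no raw values are read
            else:
                # Boundary bin: only its candidate rows are checked.
                for row in set_rows(bitmap):
                    if lo <= self.values[row] < hi:
                        hits |= 1 << row
        return hits


# Illustrative data: two attributes of six hypothetical objects.
energy = [5.0, 12.5, 47.0, 33.3, 8.1, 21.0]
tracks = [2, 7, 9, 4, 6, 8]
energy_index = BinnedBitmapIndex(energy, [0, 10, 20, 30, 40, 50])
track_index = BinnedBitmapIndex(tracks, [0, 5, 10])

# A multi-attribute condition is a bitwise AND of per-attribute answers.
selected = energy_index.range_query(12, 34) & track_index.range_query(5, 10)
print(list(set_rows(selected)))
```

The speedup comes from answering most of a query with bitwise operations on precomputed bitmaps, so the raw values are touched only for the bins that straddle a query boundary.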
In the area of Combustion, finding regions of interest in large simulations is an important capability. Applying a simple search method over the variable values in the entire space is too slow to be practical. Applied in this domain, the bitmap indexing technology finds points of interest 10-100 times faster than simple search. Combined with a new bitmap "region growing" algorithm, this method can be used to find regions of interest over a billion points in real time.
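The step from points of interest to regions of interest can be sketched as follows. This is an illustration under stated assumptions rather than the Center's algorithm: the query result is assumed to be a bitmap over a row-major 2D grid, cells are joined when they are 4-connected neighbors, and the names `threshold_bitmap` and `grow_regions` are hypothetical.

```python
from collections import deque


def threshold_bitmap(field, threshold):
    """Build a bitmap with one bit per cell of a 2D grid where value > threshold."""
    bitmap = 0
    ny = len(field[0])
    for x, row in enumerate(field):
        for y, value in enumerate(row):
            if value > threshold:
                bitmap |= 1 << (x * ny + y)
    return bitmap


def grow_regions(hits, nx, ny):
    """Group the set bits of `hits` into 4-connected regions of cell numbers."""
    remaining = hits
    regions = []
    while remaining:
        # Seed a new region at the lowest cell number not yet assigned.
        seed = (remaining & -remaining).bit_length() - 1
        remaining &= ~(1 << seed)
        region, frontier = [seed], deque([seed])
        while frontier:
            x, y = divmod(frontier.popleft(), ny)
            for px, py in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
                if 0 <= px < nx and 0 <= py < ny:
                    cell = px * ny + py
                    if (remaining >> cell) & 1:
                        remaining &= ~(1 << cell)  # claim the neighbor
                        region.append(cell)
                        frontier.append(cell)
        regions.append(sorted(region))
    return regions


# Illustrative 4 x 4 temperature field with two hot regions.
temperature = [
    [900, 950, 300, 310],
    [320, 910, 305, 300],
    [310, 300, 315, 980],
    [305, 310, 940, 990],
]
hot = threshold_bitmap(temperature, 800)
print(grow_regions(hot, 4, 4))
```

Because the grouping works on the bitmap alone, the variable values are read once, when the bitmap is built, and never again during region growing.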
The Center enjoys a dedicated facility that permits early experimentation with new technology. The facility is being used to develop and test parallel computations and efficient storage techniques, including grid applications.
These early successes demonstrate the value of having computer scientists work closely with application scientists in managing scientific data. The Scientific Data Management Center will continue to apply techniques that have proven effective to additional scientific problems in new domains. Our experience has shown that working closely with application scientists opens new opportunities to improve scientific exploration over vast and diverse datasets.
For further information contact:
Dr. Arie Shoshani
Computing Sciences Division
Lawrence Berkeley National Laboratory
Tel: (510) 486-5171
Email: shoshani@lbl.gov