Alumni Project

Efficient Processing and Access of Very Large Datasets

Ekow Otoo, Doron Rotem, Arie Shoshani, John Wu (LBNL)
Randy Burris, Dan Million, Tom Potok and Joel Reed, Frederick Sheldon (ORNL)
Collaborators: Wendy Koegler, Jacqueline Chen (SNL-Livermore)

Summary

Large-scale simulations already produce several terabytes of data per simulation run, and are projected to reach hundreds of terabytes or even petabytes as computing power increases. This project within the Scientific Data Management Center addresses the problem of accessing, processing, and analyzing such enormous quantities of data efficiently and effectively. The technology used in this project includes efficient indexing techniques, cache management for file sharing, improved I/O and agent-based parallel processing of very large datasets. These technologies have been applied to three SciDAC application domains: High Energy Physics, Combustion, and Astrophysics, and achieved a high degree of efficiency.

Access to data should be simple, it should not require unnecessary operations and it should be efficient. In this project we address all three attributes. Since robotic tape storage is now required to store the massive quantities of data developed in the scientific domains we address, each of our activities relates in some way to tape storage, and to the efficient processing of the data once we stage them to disk storage.

In one approach to improved efficiency we developed an efficient index useful for billions of objects. The index allows quick identification of a subset of files to be retrieved, eliminating unnecessary file access. Indexing over properties of billions of objects can be quite slow using conventional indexing methods. Our indexing method represents data in compressed bitmaps that can be rapidly evaluated. By devising a new word-aligned compression scheme, labeled WAH, we achieved a ten-fold speedup relative other well-known indexing techniques (Figure 1).

figure 1

Figure 1: Query processing performance.

In a second application of bitmap indexing technology we used combustion data. When analyzing numerical simulations of the auto-ignition process, scientists are interested in identifying and tracking flame fronts. Using the bitmap index software and a bitmap-based region-growing algorithm, tests on real data show a ten-fold speedup in finding flame fronts compared to previous methods (Figure 2). The technique scales linearly.

figure 2

Figure 2: Flame fronts

In another efficiency-improving activity we designed an efficient disk cache algorithm, maximizing file sharing and minimizing retrievals from tape. The algorithm, when applied to a cache containing N files, determines which file to replace in O(log2N) time. In simulated workloads our method is superior to other methods. We will test these algorithms with real workloads in a dedicated part of the SDM Center facility at ORNL that includes HPSS, the mass storage system commonly used by the High Energy Physics community. (The facility is also being heavily used by the data-analysis portion of the SDM ISIC.)

We are also targeting improved file operations. Accomplishments include a mechanism for simple transfers between HPSS installations, improved HPSS I/O performance and improved wide-area network transfer performance. The first two are now in production at various HPSS installations. Current efforts focus on implementing wide-area network transfer enhancements in production and on forming a bridge between HPSS and PVFS, the parallel virtual file system being enhanced in another activity of the SDM Center.

When handling massive quantities of data, parallel processing is often possible, depending on the scientific problem being addressed. However, partitioning a problem into parallel components is a difficult task and a burden to the scientist. To address this problem, we use software agents in parallel to process, reduce and visualize large amounts of data. This agent-based technology was applied to astrophysics data provided by the Terascale Supernova Initiative (TSI) project. The software agents, operating in parallel on a large dataset, are able to quickly analyze and visualize that dataset.

The system architecture contains multiple agents, each of which has one of three basic tasks. A single data controller agent studies the dataset, partitions it and deploys a data agent to handle each partition. The data agents apply compression and reduction techniques at a selectable level of detail. A movie producer agent receives data partitions from a data agent and renders them into a set of ../images with selectable levels of detail.
Figure 3 shows the result of applying the system to 100 time steps of the supernova simulation; each time step is stored in a 130-megabyte file. The controller agent deployed 800 data agents across the 100 data files.

figure 3

Figure 3: Agent-based parallel processing and visualization of a supernova simulation

Our activities address a variety of ways to improve the ability of the scientist to select and access important data from massive datasets, and to do so efficiently and rapidly. Scientific discovery will be the beneficiary.

For further information contact:
Dr. Arie Shoshani
Lawrence Berkeley National Laboratory
Tel: (510) 486-5171
Email: shoshani@lbl.gov

S.D.M. Center

back to project page

 


Home  |  ASCR  |  Contact Us  |  DOE disclaimer