Defining the Agenda for “Big Data” Benchmarking
Published May 16, 2012
The move to data-driven science and decision-making is creating the need for comprehensive benchmarking of ‘big data’ applications, as well as of price/performance across the board, according to attendees at a recent workshop organized by the San Diego Supercomputer Center (SDSC) at the University of California, San Diego.
Big data applications are characterized by the need to provide timely analytics while dealing with large data volumes, high data rates, and a wide range of data sources. Many of those datasets are so voluminous that most conventional computers and software cannot effectively process them.
The Workshop on Big Data Benchmarking (WBDB2012), organized by SDSC’s Center for Large-scale Data Systems research (CLDS), was held May 8-9 in San Jose, Calif. The workshop, supported by the National Science Foundation (NSF), was quickly oversubscribed, drawing about 60 experts representing more than 45 institutions from industry and academia.
The meeting brought together experts in large-scale data, parallel database systems, benchmarking and system performance, cloud storage and computing, and related areas. Sponsors for the event, in addition to the NSF, included networking and storage solutions companies Seagate, Greenplum, Brocade, and Mellanox.
Google Senior Staff Engineer Jerry Zhao presents at WBDB2012. Image: Ron Hawkins, SDSC.
“This is an important first step toward the development of a set of benchmarks that will be needed to move ahead in a unified fashion when it comes to big data applications,” said Chaitan Baru, head of the program committee for WBDB2012. Baru is an SDSC Distinguished Scientist and director of CLDS.
“Data is emerging as the key differentiator for creating value for enterprises,” said Milind Bhandarkar, chief scientist of Greenplum, a corporate sponsor of CLDS. “The storage scalability, along with processing flexibility offered by various big data platforms, has introduced a large variety of use cases. Choosing an appropriate data platform that meets the performance and scalability needs of an organization is not a trivial problem. Therefore, having an industry-standard suite of benchmarks that represent real workloads is most important. WBDB has taken an important first step of assembling a gathering of experts.”
One key topic of discussion was the range and types of big data applications and their associated data. In a recent program solicitation, the NSF described big data as “large, diverse, complex, longitudinal, and/or distributed datasets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future.”
SDSC’s Chaitan Baru presents at WBDB2012. Image: Ron Hawkins, SDSC.
“The move to data-driven science and decision-making has made the big data issue critical to all areas of science as well as enterprise applications,” said SDSC’s Baru. “Big data and data-intensive research is here to stay, and our ultimate goal is to provide clear and objective information to help characterize and understand hardware and system performance, as well as price and performance across the board, including computing, network connectivity, and data storage systems.”
WBDB2012 was hosted by Brocade at its Executive Briefing Center. “This was an excellent forum to begin to determine big data metrics, and we were honored to sponsor the workshop,” said Scott Pearson, director of big data solutions and a member of the big data strategy team at Brocade, a leading vendor of networking technologies.
SDSC’s second big data benchmarking workshop will be held in December 2012 in Pune, India, hosted by Persistent Systems. The community is also beginning to organize itself via regular phone conferences. For further information about the workshop and the regular meetings, visit the CLDS website (http://clds.sdsc.edu).
As an Organized Research Unit of UC San Diego, SDSC works with industry and government, as well as academia. Industry researchers and representatives interested in learning more about SDSC’s resources and expertise should contact Ron Hawkins at rhawkins@sdsc.edu or 858 534-5045.
About SDSC
The San Diego Supercomputer Center (SDSC) at the University of California, San Diego, is considered a leader in data-intensive computing and all aspects of ‘big data’, which includes data integration, performance modeling, data mining, software development, workflow automation, and more. SDSC supports hundreds of multidisciplinary programs spanning a wide variety of domains, from earth sciences and biology to astrophysics, bioinformatics, and health IT. With its two newest supercomputer systems, Trestles and Gordon, SDSC is a partner in XSEDE (Extreme Science and Engineering Discovery Environment), the most advanced collection of integrated digital resources and services in the world.
Available for comment:
Chaitan Baru
858 534-5082 or baru@sdsc.edu
Media Contacts:
Jan Zverina, SDSC Communications
858 534-5111 or jzverina@sdsc.edu
Warren R. Froelich, SDSC Communications
858 822-3622 or froelich@sdsc.edu
Related Links
UC San Diego: http://www.ucsd.edu/
Center for Large-scale Data Systems Research: http://clds.sdsc.edu/
National Science Foundation: http://www.nsf.gov/