Illustration: "Information Explosion," by Michael Morgenstern

Finding a Needle in an Information Explosion

Computerized databases, genomics, high-resolution medical imaging and other scientific advances have created new research and treatment opportunities, but also one big challenge—how to extract meaningful information from massive amounts of data.

Fifty years ago, researchers had to deal with filing cabinets full of paper forms. Now they routinely deal with digital databases containing terabytes of data. (A terabyte is 1,000 gigabytes of information; a standard single-layer DVD holds about 4.7 gigabytes.) That much electronic information creates problems both for computer scientists and for statisticians.

“Our group works on six or seven very large data sets, and the problem is that the size of these data sets increases exponentially over time. We now routinely see 1 terabyte, 10 terabytes, 100 terabytes, even one petabyte,” says biostatistician Ciprian Crainiceanu.

“This challenges our ability even to manage this data,” says biostatistician Karen Bandeen-Roche. “Novel computing strategies and novel statistical strategies are needed.”

Researchers working in genomics and proteomics have to track tens of thousands of data points per individual. Medical scans generate hundreds of thousands of pixels of data. And as electronic medical databases are compiled and networked, huge sets of medical records are becoming available for research.

This much data can overwhelm standard statistical packages, which weren’t designed to deal with such enormous datasets. It can also challenge a researcher looking for real relationships in the mass of data. “How does one tease out scientifically important effects from statistically significant effects?” asks Bandeen-Roche. Complicating the challenge is the genesis of the data collected: “Many large data resources are not collected with the scientific ends to which they may be applied explicitly in mind,” she says.
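One reason standard packages falter is that they typically assume the whole dataset fits in memory. A common workaround is to summarize data in a single streaming pass. The sketch below (illustrative only, not the researchers' actual tooling) uses Welford's one-pass algorithm to compute a mean and variance without ever holding the full dataset at once:

```python
# Illustrative sketch: one-pass ("streaming") mean and variance via
# Welford's algorithm. Each observation is seen once, so a dataset far
# larger than memory can be summarized as it is read from disk.

def streaming_mean_var(stream):
    """Return (count, mean, sample variance) after one pass over `stream`."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance

# In practice `stream` would be a generator reading records lazily,
# e.g. (float(line) for line in open("huge_file.txt")).
n, mean, var = streaming_mean_var(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Because the running statistics are updated incrementally, memory use stays constant no matter how many terabytes stream past.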

Researchers have had to deal with big datasets for years in fields such as astronomy and genomics, says biostatistician Rafael Irizarry. But the issue of huge datasets has since become more general.

Statisticians and computer scientists are beginning to team up to tackle the problem, and they are even crossing disciplines. For example, predicting health outcomes from medical imaging datasets and predicting individual movie preferences based on a database of other viewers’ choices might have some techniques in common, says Irizarry.
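One technique the two problems can plausibly share is low-rank matrix factorization, which fills in missing entries of a large, sparsely observed matrix. The sketch below is a hypothetical illustration (the rows could be patients or movie viewers, the columns imaging features or films; all names and numbers are invented), fit with simple stochastic gradient descent:

```python
import random

# Hypothetical sketch: factor a sparsely observed matrix into low-rank
# row and column factors U, V so that dot(U[i], V[j]) approximates each
# observed cell. Unobserved cells can then be predicted the same way.

def factorize(entries, n_rows, n_cols, rank=2, lr=0.05, epochs=500, seed=0):
    """Fit factors by stochastic gradient descent on (row, col, value) triples."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for i, j, value in entries:
            pred = sum(U[i][k] * V[j][k] for k in range(rank))
            err = value - pred
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                U[i][k] += lr * err * v  # nudge factors toward the observation
                V[j][k] += lr * err * u
    return U, V

# Observed (row, column, value) triples; the remaining cells are unknown.
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 2.0)]
U, V = factorize(observed, n_rows=3, n_cols=3)
```

The same fitting loop is indifferent to whether a cell holds a movie rating or a clinical measurement, which is the kind of cross-domain overlap Irizarry describes.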

One thing’s for sure, says Crainiceanu: More researchers who specialize in such problems are needed. “There are very few people who think from the perspective of analyzing large datasets,” he says. “[There] is a lack of well-qualified people in this area.”

