Big Data - Overload
Michael Gibbs

Movelets represent short bursts of motion data, roughly analogous to the phonemes that make up words. Breaking down the voltage readouts into movelets made manageable what would otherwise have been an ocean of data. “We sample the accelerometer data 10 times per second, so for three axes we’re gathering on the order of 30 observations per second,” says Crainiceanu. “And let’s say we want to monitor hundreds or thousands of people for a week, or a month, with their data continually being uploaded via the Web, for example.” His team’s movement-recognition algorithm can essentially crunch all these data—terabytes’ worth, for a large study—into relatively compact histories of distinct motions (now sitting … now getting up … now walking …), just as a speech-recognition algorithm can condense a storage-hogging raw audio recording into a few pages of text.
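The movelet idea can be sketched in a few lines of code: slice the three-axis accelerometer stream into short windows and label each window by its nearest match in a small dictionary of reference movements. The window length, the dictionary entries, and the distance measure below are illustrative assumptions, not the team's actual algorithm.

import numpy as np

# Assumed settings: 10 Hz sampling, 3 axes, one-second "movelets".
SAMPLE_RATE_HZ = 10
WINDOW = SAMPLE_RATE_HZ  # one movelet = 10 samples x 3 axes

def movelet_labels(signal, dictionary):
    """Label each one-second window of an (n_samples, 3) accelerometer
    stream with the name of its closest reference movelet."""
    labels = []
    for start in range(0, len(signal) - WINDOW + 1, WINDOW):
        window = signal[start:start + WINDOW]  # a (10, 3) chunk of readings
        # Nearest-neighbor match under Euclidean distance (an assumption).
        best = min(dictionary,
                   key=lambda name: np.linalg.norm(window - dictionary[name]))
        labels.append(best)
    return labels

# Toy dictionary: one reference movelet per activity (made-up numbers).
rng = np.random.default_rng(0)
dictionary = {
    "sitting": np.zeros((WINDOW, 3)),
    "walking": rng.normal(0, 1, (WINDOW, 3)),
}

# One minute of fake data compresses to 60 activity labels, the same kind
# of reduction that turns terabytes of raw voltages into a compact history
# of motions.
stream = rng.normal(0, 1, (60 * SAMPLE_RATE_HZ, 3))
print(movelet_labels(stream, dictionary)[:5])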

Crainiceanu’s colleague Rafael Irizarry, PhD, a professor in Biostatistics, faces a similar challenge when he helps biologists sift through gene-sequencing data. “Modern gene sequencing technology is now generating such enormous datasets that biologists are having a hard time saving them on disks; NIH has even been having meetings with experts in the field to figure out how we’re going to store all these data, or whether it would be more cost-effective just to generate them again whenever we need them.”

Genomic datasets also can be devilishly hard to analyze. Modern sequencing devices typically generate raw data that represent the color and intensity of fluorescent reporter molecules linked to short stretches of DNA; these intensity levels have to be translated into “reads” of the GATC genetic code. Each of these short, not necessarily error-free readouts of DNA must then be pattern-matched to the right location on a three-billion-base-pair reference genome—a bit like finding the right spot for a tiny piece in a football-field-sized jigsaw puzzle.
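To see why that matching step is computationally demanding, here is a toy seed-and-index sketch: build a lookup table of every short substring (k-mer) in the reference once, then use the first few bases of each read to find candidate positions and verify the full read there. The k-mer length, the exact-match rule, and the tiny reference are simplifying assumptions; real aligners tolerate sequencing errors and use far more memory-efficient indexes.

from collections import defaultdict

def build_kmer_index(reference, k=8):
    """Map every length-k substring of the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align_read(read, reference, index, k=8):
    """Return positions where the read matches the reference exactly,
    using its first k bases as a seed (a deliberately naive scheme)."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits

# Tiny made-up reference; the human genome is roughly 3 billion bases.
reference = "ACGTACGTTAGCCGATCGATCGGGTACCATGCA"
index = build_kmer_index(reference)
print(align_read("GATCGATCGGGT", reference, index))  # prints [13]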

“When I first got one of these datasets,” Irizarry says, “I wrote my own little software routine to handle it and I ran it and waited … and then realized that it was going to take six months to finish!” Irizarry soon hired a computer scientist, Ben Langmead, MS, who has expertise in solving this kind of problem quickly. Their group, working with Johns Hopkins Medicine geneticist Andrew Feinberg, MD, MPH ’81, has since been putting out a steady stream of high-profile papers on the genetics and epigenetics of tumor cells. (Epigenetics refers to reversible DNA modifications that silence some genes and let others be active; derangements of the normal epigenetic patterns in cells may be as important as genetic mutations in promoting cancers.)

And then there is the uncertain value of some ultra-large datasets. “They often come with lots of complications and biases that don’t exist in smaller datasets,” says Scott L. Zeger, PhD, the former chair of Biostatistics who is now the University’s vice provost for Research. “A large observational study could be much less informative about the effects of a treatment than a smaller dataset from a placebo-controlled clinical trial, for example,” he says. Even among clinical trials, he adds, the traditional single-center study tends to be less noisy than the multi-center studies that are increasingly the norm in many areas of health research.
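Zeger’s point lends itself to a quick back-of-envelope simulation. The numbers below are purely assumed (a true effect of 1.0, a confounding bias of 0.5): the huge observational comparison delivers a precise but wrong answer, while the small randomized trial is noisier yet centered on the truth.

import numpy as np

rng = np.random.default_rng(42)
TRUE_EFFECT = 1.0   # assumed true treatment benefit
BIAS = 0.5          # assumed confounding in the observational comparison
SIGMA = 2.0         # assumed outcome standard deviation

def estimate(n, bias):
    """Difference in mean outcomes, treated minus control, for one study."""
    treated = rng.normal(TRUE_EFFECT + bias, SIGMA, n)
    control = rng.normal(0.0, SIGMA, n)
    return treated.mean() - control.mean()

# A 100,000-person observational study vs. a 200-person randomized trial,
# each repeated 500 times to see where the estimates land.
big_obs = [estimate(100_000, BIAS) for _ in range(500)]
small_rct = [estimate(200, 0.0) for _ in range(500)]

print(f"true effect: {TRUE_EFFECT}")
print(f"large observational study: mean {np.mean(big_obs):.2f}, spread {np.std(big_obs):.2f}")
print(f"small randomized trial:    mean {np.mean(small_rct):.2f}, spread {np.std(small_rct):.2f}")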
