The genetic data posted online seemed perfectly anonymous — strings of billions of DNA letters from more than 1,000 people. But all it took was some clever Web sleuthing for a genetics researcher to identify five people he randomly selected from the study group.
Not only that, he found their entire families, even though the relatives had no part in the study. In all, he identified nearly 50 people.
The researcher didn't reveal the people's names, but the exercise illustrates the difficulty of protecting volunteers' privacy in medical research when the genetic information they provide must be public so scientists can use it.
Other reports have identified people whose genetic data were online, but none had done so using such limited information: the long strings of DNA letters, an age and a state of residence.
The exercise was reported this week in the journal Science.
"I've been worried about this for a long time," said Barbara Koenig, a researcher at the University of California at San Francisco, who studies issues involving genetic data. "We always should be operating on the assumption that this is possible."
The data are from the 1000 Genomes Project, which is collecting genetic information from people worldwide and posting it online so researchers can use it freely. Complete family trees were deduced using just the participants' ages, their states, a genealogy website and Google searches.
Though the methods for extracting relevant genetic data from the raw genetic sequence files were specialized enough to be beyond the reach of most laypeople, no one expected it to be so easy to zoom in on individuals.
"We are in what I call an awareness moment," said Eric D. Green, director of the National Human Genome Research Institute at the National Institutes of Health.
The project was the brainchild of Yaniv Erlich, a researcher at the Whitehead Institute, which is affiliated with the Massachusetts Institute of Technology.
He says he is a strong advocate of data sharing and would hate to see genomic data locked up. But when his lab developed a new technique, he realized he had the tools to probe a DNA database. And he could not resist trying.
The tool helped him find a type of DNA pattern that looks like a stutter among the billions of chemical letters in DNA: a short motif repeated back to back. These patterns, called short tandem repeats, are inherited.
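To make the "stutter" concrete, here is a minimal Python sketch that scans a string of DNA letters for a short motif repeated back to back. It is an illustration only, not the lab's actual tool: the function name `find_tandem_repeats`, its parameters and the toy sequence are all assumptions made for this example, and real sequencing data would call for far more careful handling.

```python
import re

def find_tandem_repeats(sequence, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for each short motif repeated back to back.

    Hypothetical helper for illustration; not the researchers' method.
    """
    # Capture a unit of min_unit..max_unit letters (shortest first), then
    # require at least (min_copies - 1) further copies immediately after it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    repeats = []
    for match in pattern.finditer(sequence):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        repeats.append((match.start(), unit, copies))
    return repeats

# Toy sequence (made up): the motif GATA repeated five times between flanks.
toy_sequence = "TTGAC" + "GATA" * 5 + "CCGT"
print(find_tandem_repeats(toy_sequence))
```

On the toy sequence, the sketch reports one repeat: the motif GATA, starting at position 5, copied five times. The number of copies at each such site is the kind of inherited value the article describes.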
Genealogy websites use repeats on the Y chromosome, the one unique to men, to identify men by their surnames, an indicator of ancestry. Any man can submit the short tandem repeats on his Y chromosome and find the surname of men with the same DNA pattern. The sites enable men to find ancestors and relatives.
So, Erlich asked, could he take a man's entire DNA sequence, pick out the short tandem repeats on his Y chromosome, search a genealogy site, discover the man's surname and fully identify him?
He tested it with the genome of Craig Venter, a DNA sequencing pioneer who posted his own DNA sequence on the Web. He knew Venter's age and state of residence. Bingo! Two men popped up in the database: One was Craig Venter.
"Out of 300 million people in the United States, we got it down to two people," Erlich said.
He and his colleagues calculated that they would be able to identify, from just DNA sequences, the last names of about 12 percent of middle-class and wealthier white men, the population that tends to submit DNA data to genealogical sites. Then, by combining the men's last names with their ages and states of residence, the researchers should be able to narrow the search to a few likely individuals.
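The narrowing step is easy to picture as three filters applied one after another. The Python sketch below is a toy illustration under stated assumptions: the `PublicRecord` shape, the `narrow_candidates` function and every record in it are invented for this example and do not come from the study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PublicRecord:
    # Hypothetical record shape, invented for this sketch.
    label: str
    surname: str
    age: int
    state: str

def narrow_candidates(records, surname, age, state):
    """Filter by surname, then state, then age; report survivors at each step."""
    by_surname = [r for r in records if r.surname == surname]
    by_state = [r for r in by_surname if r.state == state]
    by_age = [r for r in by_state if r.age == age]
    counts = {
        "surname": len(by_surname),
        "state": len(by_state),
        "age": len(by_age),
    }
    return by_age, counts

# Entirely made-up records with placeholder surnames.
records = [
    PublicRecord("person A", "Example", 66, "Utah"),
    PublicRecord("person B", "Example", 66, "Ohio"),
    PublicRecord("person C", "Example", 41, "Utah"),
    PublicRecord("person D", "Sample", 66, "Utah"),
]

matches, counts = narrow_candidates(records, "Example", 66, "Utah")
print(counts)
print([r.label for r in matches])
```

In this toy data, three records share the surname, two of those share the state, and only one also matches the age. None of the three details identifies anyone on its own; the combination does.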
DNA sequences from subjects in the 1000 Genomes Project are publicly available on the Web. Ages were included, and all the Americans lived in Utah, so the researchers knew their state.
Erlich began with one man from the database. He extracted the short tandem repeats from the man's Y chromosome, then searched genealogy databases for men with those same repeats. He got the surnames of the man's paternal and maternal grandfathers.
A Google search revealed the rest.
"Oh my God, we really did this," Erlich said.
He and his colleagues went on to get detailed family trees for others.
Amy L. McGuire, a lawyer and an ethicist at Baylor College of Medicine, called for more public discussion of the situation: "To have the illusion you can fully protect privacy or make data anonymous is no longer a sustainable position."