SAN DIEGO -- Suppose that in order to use the appliances in your kitchen, an electrical engineer needs to tear down the wall and splice in wiring for the appropriate voltage converter.
Then suppose you're running a restaurant out of that kitchen.
That, according to those in the know, is what it's like to work with life-science information today, where the absence of a common computer language for describing data makes finding biological information a dicey prospect.
Enter Sun, LabBook, TimeLogic, and others, to spearhead the genesis of a common language using XML.
Sun has recruited IBM (IBM), Millennium Pharmaceuticals (MLNM), Affymetrix (AFFX), Accenture and 35 others in an effort called I3C, which promises to accelerate genetic research by helping researchers obtain data using a standard language.
Without this language, the fruits of biotechnology have withered on the vine, said Jeff Augen, director of business strategy at IBM Life Sciences.
"There's been a promise in biotech that has been broken: personalized meds," Augen said, referring to medicine that can be tailor-made for individual patients. In fact, the ability to do this is considered one of the first practical advantages to emerge from the Human Genome Project.
In order to accomplish this, biotech researchers need to be able to easily search all existing gene sequence information. But the human genome contains a massive amount of data -- about 3 billion pairs of the four nucleotides that make up DNA -- and the lack of a common language makes a tough job that much tougher.
The sprawling Human Genome Project resulted in over 400 individual databases at companies and institutions around the world. Each contained sequences and information about the jobs that genes and proteins perform, but each stored them in its own format. Needless to say, that hampers researchers' ability to share their information.
Without some kind of uniform code, researchers have to write storage rules that are different every time, Augen said. "That's silly because then no one else could go get the data."
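To make Augen's point concrete, here is a minimal sketch of what a shared XML format buys: one reader that works on records from any source. The element and attribute names (`sequence`, `residues`, `id`, `organism`) and the two sample records are invented for illustration. They are not I3C's schema, which the consortium had yet to define.

```python
import xml.etree.ElementTree as ET

# Hypothetical shared format: these tag and attribute names are assumptions
# made up for this sketch, not part of any published standard.
RECORD_FROM_LAB_A = """<sequence id="geneA" organism="Homo sapiens">
  <residues>ATGGCC</residues>
</sequence>"""

RECORD_FROM_LAB_B = """<sequence id="geneB" organism="Mus musculus">
  <residues>ATGTTT</residues>
</sequence>"""


def read_record(xml_text):
    """Parse one record in the hypothetical shared format into a dict."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "organism": root.get("organism"),
        "residues": root.findtext("residues").strip(),
    }


# The same function reads both labs' records because both follow one format.
records = [read_record(text) for text in (RECORD_FROM_LAB_A, RECORD_FROM_LAB_B)]
for rec in records:
    print(rec["id"], rec["organism"], len(rec["residues"]))
```

If each lab instead invented its own layout, every new data source would need its own version of `read_record`, which is the situation Augen describes.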
Even small companies are required to work with increasing amounts of information.
"Small companies are coming to us all the time and the first thing they want to do is buy a teraflop-size supercomputer," Augen said.
Other attempts have been made at codifying the language, but Augen said there are so many of them -- 14 at last count -- that the only standard out there right now is chaos.
Rosetta Inpharmatics (RSTA), which was recently purchased by Merck, created the Gene Expression Markup Language, or GEML. It received an endorsement from Nature magazine, which seemed to herald its widespread acceptance. That acceptance never came.
So now the I3C effort has gathered 40 key players in the hope of cutting this Gordian knot.
"If you get enough horsepower behind a consortium then a set of standards becomes a standard," Augen said.
But that doesn't mean companies like Rosetta are excluded. In fact, Augen said he's trying to rally them to the cause. Several other companies have also written their own languages, and the consortium plans to use the best aspects of each.
"We're not a standards body where people pay to get certified. We're trying to drive a consensus," Augen said.
Incyte (INCY), which is part of the consortium, has created a framework called the Genomic Knowledge Platform and will contribute what it can to a standard language.
"We don't want to ignore the valuable work done in other consortia," said Tim Clark, vice president of informatics at Millennium Pharmaceuticals (MLNM).
Members of the consortium expect to complete a language for expression array data in 15 months.
Augen, for one, smells success.
"There's no shortage of scientific input," Augen said. "I'm confident we have an accurate view."