My DNA, a fabulously long chain of nucleotide bases, a copy of which resides in every one of my cells, contains a large percentage of the information required to produce me, Ben Goertzel.

This is an amazing thing, really.

Extract my DNA from any one of my cells, and feed it into a “human producing machine,” and out comes a clone of Ben Goertzel, lacking my knowledge and experience, but possessing all my physical and mental characteristics. Of course, we don’t have a human producing machine of this nature just yet, but the potential is there: DNA seems to encode most of the information required to produce a human being.

This is the glory and the romance underlying the Human Genome Project, a huge initiative launched in 1990, which aims to chart the whole human genome, to map every single base pair in the DNA of some sample of human beings. No one could doubt the excitement of this quest: It has a simplicity and grandeur similar to that of putting a man on the moon.

Once you get past the excitement and mystique and into the details, however, the Human Genome Project slowly begins to seem a little less tremendous. One realizes that the actual mapping of the genome is only a very small part of the task of understanding how people are made, and that, in fact, the design of the “human-producing machine” is a much bigger and more interesting job than the complete mapping of examples of the code that goes into the machine. In other words, embryology is probably a lot subtler than genetics, and in the end, much like putting a man on the moon, the Human Genome Project is a task whose scientific value is not quite equal to its cultural and psychological appeal.

But science moves fast these days…. As the excitement of having mapped the human genome fades into matter-of-fact acceptance, the genetics community is looking ahead to what’s called post-genomic biology. The next big challenge: figuring out how the genetic code actually does anything. How do these sequences of bases decode themselves, making use of other molecules in their environment to create organisms like us, or even simpler one-celled organisms? The completion of the Human Genome Project was one of those ends that was actually a beginning. It put us in a position where we’re finally able to start asking the really interesting questions.

This is a very exciting area of research – and a tremendously difficult one as well. As yet there are no tales of tremendous triumph – only some minor victories, a lot of hard work and furious innovation, and the tremendous promise of infinite victories to come. But the progress made so far has many lessons to teach – for example, regarding the remarkably tight interrelation between computer technology and biological research. At the end of the chapter I’ll briefly discuss some of the work my colleagues and I are now doing, applying advanced AI technology to the integrated analysis of various types of biological data, with a focus on genetics and proteomics.

The Human Genome Project was originally planned to last 15 years, but rapid technological advances have accelerated the expected completion date to 2003. The project’s goals are manifold: to identify all the genes in human DNA (estimated at the project’s outset to number more than 100,000), to determine the sequences of the 3 billion chemical base pairs that make up human DNA, to store this information in databases, and to develop tools for analyzing this huge amount of data. Some resources have also been devoted to exploring the ethical, legal, and social issues that may arise from the project.

Of course there were many milestones along the path to completion of the Human Genome Project. Befitting the accelerating pace of scientific progress, most of these occurred not long before the completion of the sequencing of the genome itself. For instance, I recall the day in mid-2000 when newspapers announced the mapping of Chromosomes 16 and 19 of the human genome. Human chromosome 19 contains about 2% of the human genome, including some 60 genes in a family involved in detoxifying and excreting chemicals foreign to the body. Chromosome 16 contains about 98 million bases, or some 3% of the human genome, including genes involved in several diseases, such as polycystic kidney disease (PKD), which affects about 5 million people worldwide and is the most common potentially fatal disease caused by a defect in a single gene.

Since that time more and more similar results have piled up. The initial rough map of the genome is being refined, bit by bit, using sophisticated “gene recognition” software that identifies sequences of base pairs representing genes, along with lots of good old biological intuition.

Clearly, these are major advances in gene mapping, with potential implications for helping remedy diseases. But -- what do they really mean?

An analogy may be instructive. Suppose a team of scientists goes to another planet, and discovers a lot of really long strips of paper lying around on the ground, each one with strange markings on it. Suppose they then notice some big steel machines, with slots that seem to be made to accept the strips of paper. After some experimentation, they figure out how the machines work: You feed the strip of paper in one end, and then after a few hours, the machine spits out a completely functional living organism. Amazing!

So, the scientists embark on a project to figure out what’s going on here. Of course, they have no idea what’s going on inside the machines, and all their efforts to bust the machines open meet with failure. So, instead, they devote themselves to completely recording all the markings on the strips of paper in their notebooks, hoping that eventually the patterns will come to mean something to them. When they achieve 10%, then 20%, then 50% completion of their task of recording these meaningless patterns in their notebooks, they declare themselves to have made significant scientific progress.

And occasionally, along the way, they make some small discoveries about the impact that the markings have on the organisms the machine produces. If you snip off the first 10% of the strip, the organism produced is more likely to be defective than if you snip off the last 10%. The region of the strip that’s 2000 to 3000 markings from the end seems to have something to do with the organism’s head: it differs greatly between organisms with very different heads, and so forth. But these kinds of general observations don’t really get them very far toward an understanding of what the amazing steel machines are actually doing.

If you’re somewhat familiar with computers, a variation on this analogy may be instructive. Consider a large computer program such as Microsoft Windows. This program is produced via a long series of steps. First, a team of programmers produces some program code, in a programming language (in the case of Microsoft Windows, the programming language is C++, with a small amount of assembly language added in). Then, a compiler acts on this program code, producing an executable file – the actual program that we run, and think of as Microsoft Windows. Just as with human beings, we have some code, and we have a complex entity created by the code, and the two are very different things. Mediating between the code and the product is a complex process – in the case of Windows, the C++ compiler; in the case of human beings, the whole embryological and epigenetic biochemical process, by which DNA grows into a human infant.

Now, imagine a “Windows Genome Project,” aimed at identifying every last bit and byte in the C++ source code of Microsoft Windows. Suppose the researchers involved in the Windows Genome Project managed to identify the entire source code, to within 99% accuracy. What would this mean for the science of Microsoft Windows?

Well, it could mean two different things.

1) If they knew how the C++ compiler worked, then they’d be home free! They’d know how to build Microsoft Windows!

2) On the other hand, what if they not only had no idea how to build a C++ compiler, but also had no idea what the utterances in the C++ programming language meant? In other words, they had mapped out the bits and bytes in the Windows Genome, the C++ source code of Windows, but it was all a bunch of gobbledygook to them. All they would have is a large number of files of C++ source code, each of which is a nonsense series of characters. Perhaps they would recognize some patterns: older versions of Windows tend to be different in lines 1000-1500 of this particular file. When file X is different between one Windows version and another, this other file tends to also be different between the two versions. This line of code seems to have some effect on how the system outputs information to the screen. Et cetera.

Our situation with the Human Genome Project is much more like Option 2 than it is like Option 1.

The scientists carrying out the Human Genome Project are much like the scientists in my first parable above, who are busily recording the information on the strips of paper they’ve found, but have no idea whatsoever what’s going on inside the magical steel machines that actually take in the strips of paper and produce the alien animals.

Moving beyond analogies, let’s talk briefly about a real project related to the Human Genome Project: the Fly Genome Project. In the 24 March 2000 issue of Science magazine, in a series of articles jointly authored by hundreds of scientists, technicians, and students from 20 public and private institutions in five countries, the almost-complete mapping of the genome of the fruit fly Drosophila melanogaster was announced. Hurray! Some other species of fly have also been similarly mapped.

The fruit fly Drosophila has a big history in genetics; its study has yielded a long series of fundamental discoveries, beginning with the proof, in 1916, that the genes are located on the chromosomes. Now all of its 13,601 individual genes have been enumerated.

This achievement may have some practical value. In a set of 289 human genes implicated in diseases, 177 are closely similar to fruit fly genes, including genes that play roles in cancers, in kidney, blood, and neurological diseases, and in metabolic and immune-system disorders.

But, my point is: OK, we have the fruit fly genome mapped, to within a reasonable degree of accuracy. Now what? Wouldn’t it be nice to understand the process by which this genome is turned into an actual fly?

The Human Genome Project includes under its umbrella a focus on data analysis. This refers mainly to designing and implementing computer programs that study the huge sequences of bases that biologists have recorded, and look for patterns in these sequences. This is fascinating work, but it is a long way from a principled understanding of how DNA is turned into organisms.

For example, Luis Rocha and his colleagues at Los Alamos National Laboratory are working on identifying regions of the genome that are similar to each other, based on statistical tests. This kind of similarity mining gives biologists a hint that two parts of the genome may work together at some stage during the process of forming an organism. Similar statistical methods may be useful for recognizing where genes begin and end in a collection of DNA sequences – a problem that’s surprisingly tricky, and may require comparison of human sequences with sequences from related species such as the mouse or the fruit fly.
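
To give a concrete, if toy-level, flavor of what similarity mining involves, here is a minimal sketch in Python – emphatically not Rocha’s actual method, just an illustration of the general idea – that scores the similarity of two stretches of DNA by counting the short subsequences (“k-mers”) they share. The sequences and parameters are made up for illustration.

```python
# Toy illustration of sequence-similarity mining: score two DNA regions by the
# fraction of short subsequences (k-mers) they have in common. Real genome-scale
# similarity mining uses far more sophisticated statistics than this.

def kmer_set(seq, k=6):
    """Collect all length-k subsequences of a DNA string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_similarity(region_a, region_b, k=6):
    """Jaccard similarity of the two regions' k-mer sets (0 = nothing shared, 1 = identical sets)."""
    a, b = kmer_set(region_a, k), kmer_set(region_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    region1 = "ATGCCGTAATCGGATCCGTTAGCAATCG"
    region2 = "ATGCCGTAATCGGATGCGTTAGCAATCG"  # same region with one base changed
    print(kmer_similarity(region1, region2))
```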

Even on the simplest level, the relation between one-dimensional sequences of amino acids and the three-dimensional structures formed from these sequences is hard for scientists to understand. The big problem here is what’s known as “protein folding.” Many structures in DNA encode instructions for the formation of proteins. But no one knows how to predict, from the sequence of amino acids making up a protein, what that protein is going to look like once it folds up in three-dimensional space. This is important because many proteins that look very different on the one-dimensional, molecular-sequence level may look almost identical once they’ve folded up in three dimensions. Thus, by focusing on sequence-level analysis, researchers may be scrutinizing differences that make no difference. Currently, only very few 3-D protein motifs can be recognized at the sequence level.

Basically, we barely understand the simplest stages of the production of 3-dimensional structures out of DNA, let alone the complex self-organizing processes by which DNA gives rise to organisms. This is OK – mapping DNA is still of some value even in this situation – but it must be clearly understood. In practical terms, our lack of knowledge of embryological process greatly restricts the use we can make of observed correlations between genes and human characteristics such as diseases. There are diseases whose genetic correlates have been known for decades, without any serious progress being made toward treatment. For DNA researchers to announce that they’ve mapped the portion of the human genome correlated with a certain disease doesn’t mean very much in medical terms.

Does all this mean that the Human Genome Project is bad – wasted money, useless science? Of course not. However, it does suggest that perhaps the government is allocating its research money in an imbalanced way. By pushing so hard and so fast for a map of the human genome, while not giving a proportionate amount of research money to studies in embryology and the general study of self-organizing pattern formation, the US government is guaranteeing that we are going to arrive at a map of the human genome that we cannot use in any effective way.

And this brings us to some very deep and fascinating questions in the philosophy of science. As the biological theorist Henri Atlan pointed out in an essay written right around the start of the Human Genome Project, the mapping of the human genome is a very reductionist pursuit. In fact it is almost the definition of reductionism -- the construction of a finite list of features characterizing human beings. All of humanity, reduced to an ordered list of bases – imagine that! Wow!

On the other hand, the formation of organisms out of DNA is a very non-reductionist process, which biologists of the nineteenth century attributed to a “vital force” underlying all living beings. Modern scientists have still not come to grips with the scientific basis for this apparent vital force, which builds life out of matter. There are disciplines of science – cybernetics, systems theory, complexity science – which attempt to solve this problem, but these have not been funded nearly as generously as gene mapping, and they have not been linked in any serious way with the work on data analysis of genetic sequences. I believe that the study of embryology has the potential to overthrow many of our established ways of doing science, by shifting the focus of attention to complex, self-organizing processes and the emergence of structure. But this “complexity revolution” is something that the scientific establishment seems determined to put off as long as it possibly can.

In this sense, one can see the Human Genome Project as an outgrowth of modern cultural trends extending beyond the domain of science. It’s an expression of the quest for understanding, and also of the illusion that reductionism is the path to understanding. It’s an expression of our inability as a culture to come to grips with the wholeness of life and being, and focus on the seemingly magical processes by which life is formed from the nonliving, and structure emerges from its absence.

But, the wonderful thing about science is that it’s self-correcting. Ultimately science is all about the data and the conclusions that can be drawn from it. We’ll go ahead collecting data on the human genome, but year by year, the biological community will place more and more focus on how the genome interacts with its chemical environment to self-organize into the organism. Some new biologists coming into the field already have the feeling that gene sequencing is old hat. New technologies like microarrays allow us to study – only partially and haltingly right now, but it’s a start – how genes interact and interregulate in the actual living process of the cell. I think of this as “the new genetics” – genetics that reaches up and tries to be systems biology. And in time it will succeed. Eventually, as research along these lines matures, we really will understand not just what sequences of bases make up a human being’s genetic material, but how a human being is made.

It’s hardly shocking that post-genomic biology is enabled by advanced computer technology every step of the way. After all, most branches of physical science have become thoroughly computerized – little of modern chemistry and physics could exist without computers. But it’s instructive to see just how many roles computers have played in the new genetics. First, it’s only because of recent advances in experimental apparatus design, driven by computer engineering and robotics, that we are able to gather significant amounts of data about how genes build organisms. New “microarray” technologies like DNA chips (built like silicon chips) and spotted microarrays (built with robot arms) allow us to collect information regarding the expression of genes at different times during cell development. But this data is too massive and too messy for the human mind to fully grasp. Sophisticated AI software, used interactively by savvy biologists, is needed to analyze the results.

It’s not hard to see what the trend is here. Biological experiments, conducted using newfangled computer technology, are spinning us an increasingly detailed story of the microbiological world – but it’s a story that only increasingly advanced AI programs will be able to understand in full. Only by working with intelligent software will we be able to comprehend the inner workings of our own physical selves.

Gene therapy, the frontier of modern medicine, relies on the ability to figure out what combinations of genes distinguish healthy cells from diseased cells. This problem is too hard for humans to solve alone, and requires at the very least advanced statistical methods, and at most full-on computer cognition. The upshot? Rather than fearing AIs as movies like 2001 have urged us to do, we may soon be thanking AI programs for helping find the cure for cancer.

Artificial intelligence programs have never even come close to equaling humans’ common sense about the everyday world. There are two main reasons for this. First, most AI programs have been written to excel only in one specialized kind of intelligence – like playing chess, or diagnosing diseases -- rather than to display general intelligence. And second, even if one does seek to create an AI program with general intelligence, it still is just a software program without any intuition for the human world. We Homo sapiens sapiens have a special feeling for our physical and social environment -- for simple things like the difference between a cup and a bowl, or between happiness and contentment. AI programs, even those that push towards general intelligence, can’t help lacking this intuition.

But the world of molecular biology is not particularly intuitive to human beings. In fact it’s complex and forbidding. It has much of the ambiguity of everyday life – there is not as much agreement as one would think about the meanings of various technical terms in genetics and molecular biology. But this ambiguity is not resolved by a simple tacit everyday understanding, only by a very advanced scientific intuition. The number of different patterns of genetic structure and activity boggles even the ablest human mind. In this domain, an artificial intelligence has much more to offer than in the world of everyday human life. Here in the microworld, human intuition is misleading as often as it is valuable. Artificial intuition can be tuned specifically to match the ins and outs of introns and exons, the turns and twists of DNA.

The new genetics has many aspects, but perhaps the most exciting of them all is the emerging study of gene and protein expression. The terminology here is both evocative and appropriate: Just as with a person, it’s not what a gene does when it’s just sitting there that’s interesting, it’s what a gene does when put in a situation where it can express itself!

At any given moment, most genes are quiet, doing nothing. But some are expressed, some are active. Now, using the new experimental tools, we can tell which. We can see how many genes are expressed at a given moment … and then a little later … and then a little later. In this way we can make a kind of map of genetic dynamics as it evolves. And by analyzing this map, using advanced computer software, a lot can be learned about how genes go about their business. Which genes tend to stimulate which other genes. Which ones tend to act in groups. Which ones inhibit which other ones, preventing them from being expressed. And by applying the same analysis tools to proteins instead of genes, one can answer the same questions about proteins, the molecules that genes create and send around to do the actual business of building cells. These kinds of complex interactions between genes, and between genes and proteins, are the key to the decoding of genomes into organisms – which is, after all, what genomes are all about.
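
To make this a little more concrete, here is a minimal sketch in Python of one of the simplest things an analysis program can do with such a map: compute the correlation between pairs of gene expression profiles measured over time. Strong positive correlation hints that two genes act together; strong negative correlation hints at inhibition. The gene names and numbers are invented for illustration, and real analyses go far beyond simple correlation.

```python
# Toy co-expression analysis: genes whose expression rises and falls together
# across time points are candidates for acting as a group; strong negative
# correlation hints that one may inhibit the other.
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical expression levels for three genes at six time points.
profiles = {
    "geneA": [0.1, 0.4, 0.9, 1.3, 1.2, 1.1],
    "geneB": [0.2, 0.5, 1.0, 1.2, 1.3, 1.0],   # rises and falls with geneA
    "geneC": [1.4, 1.1, 0.7, 0.3, 0.2, 0.2],   # falls as geneA rises
}

for g1 in profiles:
    for g2 in profiles:
        if g1 < g2:
            print(g1, g2, round(pearson(profiles[g1], profiles[g2]), 2))
```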

All this complexity is implicit in the genetic code itself, but we don’t know how to interpret the code. With microarrays, we can watch the genetic code interpret itself and create a cell, and by analyzing the data collected in this process, we can try to figure out exactly how this process of interpretation unfolds. And the potential rewards are great – the practical applications are tremendous, from drug development and disease diagnosis to genetic engineering and beyond.

It’s a straightforward enough idea, but the practical pitfalls are many. A huge host of tools from mathematics and computer science has been unleashed on the problem, both by researchers at major academic institutions and by companies like Rosetta Inpharmatics (recently acquired by Merck, the major pharmaceutical firm) and Silicon Genetics, a gutsy and clever California start-up. New data analysis techniques come out every couple of months, each with its own strengths and weaknesses.

 
 
 

[Figure: Affymetrix GeneChip System]

It would be hard to overestimate the revolutionary nature of the new experimental tools – microarrays -- underlying the gene expression revolution. And the same tools, with minor variations, are also being made to work for proteomic analysis, the study of protein expression. For the first time, with these new devices, biologists are able to study thousands or even millions of different molecules at once, and collect the results in a systematic way.

Chemists have long had methods for carrying out many simultaneous chemical reactions. Most simply, trays can be created with 96 or 384 wells, each containing a different chemical and a unique bar code. The last few years, however, have seen the development of methodologies that push far further in this direction – making possible experiments that scientists only a few years ago would have called impossible. The application of these new methodologies to the analysis of gene and protein data has led to a new area of research that may be called massively parallel genomics and proteomics.

Most of the work done so far has been in genomics; the extension to proteomic analysis is more recent. So I’ll talk about microarrays as used for genomic analysis; the proteomics case is basically the same from the point of view of data analysis, though vastly more difficult from the point of view of the experimental apparatus. (Many proteins are much more difficult than DNA to induce to stick to the surfaces used in these instruments.)

There are several types of microarrays used in genomics, but they all embody a common methodology. Single stranded DNA/RNA molecules are anchored by one end to some kind of surface (a chip or a plate depending on the type of apparatus). The surface is then placed in a solution, and the molecules affixed to the chip will seek to hybridize with complementary strands (“target molecules”) floating in the solution. (Hybridization refers to the formation of base pairs between complementary regions of two strands of DNA that were not originally paired).
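
The chemistry that makes all this work is base-pair complementarity: A pairs with T, C pairs with G, and a probe binds a target whose sequence is (roughly) the probe’s reverse complement. Here is a tiny Python sketch of that core idea, using made-up sequences:

```python
# Hybridization in miniature: a probe binds a target that matches the probe's
# reverse complement. Real hybridization tolerates mismatches, which is one
# source of the noise discussed below.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """The strand that would pair with `seq`, read in the conventional direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def hybridizes_exactly(probe, target):
    """True if the target is a perfect complement of the probe."""
    return target == reverse_complement(probe)

print(reverse_complement("ATGCCGTA"))              # TACGGCAT
print(hybridizes_exactly("ATGCCGTA", "TACGGCAT"))  # True
```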

Affymetrix’s technology, pioneered by Dr. Stephen Fodor, involves making DNA chips in a manner similar to the manufacture of semiconductor chips. A process known as “photolithography” is used to create a huge number of molecules, directly on a silicon wafer. A single chip measuring 1.28 cm x 1.28 cm can hold more than 400,000 “probe” molecules. The procedure of gene chip manufacture has been fully automated for a while now, and Affymetrix manufactures 5,000-10,000 DNA chips per month.

Affymetrix DNA chips have a significant limitation in terms of the size of the molecules that can be affixed to them. So far they’re normally used with DNA/RNA segments of length 25 bases or less. Also, they are very expensive. It currently costs about $500,000 to fabricate the light masks for a new array design, so the technology is most appropriate when the same chip design needs to be used again and again. The main example of this kind of use case is disease diagnosis.

On the other hand, spotted microarrays, first developed by Pat Brown at Stanford, are ordinary microscope slides on which robot arms lay down rows of tiny drops from racks of previously prepared DNA/RNA samples. At present this technology can lay down tens of thousands of probe molecules, at least an order of magnitude fewer than what Affymetrix can achieve. The advantage of this approach is that any given DNA/RNA probe can be hundreds of bases long, and can, in principle, be made from any DNA/RNA sample.

Note the key role of computer technology in both of these cases. Affymetrix uses a manufacturing technique derived from the computer hardware industry, which depends on thorough and precise computer control. Spotted microarrays depend as well on the inhuman precision of robot arms, controlled by computer software. Massively parallel genomics, like the mapping of the human genome itself, is a thoroughgoing fusion of biology and computer science – only here the emphasis is on computer engineering and hardware, whereas gene mapping relied upon fancy software algorithms.

There are other approaches as well. For instance, Agilent Technologies, a spin-off from HP, is manufacturing array makers using ink-jet printer technology. Their approach is interesting in that it promises to make practical the synthesis of a single instance of a given array design. Lynx Corporation is pursuing a somewhat Affymetrix-like approach, but circumventing Affymetrix’s patents by using addressable beads instead of a silicon wafer. And so forth. Over the next few years we will see a lot of radical computer-enabled approaches to massively parallel genomics, and time will tell which are most effective.

So how are these massively parallel molecule arrays used? Let’s suppose that, one way or another, we have a surface with a number of DNA/RNA molecules attached to it. How do we do chemical reactions and measure their results?

First, the target molecules are fluorescently labeled, so that the spots on the chip/array where hybridization occurs can be identified. The strength of the fluorescence emanating from a given region of the surface is a rough indicator of the amount of target substance that bound to the molecule affixed to that region. In practical terms, what happens is that an image file is created, a photograph of some sort of the pattern of fluorescence emanating from the microarray itself. Typically the image file is then “gridded”, i.e. mapped into a pixel array with a pixel corresponding to each probe molecule. Then, there is a bit of black art involved in computing the hybridization level for a spot, involving various normalization functions that seem to have more basis in trial-and-error than in fundamentals.
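
As a rough illustration of what that per-spot computation might look like – a deliberately simplified recipe, not the normalization pipeline of any particular instrument – consider the following Python sketch, which averages the pixels over a spot and subtracts a local background estimate:

```python
# Simplified per-spot quantification: mean signal over the spot's pixels minus
# the mean of the surrounding background pixels, clipped at zero. Actual
# microarray image analysis involves considerably more "black art" than this.

def spot_intensity(spot_pixels, background_pixels):
    """Background-subtracted mean intensity for one spot on the array."""
    signal = sum(spot_pixels) / len(spot_pixels)
    background = sum(background_pixels) / len(background_pixels)
    return max(signal - background, 0.0)

# Hypothetical pixel values for one spot and the ring of pixels around it.
spot = [812, 790, 845, 801, 830]
surround = [120, 131, 118, 125]
print(spot_intensity(spot, surround))   # ~692.1
```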

This data is very noisy, however. To get more reliable results, researchers generally work with a slightly more complex procedure. First, they prepare two related samples, each of which is colored with a different fluorescent substance (usually, one green, one red). They then compare the relative amounts of expressed DNA/RNA in the two samples. The ratio of green to red at a given location is a very meaningful number. Using this ratio is a way of normalizing out various kinds of experiment-specific “noise”, on the assumption that these noise factors will be roughly constant across the two samples.
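
In practice the ratio is usually expressed as a logarithm, so that over- and under-expression are treated symmetrically. Here is a minimal sketch of the idea (the pseudo-count is one common, but by no means universal, way to keep near-zero spots from producing wild values):

```python
# Two-color expression ratio on a log-2 scale: +1 means roughly twice the
# expression in the red-labeled sample, -1 roughly twice in the green-labeled one.
import math

def log_ratio(red_intensity, green_intensity, pseudo=1.0):
    """Log-2 ratio of the two channel intensities, with a small pseudo-count."""
    return math.log2((red_intensity + pseudo) / (green_intensity + pseudo))

print(round(log_ratio(400.0, 100.0), 2))   # ~1.99 -> about four times more expressed in the red sample
print(round(log_ratio(100.0, 100.0), 2))   # 0.0  -> no apparent difference
```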

But even this ratio data is still highly noise-ridden, for a number of reasons beyond the usual risk of experimental error or manufacturing defects in the experimental apparatus. For one thing, there are many different factors influencing the strength of the bond formed between two single stranded DNA/RNA molecules, such as the length of the bonded molecules, the actual composition of the molecules, and so forth. Errors will occur due to the ability of DNA to bind to sequences that are roughly complementary but not an exact match. This can be controlled to some extent by the application of heat, which breaks bonds between molecules – getting the temperature just right will break false positive bonds and not true positive ones. Other laboratory conditions besides temperature can have similar effects. Another problem is that the “probe molecules” affixed to the surface may fold up and self-hybridize, thus rendering them relatively inaccessible to hybridization with the target.
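
Getting the temperature right is partly a matter of each probe’s “melting temperature,” which depends on its base composition. For short probes there is a classic rough rule of thumb, the Wallace rule: about 2°C for each A or T and 4°C for each G or C. The sketch below implements only that rule of thumb; real probe design relies on much more careful thermodynamic models.

```python
# Wallace rule: a crude estimate of the melting temperature of a short
# oligonucleotide from its base composition alone (2 C per A/T, 4 C per G/C).

def wallace_tm(probe):
    """Rough melting-temperature estimate, in degrees Celsius, for a short probe."""
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2 * at + 4 * gc

print(wallace_tm("ATGCCGTAATCGGATCCGTTAGCAA"))   # a 25-base probe, like those on an Affymetrix chip
```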

All these issues mean that a single data point in a large microarray data set cannot be taken all that seriously. The data as a whole is extremely valuable and informative, but there are a lot of things that can go wrong and lead to spurious information. This means that data analysis methods, to be successfully applied to microarray data, have got to be extremely robust with respect to noise. None of the data analysis methods in the standard statistical and mathematical toolkits pass muster, except in very limited ways. Much more sophisticated technology is needed – yes, even artificially intelligent technology, software that can build its own digital intuition as regards the strange ways of the biomolecular world.

The payoff for understanding this data, if you can do it, is huge. These data can be used for sequencing variants of a known genome, or for identifying a specific strain of a virus (e.g. the Affymetrix HIV-1 array, which detects strains of the virus underlying AIDS). They can be used to measure the differences in gene expression between normal cells and tumor cells, which helps determine which genes may cause/cure cancer, or identify which treatment a specific tumor should respond to best. They can measure differences in gene expression between different tissue types, to determine what makes one cell type different from another. And, most excitingly from a scientific viewpoint, they can be used to identify genes involved in cell development, and to puzzle out the dynamic relationships between these genes during the development process.
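
The simplest version of the normal-versus-tumor comparison can itself be sketched in a few lines: for each gene, ask whether its expression values in the two groups of samples differ by more than the within-group spread would suggest. The numbers below are invented, and a genuine analysis must be far more robust, for all the noise-related reasons just discussed.

```python
# Bare-bones differential expression: a t-like score comparing one gene's
# expression in normal versus tumor samples (difference of means over pooled spread).
import math

def t_score(group1, group2):
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    return (m2 - m1) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical expression measurements for one gene across patients.
normal_samples = [1.1, 0.9, 1.0, 1.2, 0.8]
tumor_samples = [2.3, 2.8, 1.9, 2.5, 2.6]
print(round(t_score(normal_samples, tumor_samples), 2))   # a large score flags the gene for closer study
```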

 

We’ve seen that the actual experimental apparatuses being used in postgenomic biology all come in one way or another out of the computer industry. And that the analysis of large, noisy, complex data sets like the ones microarrays produce can only be carried out by sophisticated computer programs running on advanced machines – no human being has the mind to extract subtle patterns from such huge, messy tables of numbers. There is also another crucial dependency on computer technology here: the role of the Internet. The biology community has come to use the Net very heavily for data communication – without it, there is no way research could proceed at anywhere near its current furious pace.

Perhaps you’re a bit of a computer hacker and you want to try out your own algorithms on the data derived from microarray experiments on the yeast genome during cell development. Well, you’re in luck: the raw data from these experiments are available online at http://cmgm.stanford.edu/pbrown/sporulation/additional/spospread.txt. Download it and give it a try! Or check out Rosetta’s site, www.rii.com, and download some sample human genome expression data. Or, perhaps your interests are less erudite, and you’d simply like to view the whole human genome itself? No problem, check out the Genome Browser at http://genome.ucsc.edu/goldenPath/hgTracks.html.

But gene sequence information, and the quantitative data from gene expression experiments, is only the beginning. There’s also a huge amount of non-numerical data available online, indispensable to researchers in the field. When biologists interpret microarray data, they use a great deal of background knowledge about gene function – more and more knowledge is coming out every day, and a huge amount of it is online for public consumption, if you know where to look. Current automated data analysis tools tend to go purely by the numbers, but the next generation of tools is sure to boast the ability to integrate numerical and non-numerical information about genes and gene expression. As preparation for this, biologists in some areas are already working to express their nonquantitative knowledge in unambiguous, easily computer-comprehensible ways.

This exposes the dramatic effect the Net is having on scientific language. Yes, the Net is hastening the establishment of English as the world’s second language, but something more profound than that is happening simultaneously. The Net demands universal intercomprehensibility. In biological science, this feature of Internet communications is having an unforeseen effect: it’s forcing scientists working in slightly different areas to confront the idiosyncrasies of their communication styles.

Compared to ordinary language, scientific language is fairly unambiguous. But it’s far from completely so. An outsider would understandably assume that a phrase like “cell development” has a completely precise and inarguable meaning – but biology is not mathematics, and when you get right down to it, some tribes of researchers use the term to overlap with “cell maintenance” more than others do. Where is the borderline between development and maintenance of a cell? This issue and hundreds of others like it have come up with fresh vigor now that the Internet is shoving every branch of biological research into the faces of researchers in every other branch. As a result of the Internetization of biological data, a strong effort is underway to standardize the expression of non-numerical genetic data.

One part of this is the Gene Ontology Project, described in detail at http://www.geneontology.org/ . In the creation of this project, one thorny issue after another came up – a seemingly endless series of linguistic ambiguities regarding what would at first appear to be very rigid and solid scientific concepts. What is the relation of “development” to “maintenance”, what does “differentiation” really mean, what is the relation of “cell organization” to “biogenesis”, etc. The outcome of this quibbling over language? A much more precise vocabulary, a universal dictionary of molecular biology. Ambiguity can’t be removed from the language used to describe cells and molecules, but it can be drastically reduced through this sort of systematic effort. And the result is that genes from different species can be compared using a common unambiguous vocabulary. The fly, yeast, worm, mouse and mustard genomes have all been described to a significant extent in standardized Gene Ontology language, and the human genome can’t be far behind. Soon enough, every gene of every common organism will be described in a “Gene Summary Paragraph”, presenting qualitative knowledge about what the gene does in carefully controlled language -- language ideally suited for digestion by AI programs.
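
Once knowledge is expressed in a controlled vocabulary like this, comparing genes across species becomes, at least in the simplest case, a matter of elementary set operations. Here is a toy sketch – the terms and gene names below are invented, not drawn from the actual Gene Ontology database:

```python
# Toy cross-species comparison: once genes are annotated with standardized
# terms, finding their shared functions is a simple set intersection.

annotations = {
    ("fly", "geneX"):   {"cell development", "signal transduction"},
    ("mouse", "geneY"): {"cell development", "DNA repair"},
    ("yeast", "geneZ"): {"DNA repair", "metabolism"},
}

def shared_terms(gene1, gene2):
    """Standardized terms annotated to both genes."""
    return annotations[gene1] & annotations[gene2]

print(shared_terms(("fly", "geneX"), ("mouse", "geneY")))   # {'cell development'}
```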

The standardization of vocabulary for describing qualitative aspects of genes and proteins is a critical part of the computerization of biological analysis. Now AI programs don’t have to have a sensitive understanding of human language to integrate qualitative information about gene function into their analyses of gene sequences and quantitative gene expression data. It’s only a matter of years – perhaps even months, in some cutting-edge research lab – before the loop is closed between AI analysis of genomic data and the automated execution of biological experiments. Now, humans do experiments, use sophisticated algorithms to analyze the results, and then do new experiments based on the results the algorithms suggest. But before too long, the human will become redundant in many cases. Most of the experiments are predominantly computer-controlled already. The software will analyze the results of one experiment, then order up another experiment. After a few weeks of trial and error, it will present us humans with results about our own genetic makeup. Or, post the results directly to the Web, where other AIs can read them, perhaps faster than humans can.

All this abstract, complicated technology conspires to provide practical solutions to some very real problems. Genetic engineering is one of the big potential uses. Once we understand how genes work to build organisms, we’ll be able to build new kinds of organisms: genetically engineered “Frankenfoods” at first, and eventually new kinds of dogs, cats and people – raising all kinds of serious ethical concerns.

But there are also applications that are ethically just about unquestionable. From an economic point of view, the main value of microarrays and related technologies right now is as part of that vast scientific-financial machine called the drug discovery process. The path from scientific research to the governmental approval of a new drug is a long, long, long one, but when it’s successfully traversed, the financial rewards can be immense.

Gene therapy is a new approach to curing diseases, and one that hasn’t yet proved its practical worth in a significant way. Although it hasn’t lived up to the over-impatient promises that were made for it 10 years ago, biologists remain widely optimistic about its long-term potential – not only for curing "classic" hereditary diseases, but also widespread diseases such as cancer and cardio-vascular diseases. The concept is scientifically unimpeachable. Many diseases are caused by problems in an individual’s DNA. Transplanting better pieces of DNA into the cells of a living person should be able to solve a lot of problems. A great deal of research has been done regarding various methods to implant genes with the desired characteristics into body cells. Usually the injected gene is introduced within the cell membrane, but resides outside the nucleus, perhaps enmeshed in the endoplasmic reticulum. Fascinatingly, the result of this is that the gene is still expressed when the appropriate input protein signal is received through the receptors in the cell membrane, even though the gene is not physically there in the nucleus with the rest of the DNA.

Aside from the practical issues of how to get the DNA in there in various circumstances, though, there’s also the major issue of figuring out what DNA is responsible for various diseases, and what to replace it with. To understand this, in the case of complex diseases, requires understanding how DNA is decoded to cause cells of various types to form. And this is an understanding that has been very, very hard to come by. The presence of gene and protein expression data from microarray experiments, and sophisticated bioinformatics software, renders it potentially achievable, though still by no means trivial. More precise microarrays and more intelligent data analysis software may render the problem downright straightforward 5, 10, or 20 years from now. No one knows for sure.

One thing biologists do, in trying to discover gene therapies, is to compare the genetic material of healthy and disease-affected individuals. A key concept here is the “genetic marker” – a gene or short sequence of DNA that acts as a tag for another, closely linked, gene. Such markers are used in mapping the order of genes along chromosomes and in following the inheritance of particular genes: genes closely linked to the marker will generally be inherited with it. Markers have to be readily identifiable in the organism the DNA builds, not just in the DNA – some classic marker genes are ones that control phenomena like eye color.

Biologists try to find the marker genes most closely linked to the disease, the ones that occur in the affected individuals but not in the healthy ones. They narrow the markers’ locations down step by step. First they find the troublesome chromosome, then they narrow the search and try to find the particular troublesome gene within that chromosome….

It used to be that genetic markers were very hard to find, but now that the human genome is mapped and there are technologies like microarrays, things have become a good bit simpler. Some markers are now standard -- and Affymetrix sells something called the HuSNP Mapping Array, a DNA microarray with probes for many common markers across the human genome already etched on its surface, ready for immediate use. If you have samples of diseased tissue, you can use this microarray to find whether any of a large number of common markers tend to coincide with it. In the past this would have required thousands or millions of experiments, and in many cases it would have been impossible. Now it’s easy, because we can test in parallel whether any of a huge number of gene sequences is a marker for a given disease-related gene. Right now, scientists are using this approach to try to get to the bottom of various types of cancer, and many other diseases as well.
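
The underlying logic of the marker hunt is simple enough to sketch: for each marker on the array, count how often it turns up in affected versus healthy individuals, and rank the markers by how lopsided that split is. The counts below are invented, and a real study would use proper association statistics and correct for testing thousands of markers at once.

```python
# Toy marker-disease association: rank markers by the difference in their
# frequency between affected and healthy groups.

def association_score(affected_with, affected_total, healthy_with, healthy_total):
    """Difference in marker frequency between the affected and healthy groups."""
    return affected_with / affected_total - healthy_with / healthy_total

# Hypothetical counts for three markers in 100 affected and 100 healthy people.
markers = {
    "marker_1": (81, 100, 24, 100),
    "marker_2": (52, 100, 49, 100),
    "marker_3": (33, 100, 70, 100),
}

ranked = sorted(markers, key=lambda m: abs(association_score(*markers[m])), reverse=True)
print(ranked)   # marker_1 and marker_3 look interesting; marker_2 does not
```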

If a disease is caused by one gene in particular, then the problem is relatively simple. One has to analyze a number of tissue samples from affected and healthy people, and eventually one’s computer algorithms will find the one gene that distinguishes the two populations. But not all diseases are tied to one gene in particular – and this is where things get interesting. Many diseases are spread across a number of different genes, and have to do with the way the genes interact with each other. A disease may be caused by a set of genes, or, worse yet, by a pattern of gene interaction, which can come out of a variety of different sets of genes. Here microarrays come in once again, big time. If a disease is caused by a certain pattern of interaction, microarray analysis of cell development can allow scientists to find that pattern of interaction. Then they can trace back and find all the different combinations of genes that give rise to that pattern of interaction. This is a full-on AI application, and it pushes the boundaries of what’s possible with the current, very noisy microarray data. But there’s no doubt that it’s the future.

Gene therapy itself is still in its infancy, and so is microarray technology, and so is AI-driven bioinformatics. But all these areas are growing fast – fast like the Internet grew during the 1990s. Very fast. Exactly where it’s all going to lead, who knows. But it’s a pretty sure bet that the intersection of medicine, genetics, proteomics, computer engineering, AI software, and robotics is going to yield some fascinating things. We’re beginning to see the rough outlines of early 21st-century science.

 
 
 

[Figure: Stephen Fodor]

One of the more interesting figures in the history of microarray technology is Stephen Fodor, the founder of Affymetrix. Fodor is not interesting in the same way that, say, Hugo de Garis is – he’s not a colorful character, wild-eyed with far-off visions. Quite the opposite: he illustrates the ease with which, in the modern biotech industry, radical scientific innovation and the conservative ways of the corporate world have come to seamlessly interact.

The biotech revolution, like the computer revolution with which it has become thoroughly entangled, involves an unprecedentedly tight interaction between the worlds of science, engineering and business. And this interaction is not just about partnerships between organizations, it’s about individual human beings stretching their minds and personalities to encompass diverse, often divergent perspectives.

There’s the creative and exploratory world-view of the scientist, in which solid experimental results or elegant theories are the proof of success. There’s the pragmatic and functional perspective of the engineer, in which a high-quality working system is worth more than anything. And then there’s the sometimes cut-throat vantage of the businessman, in which the bottom line is always a financial one, and Bill Gates is vastly more valuable than Einstein. Typically, historically, these different orientations toward life have resided in different people’s brains. But more and more people each year, in order to achieve their goals, are being forced to internalize all these perspectives, and weave them together into an integrative approach.

In the domain of biotechnology, Fodor wonderfully exemplifies this emerging synthesis. He is a scientist whose groundbreaking scientific/engineering achievements led him into the business world, where he’s now managing the development and marketing of technology based on his initial breakthroughs. His firm, Affymetrix, was one of the most promising of the biotech start-ups of the late 90’s, and shows no sign of slowing down.

Fodor’s career began like that of a typical overachieving young bioscientist. He received his B.S. in chemistry in 1978 and his M.S. in biochemistry/biophysics in 1982, both from Washington State University – a solid school, though not a world-leading institution. He moved on to Princeton University for his PhD, which he received in 1985. Following a post-doctoral fellowship at Berkeley, he wound up at Affymax Research Institute, where his group led the development of new technologies, oriented towards creating very dense arrays of biomolecules by combining photolithographic methods with traditional chemical techniques. The advantage of packing biomolecules together in very dense arrays is that one can then study a large number of molecules all at once, in a single experiment, as opposed to traditional experimental biology approaches in which one studies one or a handful of molecules at a time. This work was an interesting example of interdisciplinary cross-fertilization of ideas, Fodor’s chemistry and biophysics background spurring him to think about the problem differently than traditional biologists would.

As the work became more and more promising, the potential commercial possibilities became more and more clear. If one could affix a large number of different segments of DNA or protein to some surface, in a tightly packed array, then one could effectively experiment on all of them at once, gathering millions of times as much data as was possible using traditional approaches, where one worked with many fewer pieces of DNA or protein at a time. There were still a lot of technical issues to be worked out, but the viability of the idea was clear. With this in mind, in 1993, Fodor and a group of other Affymax scientists decided to form the firm Affymetrix, dedicated to the creation and dissemination of radical new technology for genomic and proteomic data gathering, based on the research of Fodor and his colleagues.

When Affymetrix was founded, Fodor was Scientific Director, but over time, he found himself becoming more and more involved with the business side of the company, and in 1997 became President and Chief Executive Officer. And the company would seem to have benefited significantly from having a leader with a passion for all aspects of its operations, scientific, engineering, marketing and financial. It’s still the technology, and its potential to transform bioscience as a whole, that gets Fodor most excited. But, a consummate realist, he has realized that focusing on the technology alone is not the optimal way of going about the process of transforming bioscience. Getting the technology out there in use in as many places as possible is just as critical as making the technology effective.

The original motivation for the gene chip work was to create a device that would hold thousands of molecules in place so they could be tested simultaneously to determine which ones were viable drug candidates. Fodor saw how, as he put it, the DNA or protein molecules stuck on the chip could act "as thin strips of molecular Velcro." By seeing which molecules stick to which other ones, one can discover all sorts of things about genes -- detecting mutations, revealing information about diseases or treatments, figuring out which genes interact with which other ones during cell development, etc.

All this began as a chancy, complex experimental procedure and is now fully automated; Affymetrix manufactures 5,000-10,000 DNA chips per month.

And in his spare time, among other things, Fodor is thinking about the ethical aspects of genomics, a discipline that is well-known as an ethical minefield, with new issues like stem cell research and human cloning popping up every day. In the ethics of genetic research, commercial, scientific and engineering perspectives intersect with humanistic and even spiritual issues, and Stephen Fodor and others with his diverse background are uniquely positioned to deal with such issues in an integrative way.

At a 1999 Princeton University symposium on bioethics, he observed that “Having a commercial background brings a different bent to the ethics around the subject….” To illustrate this, he offered an amusing anecdote. “I was talking to a friend of mine,” he said, “whose father used to run a dry cleaning company near New York City. Every day when the clothes came in, he would go through the pockets of the clothes, and see what he found in there. One day he found a hundred dollar bill. He said this raised a serious ethical question -- whether he was going to tell his partner. [i.e., whether he was going to share the $100 with his partner or not] So, ethics is in the eye of the beholder….”

This little story has the empirical directness of the scientist about it. As a pragmatic businessman, Fodor has long since realized that humanistic sentiments don’t make the business world go around. As a rule, the only principle that can be relied upon to mean anything to a corporation is the maximization of shareholder value. As a scientist, he sees this situation quite plainly as an empirical fact.

Why, then, as an ethically concerned individual, is he relatively unworried by the consequence of this cut-throat attitude for the development of biotechnology? It’s simple. He believes that the power of the technology to do good is far greater than its power to do damage.

One ethical worry associated with genetic analysis is that ambitious parents will use it to overengineer their progeny – killing a fetus if, for example, its genes indicate that it won’t be sufficiently musically or athletically talented. Some people find this unproblematic, others find it repellent. As for Fodor, when asked if there should be regulations on using DNA chips for prenatal screening, he hems and haws a bit, observing that “Prenatal screening is a bigger question than just these chips….” His main concern in this connection is “… personal privacy. I’m not an advocate whatsoever of the possibility of health care organizations doing screening and databasing and letting you know what the options are. I think the best case is that people get the information themselves and decide what to do with it, that the information is in their control. The levels of privacy I think have to be worked out.”

While information privacy is an important concern, it’s perhaps idealistic to think that genetic information, among all medical data, is going to be kept from the vast health care establishment. All in all, where this sort of issue is concerned, one gets the impression that Fodor is slightly bored and not 100% engaged. Rather than worrying about what-ifs, his focus is on building the technology and doing the best things he can with it, and what society as a whole makes of it, is indeed beyond his control. The key point, in his view, is the vastness and diversity of “wonderful commercial and scientific possibilities. We’re in the early days of this…. There’s a tremendous number of medical and scientific applications…. What are the good things you can do? What are the values you can create for people going forwards?” This is what gets Stephen Fodor excited, not worrying about negative possibilities.

With a salary and bonus package pushing a half million dollars a year, and many tens of millions in stock options (much of it fully vested), Fodor is certainly profiting personally from his turn towards the commercial world. And he is clearly experiencing many money- and business-oriented distractions, such as a recent lawsuit against Incyte (a particularly perplexing lawsuit given Affymetrix’s ongoing partnership with this firm). But in practical terms, in spite of the inevitable over-busyness of his multifaceted role, he is doing his best to work toward realizing beneficial applications of his technology as well as toward personal and corporate profit.

So far the facts would seem to support Fodor’s views about the potential for good in his technology. It is tremendous. The medical applications of DNA chips may well be revolutionary. As Fodor says, "Affymetrix was founded on the belief that understanding the correlation between genetic variability and its role in health and disease would be the next step in the genomics revolution." And the results to back up this vision have started coming in.

For instance, two years ago researchers at the Whitehead Institute used DNA chips to distinguish different forms of leukemia based on patterns of gene activity found in cancerous blood cells. This approach has led to real practical benefits, for example in some cases reversing the incorrect diagnoses made by other, cruder methods. And this is only the barest beginning. As Dr. Lander of the Whitehead Institute says, "the research program aims to lay a foundation for the `post-genome' world, when scientists know the complete sequence of DNA building blocks that make up the human genome." Mapping not only what is in the genome, but what the things in the genome do, is the real secret to comprehending and ultimately curing cancer and other diseases.

One of the more interesting developments in the medical application of DNA chips is the creation of the Affymetrix spin-off company, Perlegen Sciences Inc. Perlegen’s goal is to use DNA chips to help understand the dynamics underlying various diseases – starting out with the rare disease “ataxia telangiectasia” (A-T), with which the two sons of Perlegen co-founder Brad Margus are afflicted.

Ataxia is a word for loss of muscular coordination; telangiectasia refers to the small blood vessels that pop up on the skin and eyes of A-T victims. A-T typically affects youths; 40% of A-T children develop cancer, and few live past their 20s. Margus was the boss of a $100 million-a-year shrimp processing company when he discovered his sons were afflicted with A-T -- and, in a remarkably systematic and dedicated fashion, began to devote more and more of his life to researching the biological foundations of the disease. He helped raise millions of dollars for research on A-T and its genetic basis, a quest that ultimately led him to Stephen Fodor.

Affymetrix array chips, it seemed to Margus and his bioscientist collaborators, could be used to study the way different individuals with A-T would react to different medications. The chips could vastly accelerate the drug discovery process, by allowing so many experiments to be run in parallel. Of course, this is exactly the kind of humanistically valuable application of DNA chip technology that makes Stephen Fodor happiest. It didn’t take much effort to convince Fodor that Affymetrix should help Margus in his quest, by helping to form Perlegen.

With humanistic applications like this swirling all around him, it’s not hard to see why Fodor is relatively unruffled by the ethical dilemmas that some find in genetic research. Are there potential dangers in this technology? To be sure. But there is also tremendous potential to help people. And so far there is no doubt that the positive far outweighs the negative. DNA chips have helped find cures for diseases, and they haven’t harmed anybody.

Of course, as Fodor says, this is just the beginning. We’ve mapped the genome, and now, baby step by baby step, we’re starting to understand the process by which strands of genetic material interact with other molecules to form organisms like us. As we move along this path of understanding, we’ll be able to cure more and more diseases, and more dramatic possibilities for genetic screening and genetic modification will open up. One can only hope that the optimism and focus on positive applications that Dr. Stephen Fodor embodies will continue to carry the day.

 

“…these new tools and high throughput techniques have unleashed a flood of biological data - information that continues to double in size every 12 months. …Looking forward, we are confident that informatics will represent the next quantum leap in drug discovery. We expect the market for informatics to reach $4 billion by 2004.”

Michael Clulow, UBS Warburg

“…There’s a concern on the part of biotech and pharmaceutical companies that they’re paying millions of dollars to generate millions of data points but not getting the value out of that data because they can’t analyze it with contemporary tools”

David K. Stone, AGTC Funds



I always thought genetics was fascinating – what curious young nerd wouldn’t? – but it wasn’t until mid-2000 that I began to seriously consider genetics and proteomics as a research area I might want to focus on.

Well before Webmind Inc. folded, I had grown seriously disenchanted with the application areas toward which we’d chosen to orient our products. Business success was proving elusive in spite of the fact that our products outperformed the competition’s, and – a largely separate issue – it didn’t seem that our products were evolving in directions that would make maximal use of our most original technology, the Webmind AI Engine. Financial prediction was fun, and the Webmind MP seemed to work outstandingly well -- but the essence of our approach was the use of news to predict market movements, and I wanted to get away from natural-language-processing-centric applications. The other products we were making – Webmind Classification System and Webmind Search (a search engine that was never released but was used internally within the company) – were even more human-language-centric. But the more we worked on our AI system, the more we on the R&D side realized that starting out with human-language-based products was putting the cart before the horse. We needed an application domain that had a rich variety of nonlinguistic data, that the system could reason about, building up a domain-specific knowledge base that could then be used to experientially ground linguistic knowledge – little by little, step by step, much as a human baby grounds its early linguistic knowledge in its observations about the nonlinguistic physical world it’s embedded in.

The finance domain did have its strong points – words about market movements could be correlated with actual observed market movements, for example, providing an elementary form of symbol grounding. But too often, in financial texts, the language was imprecise and evocative rather than precisely descriptive. More and more often, my thoughts began shifting to biology. There was so much biological data being generated – and it was so diverse. It was exciting to think of all this data being fed into an integrative AI system, a system capable of using it to draw new and interesting conclusions.

Of course, it didn’t take long to realize that biological data wasn’t really an ideal application domain either. It’s a great testing ground for integrative cognition, and even perception, but there are too few opportunities for an AI system to act. Actions such as sending information to human users are obviously present, and on a much slower time scale, a bio-focused AI can control robot arms running biological lab experiments. But all this is nothing like the intense perception/action/cognition interactivity that a baby gets from the physical world. So the idea of biological data as an application domain definitely doesn’t displace the EIL, Baby Novamente approach. But it is a worthy complement, and I’ve spent a decent portion of my time over the last year thinking about how to apply Novamente to analyze genetics data, and designing products and running prototype data analysis experiments in this direction.

Whether these products will ever get built, and this line of research continued, is not certain at this point. At the time of writing (early 2002), we’re seeking business funding for Biomind LLC, a company focused on these Novamente-based biology applications – but our quest for funding may fail, or we may be pushed in a different direction for one reason or another. But no matter how the cards fall, the process of exploring the potential applications of Novamente to genetics and biology generally will not have been fruitless. In working out these potential Novamente applications, we have seen a great deal of the future of AI-enhanced biology – and it’s an exciting future indeed.

The amount of data modern biologists are collecting is truly immense. About 100 microorganisms have been completely sequenced, with many more in the pipeline. The human genome and other eukaryotic genomes, such as those of yeast, Drosophila, and C. elegans, are now available online. New sequencing projects begin almost daily. Microarrays and mass spectrometry produce massive datasets, which lead to massive data analysis problems. With each genomic sequence, there are more genes, more RNAs, more proteins, more phenotypes, and more data in databases. It is a blessing to have such data, but only if it can be accessed, integrated, and used to develop new knowledge. In May 2001, a couple of months into my post-Webmind-Inc. phase, I realized that this was a mission worthy of a real AI. And what better way to start off an AI with a good attitude toward humans, than to have it focus its early energies on analyzing human cells, with a view toward curing human diseases and helping humans to live longer?

There is no doubt that existing biological databases contain the secrets to hundreds of undiscovered drugs. What I gradually realized during 2001 was what was needed to draw these secrets out. Something simple yet elusive: data analysis software that automatically deploys this massive data pool within the experimental data analysis process. This new kind of feedback between wet lab work and advanced data analysis, once it’s achieved, will lead to a raft of new discoveries, making the pharmaceutical progress of the last 10 years seem like the merest beginning. As a single, very important example, current tools make it very difficult to find sets of genes that can collectively function as drug targets (sites where gene-therapy drugs can act to interfere with a particular disease process) – whereas an integrative data analysis framework will in time make this kind of discovery routine.

Throughout Fall 2001 I talked extensively about these ideas with Maggie Werner-Washburne, a deeply insightful yeast geneticist in the University of New Mexico biology department. The more we talked, the more I realized what an excellent Novamente application this could be. For no single magic bullet, no one bioinformatic trick, is going to provide the deep, dynamic, goal-directed information integration that modern biology requires. What is needed is a combination of four ingredients:

· database integration (a toy sketch follows this list)

· visualization tools

· natural language processing (NLP) tools that extract information from research papers, adding information to databases

· automated inference tools that synthesize information from different databases
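
To make the first ingredient a bit more concrete, here is a minimal sketch of database integration at its simplest: merging per-gene records from two hypothetical public-database exports, keyed on a shared gene identifier. The data and function names are invented for illustration; real integration must also reconcile conflicting identifiers, formats, and evidence levels.

```python
# Toy sketch of database integration (invented data and function names, not the
# actual BiomindDB build process): merge per-gene records from two hypothetical
# public-database exports, keyed on a shared gene identifier.

def merge_gene_records(*sources):
    """Combine per-gene dictionaries from several sources into one record per gene."""
    merged = {}
    for source in sources:
        for gene_id, fields in source.items():
            merged.setdefault(gene_id, {}).update(fields)
    return merged

# Made-up example exports keyed by systematic yeast ORF names
sequence_db = {"YGR192C": {"chromosome": "VII", "orf_length_nt": 999}}
expression_db = {"YGR192C": {"mean_expression": 8.4},
                 "YAL001C": {"mean_expression": 2.1}}

print(merge_gene_records(sequence_db, expression_db))
```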

The biggest conclusion I drew from our conversations was this: Whoever can deliver these ingredients in a user-friendly package will be the one leading bioscience into the new millennium. This realization crystallized the vague bioinformatics ideas my Novamente colleagues and I had been tossing around. We began designing an ambitious bioinformatics software system called Biomind, which – if and when it’s completed -- will deploy the Novamente AI system toward the goal of helping biologists understand their experimental data in the context of the massive amount of general biological “background information” that now exists.

From a pure AI point of view, Biomind, like any practical application, is a bit of a digression from the straight road toward real AI. However, unlike any application we’d worked on before, all of us on the Novamente team find it tremendously scientifically fascinating in its own right. And Danny Hillis’s point about the value of practical applications is not to be overlooked. We are learning a lot right now by stress-testing Novamente cognition on bio databases.

One difference between Biomind work and pure Novamente “real AI” work is that the focus in the former case is not entirely on artificial intelligence, but just as much on intelligence augmentation – helping biological scientists to use their expertise and intuition to follow pathways to discovery. Biomind is not intended, in the short term, to think about biology better than the biologists do. It’s intended to integrate broad information from databases better than biologists do, so that rather than spending their time sifting through huge databases and journal paper archives, biologists can spend their time thinking about biology. (Which they will continue to do better than Biomind, at least for a decade or two or three!)

To get a little more nitty-gritty, what we’re planning with Biomind falls into two categories: database-building, and AI-powered data analysis.

The database-building part is the most straightforward. We plan to use Novamente to integrate the knowledge contained in public biological DB’s, creating a massive database called the BiomindDB. And then, on the analytical side, we are creating a set of data-mining processes called the Biomind Toolkit – whose techniques, far from just analyzing each dataset on its own in the manner of other bioinformatics products, will analyze each dataset in the context of all the information in the BiomindDB.

And, in the slightly longer term, we can build BiomindDB’s not only for public biological data, but also for the proprietary data of client firms. Right now, many of the more progressive pharmaceutical and genomics/proteomics research companies are involved in massive projects of internal database integration. Large amounts of money are being poured into these projects, but the end result is neither a body of new knowledge nor a new and better approach to drug discovery. Rather, the end result is a common user interface for diverse databases. This is a valuable thing, but what is perhaps most valuable about it is that it sets the stage for Biomind and similar systems. Once a biotech firm has enabled their scientists to talk to all their databases through a common interface, they are ready to allow their own data to transform their own discovery process. They are ready:

· To create a new database consisting of knowledge formed by combining pieces of information from their various databases (a private BiomindDB)

· To analyze their experimental data (such as gene expression and mass spec data) making full use of the biological background knowledge contained in both the public BiomindDB and their proprietary BiomindDB

What is really fascinating here is that the BiomindDB, if and when we complete it, will be a valuable biological database that does not contain any novel “primary information.” All the information in it will be derived indirectly either from other databases or from research papers. However, it will contain novel pieces of information that are synthesized by combining pieces of information found elsewhere. Indeed this is its reason for being.

Very basic examples of information found in the BiomindDB would be:

· Genes that are similar to each other (overall, or along specific “axes” such as having similar sequences, similar promoter elements, or similar involvements in pathways and regulatory networks)

· Proteins that are similar to each other (overall, or along specific “axes” as with genes)

The ability to submit a gene or protein (or a set of genes or proteins) as a query, along with an axis of similarity, and receive back a list of similar genes/proteins, is a simple but remarkably powerful functionality.
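
To give a feel for what such a query might look like in software, here is a minimal sketch, assuming a simple in-memory record format. The class, function, and gene names are invented for illustration and are not the actual BiomindDB interface.

```python
# Hypothetical sketch of a BiomindDB-style similarity query; the record format,
# function names, and gene names are invented for illustration, not the real API.

from dataclasses import dataclass, field

@dataclass
class GeneRecord:
    gene_id: str
    promoter_elements: set = field(default_factory=set)   # one similarity "axis"
    pathways: set = field(default_factory=set)             # another "axis"

def jaccard(a: set, b: set) -> float:
    """Set-overlap similarity; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similar_genes(db: dict, query_id: str, axis: str, top_n: int = 5):
    """Return the genes most similar to query_id along the chosen axis."""
    query = db[query_id]
    scores = []
    for gene_id, record in db.items():
        if gene_id == query_id:
            continue
        scores.append((jaccard(getattr(query, axis), getattr(record, axis)), gene_id))
    return sorted(scores, reverse=True)[:top_n]

# Toy usage
db = {
    "GENE_A": GeneRecord("GENE_A", pathways={"TOR", "stress response"}),
    "GENE_B": GeneRecord("GENE_B", pathways={"TOR"}),
    "GENE_C": GeneRecord("GENE_C", pathways={"cell cycle"}),
}
print(similar_genes(db, "GENE_A", axis="pathways"))   # GENE_B scores highest
```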

The BiomindDB will also contain more specific information, of course. To give the details would require a distractingly long biology lesson, but some examples for the bio-savvy reader would be:

· Transcription factors that are activated by transport from the cytoplasm to the nucleus

· Enzymes with specific cofactors that are expressed in response to starvation

· Genes that are induced in response to some but not all starvations or stresses

· Enzymes active in different pathways and that are activated through a specific signal transduction pathway

· Proteins with homologs in prokaryotic systems that interact and are required for survival under some conditions

· Proteins with homologs only in the fungi, that are coexpressed and interact with some of the same proteins

· Sets of interacting proteins, whose genes are all induced by the TOR pathway

All this is knowledge that a biology PhD could derive by reading the research literature and scanning relevant databases and keeping careful notes in a spreadsheet. But this kind of data surfing and integration can be very time-consuming and tedious, with the result that few scientists do it as thoroughly as they should. The task is often offloaded to assistants or (in the academic context) graduate students who lack the background knowledge to do a truly thorough and insightful job. Having this sort of integrative knowledge at one’s fingertips will be extremely valuable, and just as indispensable to the discovery path as a new source of primary physical data.

What will the Biomind Toolkit do with all this knowledge? It will do a huge number of things, varying based on the particular kind of experimental data being fed into it. The case we’ve worked with predominantly so far is gene expression data, generated by gene chips and spotted microarrays as discussed above.

Current toolkits for analyzing gene expression data, such as the excellent GeneSpring and Spotfire products, focus on the production of clusters or category models. Clusters are produced by statistical analysis of the expression profiles of genes, grouping together genes with similar expression profiles. Category models are produced when one provides the software with gene expression profiles of cells falling into different categories (e.g. cancerous vs. non-cancerous). A category model may tell you, for instance, that a certain set of 25 genes generally are more active in the cancerous cells than in the noncancerous cells.
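
As a rough illustration of these two kinds of analysis (not of GeneSpring or Spotfire themselves), the sketch below clusters a toy expression matrix with k-means and fits a simple category model separating cancerous from non-cancerous samples; the data is random and the parameters are arbitrary.

```python
# Illustration only (random data, arbitrary parameters): the shape of the two
# analyses described above -- k-means clustering of gene expression profiles,
# and a category model separating two classes of samples.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
expression = rng.normal(size=(100, 12))       # rows = genes, columns = samples

# Cluster genes with similar expression profiles
gene_clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(expression)

# Category model: suppose samples 0-5 are cancerous, 6-11 are non-cancerous
sample_matrix = expression.T                  # rows = samples, columns = genes
labels = np.array([1] * 6 + [0] * 6)
model = LogisticRegression(penalty="l1", solver="liblinear").fit(sample_matrix, labels)

# Genes with the largest coefficients are the ones the model uses to separate classes
important_genes = np.argsort(np.abs(model.coef_[0]))[::-1][:25]
print("cluster sizes:", np.bincount(gene_clusters))
print("top discriminating genes:", important_genes)
```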

The two big problems with these clustering and categorization tools are:

· The clusters and category models produced are often “wrong,” i.e., not biologically meaningful

· Even when meaningful, the clusters and categories found are often too large. For instance, having a set of 25 genes identified as important is not nearly as useful as having a set of 3-5 genes identified as important, because it may take weeks of wet lab work to explore each potentially important gene in detail.

Two of the main initial functions that we’re building into the Biomind Toolkit are aimed at overcoming these problems. The first problem is overcome by invoking knowledge from the BiomindDB to guide the clustering and category model building process. For instance, if it’s known that two genes are involved in many of the same metabolic or regulatory pathways, this should bias the clustering process to suspect that perhaps these genes should be clustered together.
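
One simple way such biasing could work – a sketch under invented assumptions, not the actual Biomind mechanism – is to shrink the expression-based distance between genes that the knowledge base says share pathways, and then cluster on the adjusted distances:

```python
# Sketch of knowledge-biased clustering (invented shrinkage factor and pathway map):
# shrink the expression-based distance between genes the knowledge base says share
# pathways, then run ordinary hierarchical clustering on the adjusted distances.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
expression = rng.normal(size=(30, 12))          # 30 genes x 12 conditions

# Hypothetical background knowledge: groups of genes known to share pathways
shared_pathway_groups = [{0, 1, 2}, {10, 11}]

distances = squareform(pdist(expression, metric="correlation"))
for group in shared_pathway_groups:
    for i in group:
        for j in group:
            if i != j:
                distances[i, j] *= 0.5          # pull known-related genes together

tree = linkage(squareform(distances, checks=False), method="average")
clusters = fcluster(tree, t=5, criterion="maxclust")
print(clusters)
```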

The second problem is overcome by what we call “post-clustering/categorization analysis.” Suppose clustering or category-model-building has produced a set of genes of likely interest. One then wants to explore this gene set in detail. The Toolkit will give you a report summarizing the surprising and significant properties of this set of genes. For instance, it may note that there are many more fungal essential genes in the set than would be expected. It may note that of the 25 genes in the set, 5 are probably in the same signal transduction pathway, and 3 others are extremely similar to elements of this set of 5, based on various criteria. It may identify another gene, not in the set, which produces a protein that interacts with the proteins generated by several genes in the set. In this way the user’s attention is guided to 9 genes rather than 25, one of which was not even produced by the original clustering/categorization algorithms. The user may then use their own knowledge to infer that, if the one extra gene suggested by the Toolkit is interesting, perhaps a couple of other similar genes may be interesting too. So they may search the BiomindDB to find genes similar to this recommended gene. This kind of post-clustering/categorization analysis can be done without the Toolkit and BiomindDB, but it may literally take days or weeks of tedious work.
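
The “more fungal essential genes than expected” check is, in essence, an over-representation test. Here is a minimal sketch using a hypergeometric test with made-up counts; it illustrates the statistical shape of such a report item, not the Toolkit’s actual implementation.

```python
# Sketch of one post-clustering report item: is a property ("fungal essential gene",
# say) over-represented in a candidate gene set relative to the genome? All counts
# below are made up; a hypergeometric test is the standard way to ask the question.

from scipy.stats import hypergeom

genome_size = 6000          # total genes considered
essential_in_genome = 1100  # genes with the property, genome-wide
set_size = 25               # genes in the candidate set
essential_in_set = 12       # genes in the set that have the property

# Probability of seeing at least this many essential genes in the set by chance
p_value = hypergeom.sf(essential_in_set - 1, genome_size, essential_in_genome, set_size)
print(f"enrichment p-value: {p_value:.3g}")
```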

 
 

 
[Figure: The VxInsight tool for visualizing gene expression data]
 

The Biomind Toolkit, as it currently exists in prototype form, has several different clustering methods built into it, including k-means, EM, and other standard techniques. It also has a less-well-known method that we have found to be uncommonly effective on gene expression data, which is force-directed clustering as embodied in the VxInsight product, a piece of software created at Sandia Labs and commercialized by Viswave Inc. This technique is particularly valuable in that it provides a user-friendly topographical visualization of gene expression data, in which each cluster of similar genes appears as a “hill” and related clusters are depicted as nearby hills. In our experience, this is a particularly intuitive metaphor for biologists to use to explore gene expression data, and we feel that this aspect of the Toolkit is sure to be a hit with customers. A landscape visualization of raw gene expression data, followed by a landscape visualization of the data as interpreted in the context of the BiomindDB, will be an extremely instructive part of the discovery process.
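
For readers who want the flavour of the force-directed idea, the sketch below is a crude stand-in (it is not the VxInsight algorithm): it links each gene to its most correlated neighbours and lets a spring layout pull co-expressed genes into nearby positions, so that dense regions of the layout play the role of the landscape’s “hills.”

```python
# Crude stand-in for the force-directed idea (not the VxInsight algorithm): link each
# gene to its most correlated neighbours and let a spring layout pull co-expressed
# genes into nearby 2-D positions; dense regions play the role of "hills."

import numpy as np
import networkx as nx

rng = np.random.default_rng(2)
expression = rng.normal(size=(60, 10))                 # 60 genes x 10 conditions
correlation = np.corrcoef(expression)                  # gene-gene correlation matrix

graph = nx.Graph()
graph.add_nodes_from(range(len(expression)))
for i in range(len(expression)):
    for j in np.argsort(correlation[i])[::-1][1:4]:    # 3 most correlated neighbours
        graph.add_edge(i, int(j), weight=float(correlation[i, j]))

positions = nx.spring_layout(graph, weight="weight", seed=0)   # force-directed layout
print(positions[0])                                    # 2-D coordinates for gene 0
```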

The third initial function of the Toolkit, approximate regulatory network inference, goes beyond what current products do. It is applicable specifically to time course gene expression data, e.g. data reporting the expression levels of genes at 20-100 points during the cell cycle. Currently not that many labs are creating time course data sets of sufficient length to enable meaningful approximate regulatory network inference, but the presence of a software tool capable of carrying out this sort of inference will have an influence on experimental methodology, and we expect that, following the release of the Toolkit, one will see such experiments being carried out routinely. The input here is a time course data set; and the output, after some Biomind AI Engine inference, is a set of probabilistic rules relating the expression levels of the genes in the experiment (for instance, “After these two genes are simultaneously expressed, then if this third gene is NOT expressed shortly thereafter, it’s likely that this fourth gene will be significantly expressed.”)
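
A heavily simplified illustration of this kind of rule mining – not Novamente’s actual inference – is to discretize a time-course matrix into on/off states and estimate, for a pair of genes, the probability that one is expressed a fixed lag after the other:

```python
# Heavily simplified illustration (not Novamente's inference): discretize a time-course
# matrix into on/off states and estimate, for a pair of genes, the probability that
# gene b is "on" one time step after gene a is "on".

import numpy as np

rng = np.random.default_rng(3)
timecourse = rng.normal(size=(50, 40))                       # 50 genes x 40 time points
on = timecourse > timecourse.mean(axis=1, keepdims=True)     # crude per-gene threshold

def lagged_rule_confidence(a: int, b: int, lag: int = 1) -> float:
    """P(gene b is 'on' at time t+lag | gene a is 'on' at time t)."""
    a_on = on[a, :-lag]
    b_later = on[b, lag:]
    if a_on.sum() == 0:
        return 0.0
    return float((a_on & b_later).sum() / a_on.sum())

print(lagged_rule_confidence(0, 1))
```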

Our experiments to date indicate that full regulatory network inference is definitely at the fringes of feasibility given the poor quality of available expression data. However, extracting valuable approximate information about regulatory networks is a much more approachable task. Novamente compensates for poor data quality by integrating diverse background data into the analytic process. It seeks, not an exact picture of the regulatory network, but rather a rough probabilistic approximation to network dynamics, expressed in terms of nodes called CompoundRelationNodes that capture logical and arithmetic relationships among gene expression levels in single or multiple experiments.

The post-clustering/categorization analysis and clustering/categorization-enhancement functions of the Biomind Toolkit do not require direct invocation of the Novamente AI system. They require only local processing activity on the user’s machine, and access to the global BiomindDB. Regulatory network inference, on the other hand, requires Novamente activity – hard thinking. There’s no getting around this.

And the next step, after implementing the BiomindDB and the Toolkit as fully professional software products, will be to extend the Toolkit functions to the cross-dataset level – because, as valuable as the analysis of experimental data in the context of general biological background knowledge is, it is only a start. Many large labs conduct large sets of interrelated experiments, and in some cases the patterns in one data set may only be fully detectable against the background of the other data sets. This kind of analysis requires the construction of a client-side BiomindDB containing at minimum the data from a number of different experiments. A local Biomind AI Engine can then be set to work analyzing this data, reporting emergent patterns of various sorts.

Clusters and categories can be found, based on considering each gene’s set of expression profiles across multiple experiments. And the inference of potential regulatory patterns is yet more powerful here than in the single-experiment case, because interaction patterns that are dependent on context can be inferred. And it is here, in the analysis of multiple experiments in the context of one another and of general biological background knowledge, that the face of 21st-century biodata analysis can truly be seen.
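
A toy version of the cross-dataset idea, under invented assumptions: standardize each experiment separately, concatenate each gene’s profiles across experiments, and cluster on the combined profile, so that patterns visible only across experiments have a chance to emerge.

```python
# Toy cross-dataset sketch (random data, same 80 genes in both experiments): standardize
# each experiment separately, concatenate each gene's profiles, and cluster on the
# combined profile. A real pipeline would also handle missing genes and batch effects.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
experiments = [rng.normal(size=(80, 8)), rng.normal(size=(80, 15))]

standardized = []
for data in experiments:
    z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
    standardized.append(z)

combined = np.hstack(standardized)            # 80 genes x 23 combined conditions
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(combined)
print(np.bincount(clusters))
```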

Novamente’s particular strength in this kind of application is its ability to integrate different types of data to arrive at overall judgments. This is what, we believe, will enable it to solve the “islands of information” issue plaguing the life sciences today. When datasets are so diverse and are spread across different locations, and reside on architecturally incompatible platforms, it is extremely difficult to leverage value from the body of knowledge as a whole. By actively learning, understanding and abstracting inferences from data, regardless of its native structure, Novamente can assemble a cogent universe of knowledge out of a disparate collection of databases.

Although we’re still at the phase of prototyping and experimentation, a fair bit of work has already gone into tailoring Novamente for this sort of application. We have added such things as GeneNodes, GeneSequenceNodes, ProteinNodes and CellNodes, structures that provide the system with a framework on which to construct its knowledge of the biological domain. These nodes don’t contain detailed knowledge about biological systems, but they are associated with learning methods that are particularly effective for learning about data regarding the biological entities they represent.
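
Purely to illustrate the idea of domain-specific node types, here is a hypothetical sketch; the names and fields are invented and do not reflect the actual Novamente data structures. The point is that each node type is a hook for learning methods suited to its kind of biological entity, not a container of hard-coded biological facts.

```python
# Hypothetical illustration of domain-specific node types; names and fields are
# invented and do not reflect the actual Novamente implementation.

from dataclasses import dataclass, field

@dataclass
class GeneNode:
    name: str
    expression_profiles: list = field(default_factory=list)   # filled in from microarray data

@dataclass
class ProteinNode:
    name: str
    interaction_partners: set = field(default_factory=set)     # filled in from interaction data

@dataclass
class CellNode:
    cell_type: str
    genes: list = field(default_factory=list)                  # GeneNodes observed in this cell type

gene = GeneNode("GENE_X", expression_profiles=[[1.2, 0.4, 2.0]])
protein = ProteinNode("PROTEIN_X", interaction_partners={"PROTEIN_Y"})
print(gene, protein)
```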

Finally, although we don’t want to center our bioinformatics work around natural language processing, it can’t be avoided entirely; there is too much practical need for it. We have designed a tool we call BioSpyder, whose purpose is to scan a collection of documents containing biological information and output a set of formal, “mathematical” Novamente relationships embodying the knowledge in these documents. In the first version only English-language documents will be usable as input; extension to other languages will require substantially more development time. Of course, until the EIL process has given Novamente a real understanding of biologically relevant semantics and syntax, this process will be highly imperfect. Even so, though, we believe our technology will allow us to do a much better job than any competing “biological information extraction” system existing in commerce or academia. Using computational linguistics algorithms similar to technology prototyped at Webmind Inc., combined with Novamente, we believe we’ll be able to extract not only simple information like which genes are associated with each other and which genes are involved in the same signal transduction pathways, but also more complex relationships such as protein-protein interactions that occur only in certain contexts, the distinction between pathway relationships that are conjectured versus those that are fairly solidly demonstrated, and so on.
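
As a toy illustration of the easier end of what BioSpyder aims at – far cruder than the computational-linguistics approach described above – one can already pull simple gene-gene association statements out of text with pattern matching. The sentence and pattern below are invented; the point is only the shape of the output: formal relationships extracted from prose.

```python
# Toy illustration of the easy end of biological text mining (far cruder than the
# computational-linguistics approach described above): a regex pulls "X interacts
# with Y" statements out of an invented sentence and emits formal relationships.

import re

sentence = ("Our results suggest that RAD9 interacts with RAD53 "
            "and that both are induced by the TOR pathway.")

# Naive pattern: "<GENE> interacts with <GENE>", gene names as capitalized tokens
pattern = re.compile(r"\b([A-Z][A-Z0-9]{2,})\s+interacts with\s+([A-Z][A-Z0-9]{2,})\b")

relations = [("interacts_with", a, b) for a, b in pattern.findall(sentence)]
print(relations)   # [('interacts_with', 'RAD9', 'RAD53')]
```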

All this will be a lot of work, and as I noted above, this work may or may not get done, depending on many factors, including other projects that may come our way and the availability of bioinformatics-oriented funding. But if we don’t do this stuff, somebody else will – not exactly the same way, but ultimately with a similar effect. The convergence of bioinformatics and AI isn’t going to stop in 2002, nor is the centrality of computational techniques to biological discovery. The volume and complexity of data generated by microarrays and other modern experimental biology tools is simply beyond the scope of the human brain. AI is needed to make sense of it, and make sense of it AI will.

 

Of course, all this talk about gene expression data and the like is just barely scratching the surface of 21st-century biology. What we’re really moving towards is what biologists call systems biology – an understanding of the self-organizing wholeness of biological systems like cells, organs and organisms. Sequencing genomes is one step in this direction, gathering gene expression data is another, gathering protein expression data is another (still more primitive), and yet further technologies will no doubt bring us yet more insightful data as time goes on. Today biology is still largely a science of disorganized, loosely connected factoids, and vague, subtle, intuitive trends. But there may well be a mini-Singularity of biological knowledge that occurs when all this micro-level data, fed into a BiomindDB or something similar, is enough to bridge the micro-macro gap.

Specifically, one may foresee two coming singularities of bioknowledge. The first will be when analysis of gene and protein expression, pathways, regulatory networks and related phenomena allow us to really understand the cell. The gap between the “macro” level of the cell and the “micro” level of its component molecules will be closed. The properties of cells will be concretely comprehensible as emergent patterns from molecular dynamics. We’re certainly not at this point yet, but it could well happen in the next 5 years, via Biomind technology or something similar.

The second bioknowledge singularity will be when cellular knowledge comes together to form organismic knowledge. My intuition, for what it’s worth in this domain, is that this singularity will actually be easier than the first one. The step from cells to organisms is a tough one, but it may not be quite as perversely subtle as the step from molecules to cells – because it all happens at a relatively macroscopic scale, where things are easier to measure and to intuitively understand.

AI will lead us to these singularities, hand in hand with human biologists, and it’s going to be quite an adventure. It sounds ironic and in some ways a bit peculiar, but my guess is that one of the major roles of nonhuman and later superhuman digital intelligence, as the path to posthumanity unfolds, will be to help understand the subtle emergent properties of the fabulously complex molecular machinery that makes us humans what we are – and to help us figure out how to make this machinery work better, ridding us of pesky little problems like cancer and aging. This is what biology needs, and in a less direct sense it may be what AI needs as well. I cannot think of a better way to encourage cooperation between humans and AI’s, and to nudge incipient superintelligent AI’s to think of human lives as valuable.