User:WillWare/Automation of science
The reason for wanting to automate the scientific process is to hasten progress in science. This is particularly important for the advancement of medicine, because I'm getting older and I want to accelerate the arrival of medical technologies that might help me live longer.
With that aim in view, computers should
- look for patterns in data (data mining)
- propose falsifiable hypotheses
- design experiments to test hypotheses
- perform experiments & collect data
- confirm/deny hypotheses
- mine the new data for new patterns, and repeat (a toy sketch of this loop follows)
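To make the loop concrete, here is a minimal, self-contained Python sketch. It is a toy under an invented assumption: the "world" obeys a hidden linear law, "experiments" are noisy queries to it, and a hypothesis is a fitted slope and intercept. None of the names or code come from Adam or Eurisko; they only illustrate the mine / hypothesize / test / repeat cycle.

```python
import random

def world(x):
    """The lab apparatus: measurements of a hidden law y = 3x - 2, plus noise."""
    return 3.0 * x - 2.0 + random.gauss(0, 0.1)

def mine_pattern(data):
    """'Data mining' in this toy: a least-squares fit of slope a and intercept b."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in data)
    den = sum((x - mean_x) ** 2 for x, _ in data)
    a = num / den
    return a, mean_y - a * mean_x

def design_experiment(data):
    """Probe a region where we have no observations yet."""
    return max(x for x, _ in data) + 1.0

def consistent(hypothesis, x, y, tol=0.5):
    """Confirm/deny: does the hypothesis predict the new measurement?"""
    a, b = hypothesis
    return abs((a * x + b) - y) < tol

random.seed(1)
data = [(x, world(x)) for x in (0.0, 1.0, 2.0)]      # initial data set
for cycle in range(5):
    hypothesis = mine_pattern(data)                  # propose a falsifiable hypothesis
    x_new = design_experiment(data)                  # design an experiment to test it
    y_new = world(x_new)                             # perform it and collect data
    print(cycle, hypothesis, consistent(hypothesis, x_new, y_new))
    data.append((x_new, y_new))                      # mine the new data, repeat
```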
Precedents
Eurisko
Doug Lenat and Eurisko gained notoriety by submitting the winning fleet (a large number of stationary, heavily armed, defenseless ships) to the United States Traveller TCS national championship in 1981, forcing extensive changes to the game's rules.
The story has circulated widely and is quite famous. Unfortunately, Lenat never released the source code for Eurisko, so an open-source version would be a welcome thing; that idea was an early stimulus of my thinking about the automation of science.
Adam the "Robot Scientist"
Reported in April 2009 by Ross King at Aberystwyth University, Adam uses lab automation to perform experiments and data mining to find patterns in the resulting data. It developed novel genomics hypotheses about the yeast S. cerevisiae and tested them; its conclusions were manually confirmed by human experimenters and found to be correct.
Eureqa
Eureqa is a software tool for detecting equations and hidden mathematical relationships in data. Its primary goal is to identify the simplest mathematical formulas which could describe the underlying mechanisms that produced the data. Eureqa is free to download and use, but it is not open source. So we need an open source equivalent. Luckily the ideas behind Eureqa are laid out pretty plainly.
Eureqa generates a parsimonious curve-fitting function for a set of data; genetic programming appears to be the preferred way to do this. Hod Lipson discussed Eureqa in his talk, and his comment was that it's easy to generate such a mathematical model but harder to invent an explanatory theory for it.
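To make the parsimony idea concrete, here is a toy Python sketch. It is not Eureqa's algorithm: Eureqa evolves formulas with genetic programming, whereas this sketch merely scores randomly generated expression trees by prediction error plus a complexity penalty, which is the part of the objective that matters here.

```python
import random

# Candidate formulas are tiny expression trees over x and constants.
OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def random_expr(depth=3):
    """Random tree: ("op", left, right), the variable "x", or a constant."""
    if depth == 0 or random.random() < 0.3:
        return "x" if random.random() < 0.5 else round(random.uniform(-3, 3), 1)
    op = random.choice(list(OPS))
    return (op, random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    if expr == "x":
        return x
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return expr                      # a numeric constant

def size(expr):
    """Complexity = node count; this is what the parsimony penalty uses."""
    return 1 + size(expr[1]) + size(expr[2]) if isinstance(expr, tuple) else 1

def score(expr, xs, ys, parsimony=0.05):
    """Mean squared error plus a penalty for formula size: smaller is better."""
    mse = sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + parsimony * size(expr)

random.seed(0)
xs = [i / 10 for i in range(-20, 21)]
ys = [2 * x + 1 + random.gauss(0, 0.05) for x in xs]   # hidden law: y = 2x + 1, plus noise
best = min((random_expr() for _ in range(5000)), key=lambda e: score(e, xs, ys))
print("best formula found:", best)
```

A real system would evolve the candidate population rather than sample blindly, but the error-plus-complexity score is the essential ingredient of "simplest formula that fits".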
It may turn out to be very difficult to automate the creation of explanatory theories. Conceivably a machine might mine data, discover a pattern, and formulate a model for the pattern, but then ask for human intervention in coming up with an explanatory theory. The human might provide that leap of insight, and then figure out some predictions of the new theory, which might also require human insight. As long as all this is expressed in machine-readable language, there is still a big role for the machine, and the prospect that future machines might be able to provide insight.
- What Eureqa does is regression, of which linear regression is a special case. This suggests a sequence of steps for proposing a hypothesis:
- Collect data from experiments done by humans or by robots.
- Study the data to identify mathematical patterns; this is regression, so use Eureqa (or an open-source equivalent) here.
- Invent explanatory stories about why we see those particular patterns, drawing on previous explanations for related observed patterns.
- Figure out what testable predictions would follow from those stories, and design experiments to test those predictions (a sketch of one way to pick a discriminating experiment follows this list).
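One simple, hypothetical way to mechanize that last step: when two explanatory stories both fit the existing data, run the feasible experiment where their predictions diverge the most, so the outcome is guaranteed to count against one of them. The two models below are arbitrary placeholders, not drawn from any real study.

```python
def model_a(x):
    return 2.0 * x + 1.0          # story A: a linear law

def model_b(x):
    return x * x + 1.0            # story B: a quadratic law

def most_discriminating_point(candidates):
    """Pick the x where the two stories make the most different predictions."""
    return max(candidates, key=lambda x: abs(model_a(x) - model_b(x)))

feasible_xs = [i / 4 for i in range(0, 17)]   # measurement points we can actually run
x_star = most_discriminating_point(feasible_xs)
print("run the experiment at x =", x_star)    # here: x = 4.0, where |2x+1 - (x^2+1)| is largest
```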
What next?
Adam is designed to work alone with no connection to the broader scientific literature. It is confined to a very narrow problem domain. To broaden the effort, we need an ontology (ideally a widely recognized standard) for machine-parseable sharing of elements of the scientific reasoning process: data sets, hypotheses, predictions, deduction, induction, statistical inference, and the design of experiments.
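As a purely illustrative sketch of what such a shared vocabulary might cover, here are some Python dataclasses. Every class and field name here is invented; a real standard would presumably be expressed in RDF/OWL so that independently built systems could exchange these objects.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Dataset:
    uri: str                                  # stable identifier other agents can cite
    description: str
    rows: List[tuple] = field(default_factory=list)

@dataclass
class Hypothesis:
    uri: str
    statement: str                            # a falsifiable claim in a controlled vocabulary
    supporting_data: List[str] = field(default_factory=list)   # URIs of Datasets
    refuting_data: List[str] = field(default_factory=list)

@dataclass
class Prediction:
    uri: str
    follows_from: str                         # URI of the Hypothesis
    expected_observation: str

@dataclass
class Experiment:
    uri: str
    tests: str                                # URI of the Prediction being tested
    protocol: str
    result_dataset: str = ""                  # filled in after the experiment runs

# Example objects (the claim is a generic placeholder, not a real result).
h = Hypothesis(uri="ex:hypothesis/1",
               statement="deleting gene X slows growth on galactose medium")
p = Prediction(uri="ex:prediction/1", follows_from=h.uri,
               expected_observation="knockout strain doubling time > wild type")
e = Experiment(uri="ex:experiment/1", tests=p.uri,
               protocol="grow both strains in galactose, measure OD600 over 24 h")
print(h, p, e, sep="\n")
```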
We need versions of Adam designed for other problem domains, and to the extent possible they should share common vocabulary so that they don't work in isolation.
In the long term, we want a world where machine theoreticians and machine experimentalists collaborate with their human counterparts, in a process that makes the best use of the unique intellectual strengths of each.
Reasoning scenarios
- Pure symbolic logic (no probabilities or confidence levels)
- Hypotheses with blanket probabilities
  - each hypothesis describes a world; each world has logic propositions, but no probabilities
  - use empirical evidence to update the blanket probabilities (see the sketch after this list)
- Assign probabilities to individual propositions
  - Statistical inference replaces logical deduction
- Get smart about the role of uncertainty
  - Work with noisy analog data
  - Get smart about signal processing, probability distributions
  - Study the noise to look for deeper structures
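Here is the promised sketch of the "blanket probabilities" scenario: each hypothesis carries a single probability, updated by Bayes' rule as evidence arrives. The hypotheses and likelihood numbers are made up for illustration.

```python
def bayes_update(priors, likelihoods):
    """priors: {hypothesis: P(H)}, likelihoods: {hypothesis: P(evidence | H)}."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

priors = {"H1: enzyme E catalyzes step S": 0.5,
          "H2: step S is non-enzymatic":   0.5}
evidence_likelihoods = {"H1: enzyme E catalyzes step S": 0.8,   # P(observed rate | H1)
                        "H2: step S is non-enzymatic":   0.2}   # P(observed rate | H2)
posteriors = bayes_update(priors, evidence_likelihoods)
print(posteriors)   # with these made-up numbers, H1 rises to 0.8 and H2 falls to 0.2
```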
Semantic markup for existing scientific and medical literature
- Immediately useful for constructing a semantic search engine for medicine and research (a sketch of such markup follows this list)
- Motivates development of the science ontology
- Machines should eventually publish journal articles
- Maybe this will tell us something interesting about how humans do science
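A sketch of what the markup for one sentence of a paper might carry, written as plain subject/predicate/object triples in Python. The vocabulary terms and the example claim are invented, not taken from any existing ontology or publication.

```python
sentence = "Deletion of gene ABC1 reduced growth rate by 40% on glucose medium."

triples = [
    ("ex:claim/1", "rdf:type",          "sci:ExperimentalResult"),
    ("ex:claim/1", "sci:intervention",  "ex:gene/ABC1-deletion"),
    ("ex:claim/1", "sci:observable",    "ex:phenotype/growth-rate"),
    ("ex:claim/1", "sci:effectSize",    "-0.40"),
    ("ex:claim/1", "sci:condition",     "ex:medium/glucose"),
    ("ex:claim/1", "sci:supportsText",  sentence),
]

# A semantic search engine would index triples rather than raw text, so
# "what interventions change growth rate?" becomes a structured query.
matches = [s for (s, p, o) in triples
           if p == "sci:observable" and o == "ex:phenotype/growth-rate"]
print(matches)   # ['ex:claim/1']
```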
Fund long-term work by monetizing near-term work
IANAVC (I am not a venture capitalist), but maybe one of these would work...
- Semantic search engine for doctors and researchers
- Build an oracle, win bets - politics, finance, climate
- Dual-license it and charge for commercial use
- Offer consulting services