More on the various tribes’ paths to the Master Algorithm in the corresponding sections below.
Chapter Three
Hume’s classic formulation of the problem of induction appears in Volume I of A Treatise of Human Nature (1739). David Wolpert derives his “no free lunch” theorem for induction in “The lack of a priori distinctions between learning algorithms”* (Neural Computation, 1996). I discuss the importance of prior knowledge in machine learning in “Toward knowledge-rich data mining”* (Data Mining and Knowledge Discovery, 2007) and misinterpretations of Occam’s razor in “The role of Occam’s razor in knowledge discovery”* (Data Mining and Knowledge Discovery, 1999). Overfitting is one of the main themes of The Signal and the Noise, by Nate Silver (Penguin Press, 2012), who calls it “the most important scientific problem you’ve never heard of.” “Why most published research findings are false,”* by John Ioannidis (PLoS Medicine, 2005), discusses the problem of mistaking chance findings for true ones in science. Yoav Benjamini and Yosef Hochberg propose a way to combat it in “Controlling the false discovery rate: A practical and powerful approach to multiple testing”* (Journal of the Royal Statistical Society, Series B, 1995). The bias-variance decomposition is presented in “Neural networks and the bias/variance dilemma,” by Stuart Geman, Elie Bienenstock, and René Doursat (Neural Computation, 1992). “Machine learning as an experimental science,” by Pat Langley (Machine Learning, 1988), discusses the role of experimentation in machine learning.
William Stanley Jevons first proposed viewing induction as the inverse of deduction in The Principles of Science (1874). The paper “Machine learning of first-order predicates by inverting resolution,”* by Steve Muggleton and Wray Buntine (Proceedings of the Fifth International Conference on Machine Learning, 1988), initiated the use of inverse deduction in machine learning. The book Relational Data Mining,* edited by Sašo Džeroski and Nada Lavrač (Springer, 2001), is an introduction to the field of inductive logic programming, where inverse deduction is studied. “The CN2 Induction Algorithm,”* by Peter Clark and Tim Niblett (Machine Learning, 1989), summarizes some of the main Michalski-style rule induction algorithms. The rule-mining approach used by retailers is described in “Fast algorithms for mining association rules,”* by Rakesh Agrawal and Ramakrishnan Srikant (Proceedings of the Twentieth International Conference on Very Large Data Bases, 1994). An example of rule induction for cancer prediction is described in “Carcinogenesis predictions using inductive logic programming,” by Ashwin Srinivasan, Ross King, Stephen Muggleton, and Michael Sternberg (Intelligent Data Analysis in Medicine and Pharmacology, 1997).
The two leading decision tree learners are presented in C4.5: Programs for Machine Learning,* by J. Ross Quinlan (Morgan Kaufmann, 1992), and Classification and Regression Trees,* by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone (Chapman and Hall, 1984). “Real-time human pose recognition in parts from single depth images,”* by Jamie Shotton et al. (Communications of the ACM, 2013), explains how Microsoft’s Kinect uses decision trees to track gamers’ motions. “Competing approaches to predicting Supreme Court decision making,” by Andrew Martin et al. (Perspectives on Politics, 2004), describes how decision trees beat legal experts at predicting Supreme Court votes and shows the decision tree for Justice Sandra Day O’Connor.
Allen Newell and Herbert Simon formulated the hypothesis that all intelligence is symbol manipulation in “Computer science as empirical inquiry: Symbols and search” (Communications of the ACM, 1976). David Marr proposed his three levels of information processing in Vision* (Freeman, 1982). Machine Learning: An Artificial Intelligence Approach,* edited by Ryszard Michalski, Jaime Carbonell, and Tom Mitchell (Tioga, 1983), gives a snapshot of the early days of symbolist research in machine learning. “Connectionist AI, symbolic AI, and the brain,”* by Paul Smolensky (Artificial Intelligence Review, 1987), gives a connectionist view of symbolist models.
Chapter Four
Sebastian Seung’s Connectome (Houghton Mifflin Harcourt, 2012) is an accessible introduction to neuroscience, connectomics, and the daunting challenge of reverse engineering the brain. Parallel Distributed Processing,* edited by David Rumelhart, James McClelland, and the PDP research group (MIT Press, 1986), is the bible of connectionism in its 1980s heyday. Neurocomputing,* edited by James Anderson and Edward Rosenfeld (MIT Press, 1988), collates many of the classic connectionist papers, including: McCulloch and Pitts on the first models of neurons; Hebb on Hebb’s rule; Rosenblatt on perceptrons; Hopfield on Hopfield networks; Ackley, Hinton, and Sejnowski on Boltzmann machines; Sejnowski and Rosenberg on NETtalk; and Rumelhart, Hinton, and Williams on backpropagation. “Efficient backprop,”* by Yann LeCun, Léon Bottou, Genevieve Orr, and Klaus-Robert Müller, in Neural Networks: Tricks of the Trade, edited by Genevieve Orr and Klaus-Robert Müller (Springer, 1998), explains some of the main tricks needed to make backprop work.
Neural Networks in Finance and Investing,* edited by Robert Trippi and Efraim Turban (McGraw-Hill, 1992), is a collection of articles on financial applications of neural networks. “Life in the fast lane: The evolution of an adaptive vehicle control system,” by Todd Jochem and Dean Pomerleau (AI Magazine, 1996), describes the ALVINN self-driving car project. Paul Werbos’s PhD thesis is Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences* (Harvard University, 1974). Arthur Bryson and Yu-Chi Ho describe their early version of backprop in Applied Optimal Control* (Blaisdell, 1969).
Learning Deep Architectures for AI,* by Yoshua Bengio (Now, 2009), is a brief introduction to deep learning. The problem of error signal diffusion in backprop is described in “Learning long-term dependencies with gradient descent is difficult,”* by Yoshua Bengio, Patrice Simard, and Paolo Frasconi (IEEE Transactions on Neural Networks, 1994). “How many computers to identify a cat? 16,000,” by John Markoff (New York Times, 2012), reports on the Google Brain project and its results. Convolutional neural networks, the current deep learning champion, are described in “Gradient-based learning applied to document recognition,”* by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (Proceedings of the IEEE, 1998). “The $1.3B quest to build a supercomputer replica of a human brain,” by Jonathon Keats (Wired, 2013), describes the European Union’s brain modeling project. “The NIH BRAIN Initiative,” by Thomas Insel, Story Landis, and Francis Collins (Science, 2013), describes the BRAIN Initiative.
Steven Pinker summarizes the symbolists’ criticisms of connectionist models in Chapter 2 of How the Mind Works (Norton, 1997). Seymour Papert gives his take on the debate in “One AI or Many?” (Daedalus, 1988). The Birth of the Mind, by Gary Marcus (Basic Books, 2004), explains how evolution could give rise to the human brain’s complex abilities.