The need for weighting the word probabilities in speech recognition is discussed in Section 9.6 of Speech and Language Processing,* by Dan Jurafsky and James Martin (2nd ed., Prentice Hall, 2009). My paper on Naïve Bayes, with Mike Pazzani, is “On the optimality of the simple Bayesian classifier under zero-one loss”* (Machine Learning, 1997; expanded journal version of the 1996 conference paper). Judea Pearl’s book,* mentioned above, discusses Markov networks along with Bayesian networks. Markov networks in computer vision are the subject of Markov Random Fields for Vision and Image Processing,* edited by Andrew Blake, Pushmeet Kohli, and Carsten Rother (MIT Press, 2011). Markov networks that maximize conditional likelihood were introduced in “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,”* by John Lafferty, Andrew McCallum, and Fernando Pereira (International Conference on Machine Learning, 2001).
The history of attempts to combine probability and logic is surveyed in a 2003 special issue* of the Journal of Applied Logic devoted to the subject, edited by Jon Williamson and Dov Gabbay. “From knowledge bases to decision models,”* by Michael Wellman, John Breese, and Robert Goldman (Knowledge Engineering Review, 1992), discusses some of the early AI approaches to the problem.
Chapter Seven
Frank Abagnale details his exploits in his autobiography, Catch Me If You Can, cowritten with Stan Redding (Grosset & Dunlap, 1980). The original technical report on the nearest-neighbor algorithm by Evelyn Fix and Joe Hodges is “Discriminatory analysis: Nonparametric discrimination: Consistency properties”* (USAF School of Aviation Medicine, 1951). Nearest Neighbor (NN) Norms,* edited by Belur Dasarathy (IEEE Computer Society Press, 1991), collects many of the key papers in this area. Locally linear regression is surveyed in “Locally weighted learning,”* by Chris Atkeson, Andrew Moore, and Stefan Schaal (Artificial Intelligence Review, 1997). The first collaborative filtering system based on nearest neighbors is described in “GroupLens: An open architecture for collaborative filtering of netnews,”* by Paul Resnick et al. (Proceedings of the 1994 ACM Conference on Computer-Supported Cooperative Work, 1994). Amazon’s collaborative filtering algorithm is described in “Amazon.com recommendations: Item-to-item collaborative filtering,”* by Greg Linden, Brent Smith, and Jeremy York (IEEE Internet Computing, 2003). (See Chapter 8’s further readings for Netflix’s.) Recommender systems’ contribution to Amazon and Netflix sales is referenced in, among others, Mayer-Schönberger and Cukier’s Big Data and Siegel’s Predictive Analytics (cited earlier). The 1967 paper by Tom Cover and Peter Hart on nearest-neighbor’s error rate is “Nearest neighbor pattern classification”* (IEEE Transactions on Information Theory).
The curse of dimensionality is discussed in Section 2.5 of The Elements of Statistical Learning,* by Trevor Hastie, Rob Tibshirani, and Jerry Friedman (2nd ed., Springer, 2009). “Wrappers for feature subset selection,”* by Ron Kohavi and George John (Artificial Intelligence, 1997), compares attribute selection methods. “Similarity metric learning for a variable-kernel classifier,”* by David Lowe (Neural Computation, 1995), is an example of a feature weighting algorithm.
“Support vector machines and kernel methods: The new generation of learning machines,”* by Nello Cristianini and Bernhard Schölkopf (AI Magazine, 2002), is a mostly nonmathematical introduction to SVMs. The paper that started the SVM revolution was “A training algorithm for optimal margin classifiers,”* by Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik (Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992). The first paper applying SVMs to text classification was “Text categorization with support vector machines,”* by Thorsten Joachims (Proceedings of the Tenth European Conference on Machine Learning, 1998). Chapter 5 of An Introduction to Support Vector Machines,* by Nello Cristianini and John Shawe-Taylor (Cambridge University Press, 2000), is a brief introduction to constrained optimization in the context of SVMs.
Case-Based Reasoning,* by Janet Kolodner (Morgan Kaufmann, 1993), is a textbook on the subject. “Using case-based retrieval for customer technical support,”* by Evangelos Simoudis (IEEE Expert, 1992), explains its application to help desks. IPsoft’s Eliza is described in “Rise of the software machines” (Economist, 2013) and on the company’s website. Kevin Ashley explores case-based legal reasoning in Modeling Legal Arguments* (MIT Press, 1991). David Cope summarizes his approach to automated music composition in “Recombinant music: Using the computer to explore musical style” (IEEE Computer, 1991). Dedre Gentner proposed structure mapping in “Structure mapping: A theoretical framework for analogy”* (Cognitive Science, 1983). “The man who would teach machines to think,” by James Somers (Atlantic, 2013), discusses Douglas Hofstadter’s views on AI.
The RISE algorithm is described in my paper “Unifying instance-based and rule-based induction”* (Machine Learning, 1996).
Chapter Eight
The Scientist in the Crib, by Alison Gopnik, Andy Meltzoff, and Pat Kuhl (Harper, 1999), summarizes psychologists’ discoveries about how babies and young children learn.
The k-means algorithm was originally proposed by Stuart Lloyd at Bell Labs in 1957, in a technical report entitled “Least squares quantization in PCM”* (which later appeared as a paper in the IEEE Transactions on Information Theory in 1982). The original paper on the EM algorithm is “Maximum likelihood from incomplete data via the EM algorithm,”* by Arthur Dempster, Nan Laird, and Donald Rubin (Journal of the Royal Statistical Society B, 1977). Hierarchical clustering and other methods are described in Finding Groups in Data: An Introduction to Cluster Analysis,* by Leonard Kaufman and Peter Rousseeuw (Wiley, 1990).
Principal-component analysis is one of the oldest techniques in machine learning and statistics, having been first proposed by Karl Pearson in 1901 in the paper “On lines and planes of closest fit to systems of points in space”* (Philosophical Magazine). The type of dimensionality reduction used to grade SAT essays was introduced by Scott Deerwester et al. in the paper “Indexing by latent semantic analysis”* (Journal of the American Society for Information Science, 1990). Yehuda Koren, Robert Bell, and Chris Volinsky explain how Netflix-style collaborative filtering works in “Matrix factorization techniques for recommender systems”* (IEEE Computer, 2009). The Isomap algorithm was introduced in “A global geometric framework for nonlinear dimensionality reduction,”* by Josh Tenenbaum, Vin de Silva, and John Langford (Science, 2000).