No Myths...Only a Beginning
``The linguistic content of our program thus far is scant indeed. It is limited to one set of rules for analyzing a string of characters into a string of words, and another set of rules for analyzing a string of words into a string of sentences. Doubtless even these can be recast in terms of some information theoretic objective function. But it is not our intention to ignore linguistics, neither to replace it. Rather, we hope to enfold it in the embrace of a secure probabilistic framework so that the two together may draw strength from one another and guide us to better natural language processing systems...''
From The Mathematics of Statistical Machine Translation
Basic Concepts from Probability and Statistics: Laws of probability, Bayes' theorem, maxium likelihood, estimators (variance, bias, consistency, efficiency), learning curves, overfitting curves, model selection: penalization, cross validation.
References:Basic Concepts from Information Theory: Properties of entropy, Kullback-Leibler divergence, mutual information, data processing inequality, compression and coding, arithmetic coding, Shannon game, the source-channel model. Applications: speech, translation, spelling correction, OCR, information retrieval.
- T. Ferguson, Mathematical Statistics: A Decision Theoretic Approach, Academic Press, 1967
- E. Lehmann and G. Casella, Theory of Point Estimation, Springer, 1998
- P. Bickel and K. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics, Prentice Hall, 2000
- G. Casella and R. Berger, Statistical Inference, Wadsworth Pub Co, 2001
- T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Springer, 2001
References:Statistics and linguistic, linguistic essentials: part of speech, morphology, phrase structure, semantics, pragmatics.
- Claude Shannon. A mathematical theory of communication. Bell System Technical Journal, 27, pp. 379-423 and 623-656, 1948.
- Claude Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30, pp. 50-64, 1951.
- P. Brown, S. Della Pietra, V. Della Pietra, J. Lai and R. Mercer. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1), pp. 31-40, 1992.
- T. Cover and R. King. A convergent gambling estimate of the entropy of English, IEEE Trans. on Information Theory, 24(4), pp. 413-421, 1978.
- See also the Cover and Thomas text
- Stanford Information Theory Class.
- MIT Information Theory Class.
- Information Theory on the Web.
References:
- Steven Abney. Statistical methods and linguistics. The Balancing Act, J. Klavans and P. Resnik, eds, MIT Press, 1996
- Steven Abney. Statistical methods. Encyclopedia of Cognitive Science, Nature Publishing Group, Macmillian, 2002
- J. Bellegarda. Robustness in statistical language modeling: review and perspectives. Robustness in Languages and Speech Technology, J. Junqua and G. van Noods (Editors), Kluwer Academic Publishers, 2001.
- C. Manning. Probabilistic syntax. Probabilistic Linguistics, Rens Bod, Jennifer Hay, and Stefanie Jannedy (Editors), MIT Press, 2002.
- D. Mumford. The dawning of the age of stochasticity. 1999
- F. Pereira. Formal grammar and information theory: together again?. Philosophical Transactions of the Royal Society, 358(1769):1239-1253, April 2000.
- R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here?. Proceedings of the IEEE, 88(8), pp. 1270-1278, 2000.
- Alan M. Turing. Computing machinery and intelligence. Mind, pp. 433-460, 1950.
Entropy rate of a stochastic processe, Breiman-McMillan-Shannon theorem, perplexity and alternative measures, data sparseness, conditional modeling, history partitioning, word frequencies, Zipf's law, type-token curves, vocabulary and n-gram growth.
References:N-grams: smoothing, discounting, the Good-Turing estimate, the zero frequency problem, the backoff model, A Dirichlet language model, Ngram data structures, the CMU-Cambridge toolkit .
- T. Bell, J. Cleary, and Ian H. Witten, Text Compression, Prentice-Hall, 1990.
- Wentian Li. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Trans. on Information Theory, 38(6), pp. 1842-1845, 1992.
- P. Harremoes and F. Topsoe. Zipf's law, hyperbolic distributions and entropy loss. IEEE International Symposium on Information Theory, 2002.
- Zipf's Law (webpage by Wentian Li). Many Zipf's Law refs in many fields.
- G. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
References:
- S. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language. 13(4), pp. 319-358, October 1999 .
- K. Church and W. Gale. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5, 19-54, 1991.
- B. Efron and R. Thisted. Estimating the number of unseen species: How many words did Shakespear know?. Biometrika, 63, 435-447, 1976.
- Irving J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40, pp. 237-264, 1953.
- Irving J. Good, Turing's anticipation of empirical Bayes in connection with the cryptanalysis of the Naval Enigma. Journal of Statistical Computation and Simulation, 66(2), pp. 101-112, 2000.
- Jelinek text, chapter 15.
- Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3), pp. 400-401, 1987.
- D. McAllester and R. Schapire. On the convergence rate of Good-Turing estimators. Proceedings of the 13th Annual Conference on Computational Learning Theory, 2000.
- D. McAllester and R. Schapire. Learning theory and language modeling. Exploring Artificial Intelligence in the New Millenium, G. Lakemeyer and B. Nebel (editors), Morgan Kaufmann, 2002.
- Arthur Nadas, On Turing's formula for word probabilities. IEEE Transactions on Acoustics, Speech, and Signal Processing. 33(6), 1414-1416, 1985.
- Arthur Nadas. Good, Jelinek, Mercer, and Robins on Turing's estimate of probabilities. American Journal of Mathematical and Management Sciences, 11, 229-308, 1991.
- H. Ney, U. Essen, and R. Knese. On structuring probabilistic dependences in stochastic language modelling, Computer Speech & Language, 8(1), pp. 1-38, 1994.
- H. Ney, S. Martin, and F. Wessel. Statistical language modeling using leaving-one-out, Corpus-based methods in language and speech processing, S. Young and G. Bloothooft (Editors), pp. 174-207, Kluwer Academic Publishers, 1997.
- Geoffrey Sampson's Good-Turing Frequency Estimation page
- I. Witten and T. Bell. The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4), 1991.
The basic algorithm and example applications. The mathematics underlying the algorithm.
References:
- M. Collins, 1997. The EM Algorithm. Written preliminary exam paper.
- A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39(1), pp. 1-38, 1977.
- G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. Wiley, 1997.
- X. Meng and D. Rubin. Maximum likelihod estimation via the ECM algorithm: A general framework. Biometrika, 80, pp. 267-278, 1993.
- X. Meng and D. van Dyk. The EM algorithm--an old folk-song sung to a fast new tune, 59(3), pp. 511-567, 1997.
- P. Tseng. An analysis of the EM algorithm and entropy-like proximal point methods, Mathematics of Operations Research,29(1):27-44, 2004.
- C. Wu. On the convergence properties of the EM algorithm. Annals of Statistics, 11(1), pp. 95-103, 1983.
Markov chains, hidden Markov models and the forward-backward algorithm, deleted interpolation, tagging.
References:
- L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), pp. 257-286, 1989.
- D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. A practical part-of-speech tagger. Proceedings of the Third Conference on Applied Natural Language Processing, pp. 133-140, 1992. Available as ps and pdf
- S. DeRose. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14(2), pp. 31-39, 1988.
- D. Elworthy. Does Baum-Welch Re-estimation help taggers?Proceedings of the 4th Conference on Applied Natural Language Processing, 1994.
- Julian Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language 6, pp. 225-242, 1992.
- B. Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 20(2), pp. 155--171, 1994.
- R. Weischedel, M. Meteer, R. Schwartz, L. Ramshaw, and J. Palmucci. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2), pp. 359--382, 1993.
- S. Abney and M. Light. Hiding a semantic hierarchy in a Markov model. Proceedings of the ACL'99 Workshop on Unsupervised Learning in Natural Language Processing , pp. 1-8, 1999. (Longer version)
- M. Ciaramita and M. Johnson. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. Proceedings of the 18th International Conference on Computational Linguistics, 2000
- Hang Li and Naoki Abe. Generalizing case frames using a Thesaurus and the MDL principle. Computational Linguistics, 24(2), 1998.
- Philip Resnik. Selectional preference and sense disambiguation. ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, 1997
- WordNet. See also WordNet: An Electronic Lexical Database, Christiane Fellbaum, ed., 1998.
Hierarchical clustering, mutual information techniques, information bottleneck method, decision trees: the CART technique, margin-based methods: SVMs and boosting, bootstrapping.
References:
- P. Brown, V. Della Pietra, P. deSouza, J. Lai and R. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4), pp. 467-480, 1992.
- L. Bahl, P. Brown, P. deSouza and R. Mercer. A tree-based statistical language model for natural language speech recognition. IEEE Transcation on Acoustics, Speech and Signal Processing, 37, pp. 1001--1008, 1989.
- K. Nigam, A. McCallum, S. Thrun and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3). pp. 103-134. 2000.
- N. Tishby, F. Pereira and W. Bialek. The information bottleneck method, Proceedings of the 37th Allerton Conference on Communication, Control and Computing, 1999.
- N. Tishby and N. Slonim. Data clustering by Markovian relaxation and the information bottleneck method. Advances in Neural Information Processing Systems 13, 2000.
- Steven Abney. Bootstrapping. Proceedings of the Conference 40th Annual Meeting of the Association for Computational Linguistics, 2002.
The Chomsky hierarchy, probabilistic context free grammars and inside-outside algorithm, lexicalized probabilistic parsing, phrase structure grammars and dependency grammars, link grammars, structured language model, language and complexity.
References:
- Steven Abney, David McAllester, and Fernando Pereira. Relating probabilistic grammars and automata. Proceedings of the 37th Annual Meeting of the ACL, pp. 542-549, 1999
- T. Booth and R. Thompson. Applying probability measures to abstract languages, IEEE Transactions on Computers, 22, pp. 442--450, 1973
- E. Charniak. Immediate-head parsing for language models, Proceedings of the 39th Annual Meeting of the ACL, pp. 116-123, 2001.
- C. Chelba and F. Jelinek. Structured language modeling. Computer Speech and Language, 14(4), pp. 283-332, 2000.
- Z. Chi. Statistical properties of probabilistic context-free grammars, Computational Linguistics, 25(1), pp. 131-160, 1999.
- Michael Collins. Three generative, lexicalised models for statistical parsing. Proceedings of the 35th Annual Meeting of the ACL, 1997.
- Michael Collins. A new statistical parser based on bigram lexical dependencies. Proceedings of the 34th Annual Meeting of the ACL, 1996.
- F. Jelinek, J. Lafferty, and R. Mercer. Basic methods of probabilistic context free grammars. Speech Recognition and Understanding, P. Laface and R. De Mori, eds., Springer, pp. 347-360, 1992.
- K. Lari and S. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language, 4, pp. 35--56, 1990.
- John Lafferty's notes on Probabilistic context free grammar. 2001.
- S. Levinson. Structural methods in automatic speech recognition. Proceedings of the IEEE, 73(11), pp. 1625-1650, 1985.
- M. Miller and J. O'Sullivan. Entropies and combinatorics of random branching processes and context-free languages, IEEE Trans. on Information Theory, 38(4), pp. 1292-1310, 1992
- Brian Roark. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2), pp. 249-285, 2001.
- A. Stolcke. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2), pp. 165-201, 1995.
- See also chapters 11-12 of the Manning text.
- See also chapters 9-13 of the Jurafsky text.
The singular value decomposition (SVD), latent semantic indexing (LSI), demensionality reduction, non-negative factorizations.
References:
- J. Bellegarda. Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88(8), pp. 1279-1296, 2000.
- M. Berry, S. Dumais, and G. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), pp. 573-595, 1995.
- D. Blei, A. Ng and M. Jordan. Latent Dirichlet allocation. Advances in Neural Information Processing Systems 14, 2001
- Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1), pp.177-196, 2001.
- S. Deerwester, S. Dumais, G. Furnas, T. Landauer and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), pp. 391-407, 1990.
- C. Papadimitriou, P. Raghavan, H. Tamaki and S. Vempala. Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences, 61(2), pp. 217--235, 2000.
- D. Lee and H. Seung, Learning the parts of objects with nonnegative matrix factorization. Nature 401, 788 (1999).
- D. Lee and H. Seung. Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems 13, pp. 556-562, 2001.
- M. Novak and R. Mammone. Improvement of non-negative matrix factorization based language model using exponential models. IEEE Workshop on Automatic Speech Recognition and Understanding, December 2001.
- Colorado LSA homepage
- Tennessee LSI web page
Random fields and exponential models, duality: maximum likelihood and maximum entropy, iterative scaling, information geometry and alternating minimization, prior and regularization, Bregman distance, latent maximum entropy principle.
References:Applications to natural language processing: feature selection and induction, efficient feature expectation calculation, triggers, exploiting syntactic, semantic and collocational dependencies.
- Adam Berger's tutorial on MaxEnt
- W. Byrne. Alternating minimization and Boltzmann machine learning. IEEE Trans. on Neural Networks, 3(4), pp. 612-620, 1992.
- I. Csiszar and Tusnady. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1, 205-237, 1984.
- I. Csiszar. A geometric interpretation of Darroch and Ratchliff's generalized iterative scaling. The Annals of Statisics, 17(3), pp. 1409-1413. 1989. slides.
- S. Della Pietra, V. Della Pietra and J. Lafferty. Inducing features of random fields. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(4), pp. 380-393, 1997.
- S. Della Pietra, V. Della Pietra and J. Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-01-109, School of Computer Science, CMU, 2001.
- J. Darroch and D. Ratchliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5), pp. 1470-1480, 1972.
- E. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
- O'Sullivan. Alternating minimzation algorithms: From Blahut-Arimoto to expectation-maximization. Codes, Curves and Signals: Common Threads in Communications, A. Vardy, (editor), Kluwer, 1998.
- S. Wang, D. Schuurmans and Y. Zhao. The latent maximum entropy principle. 2002.
References:
- Steven Abney. Stochastic attribute-value grammars. Computational Linguistics, 23(4), pp. 597-618, 1997.
- A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing . Computational Linguistics, 22(1), 1996.
- S. Chen and R. Rosenfeld. A survey of smoothing techniques for ME models. IEEE Trans. Speech and Audio Processing, 8(1), pp. 37--50, 2000.
- S. Khudanpur and J. Wu. Maximum entropy techniques for exploiting syntactic, semantic and collocational dependencies in language modeling. Computer Speech and Language, 14(4), pp. 355-372, 2000.
- R. Lau. Adaptive statistical language modeling, S.M. Thesis, EECS Department, MIT, 1994
- K. Mark, M. Miller and U. Grenander. Constrained stochastic language models. Image Models And Their Speech Model Cousins, S. Levinson and L. Shepp (Editors), Springer, 1996.
- R. Rosenfeld. A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, 10, pp. 187-228, 1996. slides.
- A. Ratnaparkhi. Learning to parse natural language with maximum entropy models, Machine Learning, 34, pp. 151-175, 1999.
- S. Wang, R. Rosenfeld and Y. Zhao. Latent maximum entropy principle for statistical language modeling. IEEE Workshop on Automatic Speech Recognition and Understanding, December 2001.
- See also chapters 13-14 of the Jelinek text.
caches, Bayesian methods.
References:
- R. Kuhn and R. De Mori, A cache-based natural language model for speech recognition, IEEE Trans. on Pattern Analysis and Machine Intelligence, 12(3), pp. 570-583, 1990.
Statistical machine translation, statistical information retrieval, statistical information extraction.
References:
- P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roosin. A statistical approach to machine translation. Computational Linguistics 16, 79--85, 1990.
- P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2), pp. 263--311, 1993.
- Jamie Callan's slides on statistical information retrieval. 2001.
- The Lemur Toolkit for Language Modeling and Information Retrieval.
Biological sequences: nucleotide bases, amino acids, proteins; biological structures: primary structure, secondary structure, tertiary structure, quaternary structure.
References:
- Durbin text.
- David B. Searls. The linguistics of DNA, American Scientist, 80(6), pp. 579-591, 1992.
- David B. Searls. Computational linguistics of biological sequences, Artificial Intelligence and Molecular Biology, Lawrence Hunter (Editor), MIT Press, 1993.
- E. Segal, B. Taskar, A. Gasch, N. Friedman and D. Koller. Rich probabilistic models for gene expression, Bioinformatics. 17 Suppl 1:S243-52, 2001.
- Workshop on language modeling of biological data, 2001.