My favorite papers at the NAACL’2013 conference in Atlanta last week were “Learning a Part-of-Speech Tagger From Two Hours of Annotation” (Dan Garrette and Jason Baldridge, University of Texas at Austin) and “Improved Reordering for Phrase-Based Translation with Sparse Features” (Colin Cherry, NRC Canada).
Garrette and Baldridge show how, starting from annotated data that can be obtained with minimal effort (two hours of work), one can apply existing model minimization, unsupervised, and supervised learning techniques to produce a reasonable tagger. By starting instead from type-supervised data generated automatically from existing annotated corpora, the authors were able to go a step further and answer not only the question of how good a POS tagger one can build with a two-hour annotation investment, but also how many hours of annotation are needed to deliver a state-of-the-art POS tagger.
The reason I liked Garrette and Baldridge’s paper most, though, is that it breaks away from an increasingly worrisome trend in the field: unsupervised learning for the sake of unsupervised learning. For many natural language processing (NLP) tasks, from part-of-speech tagging to dependency parsing, unsupervised learning produces crappy, unusable results. In my view, spending years of effort and entire PhDs advancing the state of the art from 40% to 50% accuracy, when two hours of investment in annotation takes one to 80-90% accuracy, does not seem like the kind of impact one would hope to get from some of the top talent in the field. I am hopeful that Garrette and Baldridge’s approach will encourage NLP researchers to write more papers that fall into the category “Look Mom! I improved the state of the art with only $250 of investment in data annotation” and fewer papers that fall into the category “Look Mom! I spent three years of my life and no money on data annotation, but I got a system that is 53% below the state of the art.”
In the second paper, Cherry shows how one can estimate and exploit sparse reordering features in statistical machine translation. By tightly integrating the sparse model estimation into the decoder, Cherry obtains statistically significant gains in translation accuracy. Any paper that advances the state of the art in reordering for SMT ranks highly on my list – getting the target words into the right order is arguably one of the most difficult challenges in machine translation today. However, this paper got me excited for a different reason altogether. Training SMT engines has, to date, followed two classes of approaches. At one extreme, most researchers first estimate classes of parameters (a direct translation model, a reordering model, etc.) using maximum likelihood estimators and then weight the importance of these classes by decoding and re-decoding a tuning corpus. More recently, several researchers have gone to the other extreme, estimating all hundreds of millions of SMT parameters online, by decoding the entire training corpus again and again. Cherry’s paper falls in the middle: most parameters are estimated cheaply, using ML estimators, while a handful of parameters, those that govern reordering, are estimated online, in conjunction with the coarser weights that govern entire classes of parameters/sub-models.
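This middle ground can be sketched abstractly. In the toy Python sketch below, dense sub-model weights are held fixed (as if estimated offline by maximum likelihood) while a few sparse reordering features receive perceptron-style online updates whenever the decoder prefers a worse hypothesis over the oracle. All names, features, and numbers are invented for illustration, and the perceptron update merely stands in for the more sophisticated learner used in the actual paper.

```python
# Hypothetical sketch: fixed dense sub-model weights plus sparse reordering
# features trained online. Not Cherry's actual algorithm; a perceptron-style
# update stands in for the real system's learner.
from collections import defaultdict

def score(dense_feats, sparse_feats, dense_w, sparse_w):
    """Model score = fixed dense part + learned sparse part."""
    return (sum(dense_w[k] * v for k, v in dense_feats.items())
            + sum(sparse_w[k] * v for k, v in sparse_feats.items()))

def online_update(sparse_w, oracle_feats, model_feats, lr=0.1):
    """Perceptron step: only the sparse reordering weights move;
    the dense sub-model weights stay untouched."""
    for k, v in oracle_feats.items():
        sparse_w[k] += lr * v
    for k, v in model_feats.items():
        sparse_w[k] -= lr * v

dense_w = {"tm": 1.0, "lm": 0.5}   # sub-model weights, estimated offline
sparse_w = defaultdict(float)      # sparse reordering weights, start at zero

# Two hypotheses for one tuning sentence: (dense features, sparse features).
oracle = ({"tm": -2.0, "lm": -1.0}, {"swap(JJ,NN)": 1.0})
model_best = ({"tm": -1.5, "lm": -1.2}, {"monotone": 1.0})

# If the model prefers the wrong hypothesis, nudge the sparse weights only.
if score(*model_best, dense_w, sparse_w) >= score(*oracle, dense_w, sparse_w):
    online_update(sparse_w, oracle[1], model_best[1])
```

The appeal of this arrangement is that the expensive online loop touches only the handful of weights that actually need discriminative treatment, while the bulk of the model stays cheap to estimate.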
Almost 15 years ago, it took me a few years to understand that IBM Models 1 and 2 are not only the first documented steps of an intellectual journey that eventually led the IBM team to Model 5, but, more importantly, a required prop that keeps the estimation of Models 3-5 from getting stuck in horrible local maxima (training Models 3-5 directly, without biasing them with the parameters learned by Model 1, leads to terrible results). This makes me wonder whether Colin Cherry’s approach should be evolved further into one where online training methods are first constrained to learn only a subset of parameters and then encouraged to learn incrementally more complex classes of parameters, in an effort to avoid local maxima that have nothing to do with translation quality.
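The staged regime I have in mind could look roughly like the toy sketch below: optimize a simple subset of the parameters first, then unlock the richer parameters initialized from that solution, rather than attacking the full non-convex objective from a cold start. The objective and the stage split are invented purely for illustration; nothing here is specific to SMT or to IBM model training.

```python
# Toy sketch of staged training: stage 1 learns only the "simple" parameter,
# stage 2 learns all parameters starting from the stage-1 solution.
# The objective f(a, b) = (a - 1)^2 + (b - a^2)^2 is invented for illustration.

def grad(p):
    a, b = p["a"], p["b"]
    return {"a": 2 * (a - 1) - 4 * a * (b - a * a),
            "b": 2 * (b - a * a)}

def sgd(grad_fn, params, free, steps=200, lr=0.05):
    """Gradient descent that updates only the parameters named in `free`,
    keeping the rest frozen."""
    for _ in range(steps):
        g = grad_fn(params)
        for name in free:
            params[name] -= lr * g[name]
    return params

params = {"a": 0.0, "b": 0.0}
params = sgd(grad, params, free={"a"})        # stage 1: simple parameters only
params = sgd(grad, params, free={"a", "b"})   # stage 2: all parameters
```

The analogy to Models 1-5 is loose but, I think, suggestive: the early stage plays the role of Model 1, handing the later stage a starting point from which the full objective is far less likely to wander into a useless optimum.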