Natural Language Processing

Modern Natural Language Processing (NLP) and Information Retrieval (IR) techniques typically rely on statistical language models extracted from large corpora or text collections (e.g. 200 million words). More recent work has introduced increasingly context-sensitive approaches, i.e. approaches in which the probability of a word occurring is not modelled as constant throughout a text, and for which it therefore does not suffice to rely on simple word occurrence counts.
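The distinction can be illustrated with a toy sketch (the function names, mixing weight and sample sentence below are illustrative assumptions, not part of any of the projects described here): a plain unigram model assigns every word a constant probability, whereas a simple "cache" mixture raises a word's probability when it has occurred in the recent history.

```python
from collections import Counter

def unigram_probs(tokens):
    """Constant, context-free probability of each word, from raw counts."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def cache_prob(word, history, base_probs, lam=0.3):
    """Context-sensitive estimate: mix the static unigram probability
    with the word's relative frequency in the recent history (a cache)."""
    if not history:
        return base_probs.get(word, 0.0)
    cache = Counter(history)
    return (1 - lam) * base_probs.get(word, 0.0) + lam * cache[word] / len(history)

tokens = "the model saw the data and the model improved".split()
base = unigram_probs(tokens)

# With no history, 'model' keeps its corpus-wide probability;
# once 'model' has just occurred, the cached estimate rises.
print(cache_prob("model", [], base))
print(cache_prob("model", ["the", "model", "saw"], base))
```

The cache component is the simplest case; the burstiness and adaptive-filtering work mentioned below deals with far richer context models, but the same principle applies: occurrence counts alone no longer determine the probability.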

Instead, the techniques include, for example, the processing of large numbers of contexts and n-grams (of which many are needed, lest the data be too sparse); advanced statistical techniques (e.g. Bayesian statistics supported by Markov Chain Monte Carlo methods); advanced machine learning techniques for the induction of probabilistic grammars; and several compute-intensive methods for dimensionality reduction in shallow text representation models (such as Latent Semantic Analysis, LSA). In other words, all these methods are compute-intensive as well as data-intensive. In addition, many standard packages in NLP and IR assume an underlying Unix (or Linux) environment.
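The sparsity point is easy to demonstrate with a small sketch (the sample text and the singleton-fraction measure are illustrative assumptions): as n grows, an ever larger share of n-gram types is seen only once, which is why usable n-gram models demand very large corpora.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = ("the cat sat on the mat and the dog sat on the rug "
        "while the cat watched the dog").split()

singleton_fraction = {}
for n in (1, 2, 3):
    counts = Counter(ngrams(text, n))
    # Fraction of n-gram types seen exactly once: a crude sparsity measure.
    singleton_fraction[n] = sum(1 for c in counts.values() if c == 1) / len(counts)

print(singleton_fraction)
```

Even on this 19-word text the singleton fraction climbs steeply from unigrams to trigrams; on realistic corpora the effect is what forces both the data volumes and the processing loads described above.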

All the examples above are drawn from our NLP and IR projects over the last six years or so. These include four PhDs (Sarkar on Term Burstiness; Chantree on Ambiguity Detection; Haley on using LSA in automatic Marking; Nanas on adaptive Information Filtering) and two EPSRC-funded projects (Autoadapt; Context Sensitive Information Retrieval). The same techniques will also be required for a recently funded EPSRC project based on Chantree's work (Matrex) and for a Jisc project, both of which will use this range of approaches. We also envisage using the array for grammar induction, to allow the OU to scale up its current capacity for marking short answers automatically (we are in the process of bidding for strategic funding for this, in collaboration with the COMSTL CETL, the VLE and the Science Faculty).

The Linux array is a vital piece of infrastructure for this line of research. The processing volume we require is increasing: several of the projects we have secured commit us to developing tools that need both access to large datasets (on a scale we have not engaged with before) and the ability to deliver reasonable results at run time.