
A mixture of n-gram models at Tohoku University

The Aso laboratory at Tohoku University in Sendai specializes in the intelligent digitization of paper documents. In particular, it has developed a character recognition system that includes a disambiguation engine.

This engine uses n-gram models as approximations of language models; its originality lies in mixing the models for n = 0, 1, 2, 3. The final model is thus a linear combination of these models, in which the weight of each model is adjusted according to the size of the training corpus. Because the weights are estimated so as to maximize the consistency of the final model, the mixture can still draw on 3-gram statistics even when the corpus is too small to yield a consistent 3-gram model on its own.
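As an illustration of such a mixture, here is a minimal Python sketch of a linearly interpolated character model. It is not the Aso laboratory's implementation, and it rests on several assumptions. The 0-gram model is taken to be the uniform distribution over the character set. The weights are fitted by expectation-maximization on held-out text, a standard technique known as deleted interpolation, whereas the source does not say how its weights are estimated. The function names (train, component_probs, mix_prob, estimate_weights) and the toy corpus are hypothetical.

```python
from collections import Counter

def train(text, max_n=3):
    """Count character k-grams (k = 1..max_n) and the contexts preceding them."""
    grams = [Counter() for _ in range(max_n + 1)]
    ctxs = [Counter() for _ in range(max_n + 1)]
    for k in range(1, max_n + 1):
        for i in range(len(text) - k + 1):
            g = text[i:i + k]
            grams[k][g] += 1
            ctxs[k][g[:-1]] += 1  # context = the k-gram minus its last character
    return grams, ctxs

def component_probs(history, c, grams, ctxs, vocab_size, max_n=3):
    """Return [P_0, P_1, ..., P_max_n] for character c following history."""
    probs = [1.0 / vocab_size]  # assumed 0-gram model: uniform over characters
    for k in range(1, max_n + 1):
        if len(history) < k - 1:
            probs.append(0.0)  # not enough history for this order
            continue
        context = history[len(history) - (k - 1):]
        denom = ctxs[k][context]
        probs.append(grams[k][context + c] / denom if denom else 0.0)
    return probs

def mix_prob(history, c, weights, grams, ctxs, vocab_size, max_n=3):
    """Linear combination of the component models."""
    ps = component_probs(history, c, grams, ctxs, vocab_size, max_n)
    return sum(w * p for w, p in zip(weights, ps))

def estimate_weights(heldout, grams, ctxs, vocab_size, max_n=3, iters=20):
    """Fit mixture weights by EM on non-empty held-out text (assumed method)."""
    weights = [1.0 / (max_n + 1)] * (max_n + 1)
    for _ in range(iters):
        resp = [0.0] * (max_n + 1)
        for i, c in enumerate(heldout):
            history = heldout[max(0, i - (max_n - 1)):i]
            ps = component_probs(history, c, grams, ctxs, vocab_size, max_n)
            total = sum(w * p for w, p in zip(weights, ps))
            for k in range(max_n + 1):
                resp[k] += weights[k] * ps[k] / total  # share of model k
        weights = [r / len(heldout) for r in resp]
    return weights

# Toy data, for illustration only.
train_text = "the cat sat on the mat and the rat sat on the hat"
heldout_text = "the bat sat on the cat"
vocab_size = len(set(train_text + heldout_text))

grams, ctxs = train(train_text)
weights = estimate_weights(heldout_text, grams, ctxs, vocab_size)
print(weights)
print(mix_prob("th", "e", weights, grams, ctxs, vocab_size))
```

With a small training text, the held-out data tends to pull weight toward the lower-order models, since many 3-grams are unseen. The uniform 0-gram term keeps every probability strictly positive, so characters never observed in training can still be scored.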



Jean-Philippe Vert
Sun Dec 6 11:05:42 MET 1998