N-gram models at Kyoto University

To estimate probabilities with an n-gram model from a corpus, for any n, one needs to count the occurrences of every n-gram in the corpus, which can be computationally expensive. Doctor Kurohashi's laboratory at Kyoto University uses a simple method to get these estimates quickly for any n.

The method consists of assigning one pointer to every character of the corpus used to train the model. The set of pointers is then sorted according to the lexicographic order of the character sequences that follow each pointer's character. Once the pointers are sorted, those that point to a character followed by a given n-gram sit next to one another in the sorted order, so the number of occurrences of any n-gram can easily be computed by counting the pointers in that contiguous run.
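
The sorted-pointer idea can be illustrated with a minimal Python sketch. It is not the laboratory's implementation: the function names build_sorted_pointers and count_ngram are hypothetical, pointers are represented as integer indices into the corpus string, and each pointer is assumed to be sorted by the sequence that starts at its own character. The contiguous run of matching pointers is located with two binary searches, which is one possible way to do the counting.

```python
def build_sorted_pointers(corpus):
    """Return one pointer (an index) per character, sorted by the
    character sequence that starts at that position.
    Assumption: slicing copies each suffix, which is fine for a small
    demonstration but wasteful for a large corpus."""
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])


def _boundary(corpus, pointers, ngram, past_equal):
    """Binary search over the sorted pointers.
    Returns the first position whose n-character prefix is >= ngram,
    or > ngram when past_equal is True."""
    n = len(ngram)
    lo, hi = 0, len(pointers)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = corpus[pointers[mid]:pointers[mid] + n]
        if prefix < ngram or (past_equal and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_ngram(corpus, pointers, ngram):
    """Count occurrences of ngram as the length of the contiguous run
    of sorted pointers whose sequence begins with ngram."""
    start = _boundary(corpus, pointers, ngram, past_equal=False)
    end = _boundary(corpus, pointers, ngram, past_equal=True)
    return end - start


corpus = "abracadabra"                    # toy corpus, for illustration only
pointers = build_sorted_pointers(corpus)  # sorted once, reused for every n
print(count_ngram(corpus, pointers, "a"))     # 5
print(count_ngram(corpus, pointers, "abra"))  # 2
print(count_ngram(corpus, pointers, "x"))     # 0
```

The pointers are sorted only once, and the same sorted list then answers count queries for n-grams of any length, which is what makes the approach attractive when n is not fixed in advance.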



Jean-Philippe Vert
Sun Dec 6 11:05:42 MET 1998