
Decision trees without dictionaries at ATR

Researchers at the Advanced Telecommunications Research Institute (ATR) have pointed out the proliferation of tagging systems and associated dictionaries for the morphological analysis of Japanese. They have therefore developed a robust parser that is based on decision trees and does not require any dictionary, in order to overcome the following frequent problems:

This method consists of creating a large number of questions whose answers carry information about word boundaries and parts of speech (POS). For example, the string "ing" at the end of an English word might indicate that the word is a verb. In the same way, many structures in Japanese give information about the morphology of a text. Such observations lead to a series of questions, each of which can combine several points of interest (e.g. "the word ends with hiragana characters and the following character is a kanji").
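To make the idea of a "question" concrete, here is a minimal sketch that encodes the two examples above as yes/no predicates. The function names, the Unicode ranges used to recognise scripts, and the sample sentence are assumptions made for illustration; they are not taken from the ATR system.

```python
def char_class(ch):
    """Coarse script class of one character, decided by Unicode block."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def ends_with_ing(word):
    """Question for English: does the word end with the string 'ing'?"""
    return word.endswith("ing")


def hiragana_then_kanji(text, end):
    """Combined question for Japanese: does the candidate word text[:end]
    end with a hiragana character, and is the following character a kanji?"""
    if end <= 0 or end >= len(text):
        return False
    return (char_class(text[end - 1]) == "hiragana"
            and char_class(text[end]) == "kanji")


# Hypothetical sample sentence, used only to exercise the questions.
sentence = "私は学生です"
answers = [hiragana_then_kanji(sentence, i) for i in range(len(sentence) + 1)]
```

Each question returns a plain boolean, so any number of them can be evaluated at every candidate boundary and handed to a tree-building procedure.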

The resulting series of questions is then used to build decision trees with classical methods, through a training step on an already annotated corpus. The tag set used in the experiments comprises 209 different tags, grouped into 18 classes (common noun, proper noun, etc.).
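The text does not say which classical method is used to grow the trees. As one plausible illustration, the sketch below selects the question with the highest information gain (entropy reduction) on an annotated sample, which is the core step of ID3-style tree induction. The toy corpus, the tag names and the question names are invented for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(samples, question):
    """Entropy reduction obtained by splitting samples on a yes/no question.

    samples:  list of (context, tag) pairs from an annotated corpus
    question: function mapping a context to True or False
    """
    yes = [tag for ctx, tag in samples if question(ctx)]
    no = [tag for ctx, tag in samples if not question(ctx)]
    if not yes or not no:
        return 0.0  # the question does not split the data at all
    all_tags = [tag for _, tag in samples]
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(samples)
    return entropy(all_tags) - weighted


def best_question(samples, questions):
    """Name of the question with the highest information gain."""
    return max(questions, key=lambda name: information_gain(samples, questions[name]))


# Toy annotated corpus (invented; real training data would be far larger).
samples = [
    ("running", "verb"), ("singing", "verb"), ("eating", "verb"),
    ("Paris", "proper noun"), ("Tokyo", "proper noun"),
    ("table", "common noun"),
]
questions = {
    "ends_with_ing": lambda w: w.endswith("ing"),
    "starts_with_capital": lambda w: w[0].isupper(),
}
root = best_question(samples, questions)
```

Growing a full tree would repeat this selection recursively on the "yes" and "no" subsets until a stopping criterion is met.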

Once the tree has been trained, each leaf contains a probability distribution over the morphological structure (segmentation and tagging) of input sentences.
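The following sketch shows one way a trained tree could store and return such a distribution. It is a simplification under stated assumptions: the class names `Node` and `Leaf` are hypothetical, the tree is built by hand rather than learned, and the outcomes are tags only, whereas the system described here predicts segmentation and tagging jointly.

```python
from collections import Counter


class Leaf:
    """Terminal node: relative frequencies of the outcomes that reached it."""

    def __init__(self, outcomes):
        counts = Counter(outcomes)
        total = sum(counts.values())
        self.distribution = {o: n / total for o, n in counts.items()}


class Node:
    """Internal node: a yes/no question and one subtree per answer."""

    def __init__(self, question, yes, no):
        self.question = question
        self.yes = yes
        self.no = no


def predict(tree, context):
    """Answer questions from the root down and return the leaf's distribution."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(context) else tree.no
    return tree.distribution


# Hand-built toy tree; the training outcomes at each leaf are invented.
tree = Node(
    lambda word: word.endswith("ing"),
    yes=Leaf(["verb", "verb", "verb", "common noun"]),
    no=Leaf(["proper noun", "common noun"]),
)
dist = predict(tree, "running")
```

Returning the whole distribution rather than a single best tag lets a later stage compare alternative analyses of the same sentence by their probabilities.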



Jean-Philippe Vert
Sun Dec 6 11:05:42 MET 1998