
Information extraction with template matching at Kyutech

The Nomura laboratory at the Kyushu Institute of Technology (Kyutech) has developed software for extracting information from written news. A demonstration was available on the Internet in July 1998. It worked with a database of 2,000 news articles about the commercialization of new goods, from which several pieces of information could be extracted, e.g. the type of goods, their name, manufacturer, and price.

All texts were first morphologically analyzed with JUMAN, the parser developed at Kyoto University. The surface information produced by this analysis was then processed by the search engine through template matching. Each type of information was characterized by a set of templates describing both the information itself and the linguistic environment in which it appears, e.g. the Japanese particle used before a date. The search engine finds candidates for each piece of information by classical template matching, applying every template to every word of a given news article. This produces candidates for every sentence; the final candidates are chosen according to the number of templates that selected each candidate and the number of sentences from which it was extracted.
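The matching and voting steps described above can be illustrated with a minimal Python sketch. It is not the Kyutech implementation: the `Token` and `Template` types, the one-word context window, the English toy data, and the way the two counts are combined (template count first, sentence count as tie-breaker) are all assumptions made for illustration. In the real system the tokens would come from JUMAN's analysis of Japanese text.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Token(NamedTuple):
    surface: str  # word as written
    pos: str      # part of speech (hypothetical tag set)


class Template(NamedTuple):
    field: str                   # piece of information, e.g. "price"
    pos: str                     # required part of speech of the candidate
    left: Optional[str] = None   # required surface form of the previous word
    right: Optional[str] = None  # required surface form of the next word


def matches(t, sentence, i):
    """True if template t accepts the word at position i of the sentence."""
    if sentence[i].pos != t.pos:
        return False
    if t.left is not None and (i == 0 or sentence[i - 1].surface != t.left):
        return False
    if t.right is not None and (
        i + 1 >= len(sentence) or sentence[i + 1].surface != t.right
    ):
        return False
    return True


def extract(article, templates):
    """Apply every template to every word, then vote per field."""
    by_template = defaultdict(lambda: defaultdict(set))
    by_sentence = defaultdict(lambda: defaultdict(set))
    for s_idx, sentence in enumerate(article):
        for i in range(len(sentence)):
            for t_idx, t in enumerate(templates):
                if matches(t, sentence, i):
                    cand = sentence[i].surface
                    by_template[t.field][cand].add(t_idx)
                    by_sentence[t.field][cand].add(s_idx)
    # Assumed ranking: number of distinct templates, then distinct sentences.
    return {
        field: max(
            cands,
            key=lambda c: (len(cands[c]), len(by_sentence[field][c])),
        )
        for field, cands in by_template.items()
    }


# Toy article: three already-tokenized sentences.
article = [
    [Token("Acme", "noun"), Token("will", "aux"), Token("sell", "verb"),
     Token("Widget", "noun"), Token("for", "prep"),
     Token("500", "number"), Token("yen", "unit")],
    [Token("Widget", "noun"), Token("costs", "verb"),
     Token("500", "number"), Token("yen", "unit")],
    [Token("about", "adv"), Token("30", "number"), Token("staff", "noun")],
]
templates = [
    Template("price", "number", right="yen"),
    Template("price", "number", left="for"),
    Template("price", "number", left="about"),
    Template("name", "noun", left="sell"),
    Template("name", "noun", right="costs"),
]
result = extract(article, templates)
```

Here "500" is selected as the price because two templates and two sentences support it, whereas "30" is proposed by a single template in a single sentence.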

A set of almost 4,000 templates was first created to characterize the information to be extracted. A corpus-based reduction procedure then cut that number to about 1,400. In the public demonstration, extraction precision appears to be over 90 percent.
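How the reduction procedure worked is not stated here. One plausible corpus-based approach, shown purely as an assumption, is to run every template over a corpus with known answers and discard templates that never fire or that are wrong too often. In the sketch below, `prune_templates`, the `firings` mapping, and the threshold values are hypothetical.

```python
def prune_templates(firings, min_precision=0.5, min_firings=1):
    """Keep templates that fire often enough and are precise enough.

    firings maps a template id to a list of booleans, one per firing
    on the corpus: True if the extracted candidate was correct.
    """
    kept = []
    for template_id, outcomes in firings.items():
        # Templates that never (or too rarely) fire carry no evidence.
        if not outcomes or len(outcomes) < min_firings:
            continue
        precision = sum(outcomes) / len(outcomes)
        if precision >= min_precision:
            kept.append(template_id)
    return kept


# Hypothetical firing record for three templates.
firings = {
    "number+yen": [True, True, True, False],
    "number-after-about": [False, False, True],
    "never-fires": [],
}
kept = prune_templates(firings, min_precision=0.7)
```

With these toy numbers only "number+yen" survives: its precision is 0.75, while the second template is correct in one firing out of three and the third never fires.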



Jean-Philippe Vert
Sun Dec 6 11:05:42 MET 1998