
A vectorial approach for information extraction at NTT

NTT's Human Interface Laboratory at Yokosuka, south of Tokyo, was working on a project of topic extraction from broadcast English news. The project is based on a list of 70,000 words describing topics; this list therefore serves as a concept base for a vector space representation of concepts.

From a corpus of news records, a distance matrix was created between the words of this list and all English words, by counting co-occurrences between the list words appearing in headlines and the words used in the body of the news. The resulting matrix could then be used to build mutual information or χ² models, in order to finally define a score between each word of the list and each English word. The score of a list word with respect to a news article can then be computed as the normalized sum of the scores of this word with respect to the words composing the article.
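As a rough illustration (not NTT's actual implementation), the following Python sketch counts co-occurrences between topic words seen in headlines and the words of the news bodies, derives a mutual-information-style score from the counts, and scores a topic word against an article as a normalized sum. The corpus format and function names are assumptions made for the example.

    # Minimal sketch, assuming a corpus given as (headline_words, body_words)
    # pairs and a fixed set of topic words.
    import math
    from collections import Counter
    from itertools import product

    def train_scores(corpus, topic_words):
        """Estimate a pointwise-mutual-information style score between every
        topic word (seen in headlines) and every word seen in article bodies."""
        topic_counts = Counter()   # occurrences of topic words in headlines
        word_counts = Counter()    # occurrences of words in bodies
        cooc = Counter()           # joint (topic word, body word) counts
        total_pairs = 0
        for headline, body in corpus:
            heads = [w for w in headline if w in topic_words]
            for t in heads:
                topic_counts[t] += 1
            for w in body:
                word_counts[w] += 1
            for t, w in product(heads, body):
                cooc[(t, w)] += 1
                total_pairs += 1
        total_topics = sum(topic_counts.values())
        total_words = sum(word_counts.values())
        scores = {}
        for (t, w), n in cooc.items():
            p_joint = n / total_pairs
            p_t = topic_counts[t] / total_topics
            p_w = word_counts[w] / total_words
            scores[(t, w)] = math.log(p_joint / (p_t * p_w))
        return scores

    def article_score(topic_word, article_words, scores):
        """Score of a topic word for an article: normalized sum of its scores
        with respect to the words composing the article."""
        vals = [scores.get((topic_word, w), 0.0) for w in article_words]
        return sum(vals) / len(vals) if vals else 0.0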

This model can be used to extract the words of the list that have the highest scores with respect to a given news report, in order to identify its topic automatically. Conversely, it is possible to retrieve the news article for which a given word has the highest score.
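Continuing the same hypothetical sketch, both directions amount to ranking by this score:

    def extract_topics(article_words, topic_words, scores, k=5):
        """Return the k list words scoring highest against an article."""
        ranked = sorted(topic_words,
                        key=lambda t: article_score(t, article_words, scores),
                        reverse=True)
        return ranked[:k]

    def best_article(topic_word, articles, scores):
        """Return the article for which the given word scores highest."""
        return max(articles,
                   key=lambda a: article_score(topic_word, a, scores))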

This method implicitly makes use of a 70,000-dimensional vector space representation of concepts, in which every English word is represented by its vector of scores. Geometric tools appear naturally, such as the use of averages to define the score of a set of words.
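In that view, and under the same assumptions as the sketch above, each English word can be read as the vector of its scores against the 70,000 topic words, and a set of words can be represented by the average of these vectors:

    import numpy as np

    def word_vector(word, topic_list, scores):
        """Vector of scores of one English word against every topic word."""
        return np.array([scores.get((t, word), 0.0) for t in topic_list])

    def average_vector(words, topic_list, scores):
        """Represent a set of words (e.g. an article) by its mean vector."""
        vecs = [word_vector(w, topic_list, scores) for w in words]
        return np.mean(vecs, axis=0)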


