
Use of a concept base at NTT

NTT Communication Science (CS) Laboratory had an information retrieval project under development in collaboration with Stanford University and the Stanford Japan Center. The retrieval system is based on the concept base, which makes it possible to measure the similarity between words and to cluster texts according to their content.

The concept base, which gives information on the meaning of words, contains about 20,000 concepts. Any set of words, e.g. a request, can be represented by a vector in the 20,000-dimensional concept vector space. The similarity between two words or sets of words is defined as the inner product of their representative vectors, and the vector of a set of words is defined as the average of the vectors of its words. Conversely, the characteristic words of a vector in the concept space are the words whose vectors are closest to that vector with respect to the Euclidean norm.
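The three definitions above can be illustrated with a minimal sketch. It assumes a toy concept base with 4 concepts instead of about 20,000; the words, the weights and the function names (`set_vector`, `similarity`, `characteristic_words`) are hypothetical and are not taken from the NTT system.

```python
import math

# Hypothetical toy concept base: each word maps to a vector of concept weights.
# The real concept base has about 20,000 dimensions; 4 are used here.
CONCEPT_BASE = {
    "car":    [0.9, 0.1, 0.0, 0.0],
    "truck":  [0.8, 0.2, 0.0, 0.0],
    "banana": [0.0, 0.0, 0.9, 0.1],
    "apple":  [0.0, 0.1, 0.8, 0.1],
}

def set_vector(words):
    """Vector of a set of words: the average of the word vectors."""
    vectors = [CONCEPT_BASE[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def similarity(words_a, words_b):
    """Similarity of two sets of words: inner product of their vectors."""
    a, b = set_vector(words_a), set_vector(words_b)
    return sum(x * y for x, y in zip(a, b))

def characteristic_words(vector, k=2):
    """The k words whose vectors are closest to `vector` in Euclidean norm."""
    return sorted(CONCEPT_BASE, key=lambda w: math.dist(CONCEPT_BASE[w], vector))[:k]

if __name__ == "__main__":
    print(similarity(["car"], ["truck"]))    # high: shared concepts
    print(similarity(["car"], ["banana"]))   # low: no shared concepts
    print(characteristic_words(set_vector(["car", "truck"])))
```

Note that, following the text, similarity uses the inner product while characteristic words use Euclidean distance; the two measures only agree in general when the vectors are normalised, which this sketch does not assume.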

These definitions can be used to expand a request by adding the words that are characteristic of the vector representing the request. Moreover, articles with similar vectors can be clustered, and characteristic words can be extracted from each cluster. In particular, it is possible to retrieve texts that do not explicitly contain the words of the query but are semantically close to it.
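As a rough illustration of request expansion and of retrieval without shared words, the sketch below uses a hypothetical 3-concept base and two made-up documents. The names `expand` and `rank`, the data, and the choice to skip words missing from the concept base are assumptions made for the example, not details of the NTT system.

```python
import math

# Hypothetical toy concept base (3 concepts instead of about 20,000).
CONCEPT_BASE = {
    "car":        [0.90, 0.10, 0.0],
    "automobile": [0.85, 0.15, 0.0],
    "engine":     [0.70, 0.30, 0.0],
    "fruit":      [0.00, 0.10, 0.9],
    "banana":     [0.00, 0.00, 1.0],
}

def mean_vector(words):
    """Average of the vectors of the known words (unknown words are skipped)."""
    vectors = [CONCEPT_BASE[w] for w in words if w in CONCEPT_BASE]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def expand(query, k=1):
    """Add to the query the k words closest to its vector in Euclidean norm."""
    q = mean_vector(query)
    candidates = [w for w in CONCEPT_BASE if w not in query]
    candidates.sort(key=lambda w: math.dist(CONCEPT_BASE[w], q))
    return list(query) + candidates[:k]

def rank(query, documents):
    """Order document ids by inner product between query and document vectors."""
    q = mean_vector(query)
    def score(doc_id):
        return sum(x * y for x, y in zip(q, mean_vector(documents[doc_id])))
    return sorted(documents, key=score, reverse=True)

if __name__ == "__main__":
    docs = {"d1": ["automobile", "engine"], "d2": ["banana", "fruit"]}
    print(expand(["car"]))      # the query plus its nearest word
    print(rank(["car"], docs))  # d1 ranks first although it never contains "car"
```

Clustering of articles is not shown; any standard clustering method applied to the document vectors returned by `mean_vector` would fit the description above.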

While only a Japanese version was available at the time of the visit to NTT's laboratory, an English version has been developed at Stanford University.



Jean-Philippe Vert
Sun Dec 6 11:05:42 MET 1998