Word similarity measurement at NTT CS Laboratory

NTT Communication Science (CS) Laboratory, located near the ancient imperial capital Nara, has created a hierarchy of about 3,000 semantic concepts linked to each other through "has-a" and "is-a" relationships. Although it may appear much smaller than the 400,000-concept EDR dictionary, it aims to be more robust than its older sibling.

This concept hierarchy, called the "knowledge base", serves as the basis of a 3,000-dimensional vector space in which all Japanese words are represented as vectors. The set of all Japanese words is first reduced to a set of 40,000 concept-words through standardization with NTT's thesaurus. The result is therefore a matrix of 3,000 concepts by 40,000 concept-words, in which the coordinates of each concept-word are normalized.
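As an illustration of the normalization step, here is a minimal sketch. The text does not say which norm is used, so Euclidean (unit-length) normalization is assumed; the concept names, the words, the raw weights, and the helper names `normalize` and `build_matrix` are all hypothetical, and the toy space has 4 concepts rather than 3,000.

```python
import math

# Hypothetical toy concept axes standing in for the ~3,000 concepts.
CONCEPTS = ["mammal", "herbivore", "carries_people", "road"]

# Hypothetical raw (unnormalized) weights of each concept-word on each concept.
RAW = {
    "horse":  [3.0, 2.5, 3.0, 1.0],
    "rabbit": [2.0, 2.0, 0.0, 0.0],
    "car":    [0.0, 0.0, 4.0, 4.0],
}

def normalize(vec):
    """Scale a vector to unit Euclidean length (assumed norm); leave a zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def build_matrix(raw):
    """Return the concept-word matrix as a dict: word -> normalized concept vector."""
    return {word: normalize(vec) for word, vec in raw.items()}

MATRIX = build_matrix(RAW)
```

With unit-length columns, the inner product of two concept-word vectors is their cosine, so words with many or few associated concepts are compared on the same scale.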

This construction makes it possible to compute the similarity of two words with respect to a particular viewpoint. With respect to the viewpoint "animal", for instance, the word "horse" is closer to the word "rabbit" than to the word "car", but the order is reversed with respect to the viewpoint "transportation". To take the viewpoint into account in the similarity measure, vector operations are used, e.g. projection of a vector onto a viewpoint and the inner product to measure similarity.
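The horse/rabbit/car example can be reproduced on a toy scale. The sketch below is one possible reading, not NTT's actual method: it assumes that a viewpoint selects a subset of concept axes, that projection keeps only those coordinates, and that similarity is the inner product of the renormalized projections. The concepts, the weights, and the names `project` and `similarity` are hypothetical.

```python
import math

# Hypothetical toy concept axes standing in for the ~3,000 concepts.
CONCEPTS = ["mammal", "herbivore", "carries_people", "road"]

# Hypothetical concept-word vectors (invented weights, for illustration only).
WORDS = {
    "horse":  [0.6, 0.5, 0.6, 0.2],
    "rabbit": [0.7, 0.7, 0.0, 0.0],
    "car":    [0.0, 0.0, 0.7, 0.7],
}

# Assumption: a viewpoint is modeled as the set of concept axes it covers.
VIEWPOINTS = {
    "animal": {"mammal", "herbivore"},
    "transportation": {"carries_people", "road"},
}

def project(vec, viewpoint):
    """Keep only the coordinates on the viewpoint's concept axes; zero the rest."""
    keep = VIEWPOINTS[viewpoint]
    return [x if c in keep else 0.0 for c, x in zip(CONCEPTS, vec)]

def similarity(word1, word2, viewpoint):
    """Inner product of the two projected, renormalized vectors (0.0 if either projection vanishes)."""
    a = project(WORDS[word1], viewpoint)
    b = project(WORDS[word2], viewpoint)
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

for vp in VIEWPOINTS:
    print(vp,
          "horse~rabbit:", round(similarity("horse", "rabbit", vp), 3),
          "horse~car:", round(similarity("horse", "car", vp), 3))
```

Under these assumptions, "horse" scores higher against "rabbit" from the "animal" viewpoint and higher against "car" from the "transportation" viewpoint, because each projection discards the coordinates the viewpoint does not cover.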

Although NTT's approach looks like a purely vector space representation, one should note that the basis of the vector space is composed of concepts that are organized in a semantic graph, which opens new fields of investigation for deeper analysis.

Jean-Philippe Vert
Sun Dec 6 11:05:42 MET 1998