Automatic classification of documents at Tokushima University

The Aoe laboratory at Tokushima University, on the island of Shikoku, has developed a system, soon to be commercialized, for the automatic classification of Japanese text documents stored in a folder. Documents are classified according to their content, in a two-step process:

retrieval of keywords in the documents;
classification of documents using a hierarchy of concepts.

Keywords are retrieved from a document by counting the absolute and relative frequencies of a series of 40,000 character bigrams and extracting those that best characterize the document considered. This step can thus be regarded as typical of vector space representations.
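The bigram-counting step can be illustrated with a minimal sketch. It is not the laboratory's implementation: the scoring formula (absolute count multiplied by the ratio of the document's relative frequency to that of a background collection), the add-one smoothing, and the names `char_bigrams` and `top_bigrams` are all assumptions made for illustration, and the fixed list of 40,000 bigrams is not modelled.

```python
from collections import Counter


def char_bigrams(text):
    """Return the overlapping character bigrams of `text`, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]


def top_bigrams(document, background, k=3):
    """Return the k bigrams that best characterize `document` (hypothetical score).

    Assumed score: absolute count in the document, multiplied by the ratio of
    the bigram's relative frequency in the document to its relative frequency
    in a background collection.
    """
    doc_counts = Counter(char_bigrams(document))
    bg_counts = Counter(char_bigrams(background))
    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())

    scores = {}
    for bigram, count in doc_counts.items():
        doc_rel = count / doc_total
        # Add-one smoothing so bigrams unseen in the background do not divide by zero.
        bg_rel = (bg_counts[bigram] + 1) / (bg_total + 1)
        scores[bigram] = count * doc_rel / bg_rel

    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(scores, key=lambda b: (-scores[b], b))[:k]
```

Calling `top_bigrams(text, corpus_text, k=10)` would return the ten highest-scoring bigrams of `text`; the resulting scores over the whole bigram list play the role of the document's vector in a vector space representation.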

The second step, however, uses a semantic hierarchy on keywords to obtain a hierarchical classification of the set of documents itself. This step is therefore typical of structured concept representations.
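How a keyword hierarchy can induce a hierarchical classification may be sketched as follows. The source does not describe the actual hierarchy or the decision rule, so everything here is hypothetical: the toy `PARENT` table, the function names, and the rule of following the majority of keywords at each level and stopping when there is a tie.

```python
from collections import Counter

# Hypothetical toy hierarchy: each keyword or concept is mapped to its parent.
PARENT = {
    "goal": "football",
    "penalty": "football",
    "racket": "tennis",
    "football": "sport",
    "tennis": "sport",
    "stock": "finance",
    "sport": "root",
    "finance": "root",
}


def path_to_root(node):
    """Return the concepts above `node`, ordered from most general to most specific."""
    path = []
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path[::-1]


def classify(keywords):
    """Return the concept path assigned to a document described by `keywords`.

    Assumed rule: at each level, follow the concept supported by the most
    keywords; stop when no keyword goes deeper or when the top two tie.
    """
    paths = [path_to_root(k) for k in keywords if k in PARENT]
    result = []
    depth = 0
    while True:
        candidates = [p[depth] for p in paths if len(p) > depth]
        if not candidates:
            return result
        counts = Counter(candidates).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            return result  # ambiguous: keep the document at the current level
        best = counts[0][0]
        result.append(best)
        paths = [p for p in paths if len(p) > depth and p[depth] == best]
        depth += 1
```

Under these assumptions, a document with keywords `["goal", "penalty", "stock"]` is filed under root, sport, football, while one with `["goal", "racket"]` stays at the sport level because its keywords disagree further down. Grouping documents by the returned path yields a classification tree that mirrors the concept hierarchy.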

Combining these two approaches to classify a textual database according to the semantic content of its documents has two advantages: the vector representation provides computationally efficient tools, and the pre-existing hierarchy of keywords contributes a large amount of semantic information.

Jean-Philippe Vert
Sun Dec 6 11:05:42 MET 1998