Using a text database of Wall Street Journal articles, researchers from the Advanced Telecommunications Research Institute (ATR) near Kyoto have obtained a hierarchical classification of the 70,000 most frequently used words. The result is a binary tree whose 70,000 terminal leaves represent the 70,000 words, and in which each internal node represents a class containing the words of its child nodes.
The tree was built automatically, starting from 70,000 isolated leaves and iteratively merging the classes used in similar contexts. Furthermore, every node of the tree, and thus every concept, can be coded by a series of bits.
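
The bottom-up construction and the bit coding can be illustrated with a minimal sketch in Python. It is not the researchers' actual procedure: the four-word vocabulary and its context counts are invented, cosine similarity between context-count vectors stands in for whatever similarity criterion was really used, and the coding scheme (0 for a left branch, 1 for a right branch, read from the root) is an assumption about how the series of bits might be assigned. The names `contexts`, `build_tree`, and `bit_codes` are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: how often each word occurs next to each context word.
contexts = {
    "monday":  {"on": 5, "next": 3},
    "tuesday": {"on": 4, "next": 4},
    "stock":   {"price": 6, "market": 3},
    "bond":    {"price": 5, "market": 4},
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed criterion)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def merge_counts(a, b):
    """Context vector of a merged class: the sum of its children's counts."""
    merged = dict(a)
    for k, v in b.items():
        merged[k] = merged.get(k, 0) + v
    return merged

def build_tree(contexts):
    """Start from isolated leaves and repeatedly merge the two most similar classes.

    A tree is either a word (leaf) or a (left, right) pair (internal node).
    """
    clusters = [(word, dict(vec)) for word, vec in sorted(contexts.items())]
    while len(clusters) > 1:
        # Pick the pair of classes whose context vectors are most similar.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (tree_i, vec_i), (tree_j, vec_j) = clusters[i], clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((tree_i, tree_j), merge_counts(vec_i, vec_j)))
    return clusters[0][0]

def bit_codes(tree, prefix=""):
    """Assign each leaf the bit string of its path from the root (0=left, 1=right)."""
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    codes = bit_codes(left, prefix + "0")
    codes.update(bit_codes(right, prefix + "1"))
    return codes

tree = build_tree(contexts)
codes = bit_codes(tree)
for word, code in sorted(codes.items()):
    print(word, code)
```

Under this scheme an internal node is identified by the bit prefix shared by all leaves below it, so words in the same class have codes that begin with the same bits. On the toy data, the two weekday names share one first bit and the two finance words share the other. The exhaustive pair search is only practical for a handful of words; a vocabulary of 70,000 would need a far more efficient merging strategy.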