Multi-hierarchy documents clustering based on LSA space dimensionality character


The statistical characteristics of the dimensions of latent semantic analysis (LSA) space were studied in order to cluster documents automatically at different concept levels. It is concluded that dimensions corresponding to larger singular values describe what semantic elements have in common, while dimensions corresponding to smaller singular values describe how they differ. There is thus a latent relation between the dimensions of LSA space and the concept granularities of natural language, so different numbers of LSA dimensions are used to cluster documents at a given concept granularity. Experimental results agree well with this idea. In addition, in the LSA-based document clustering algorithm, better clustering precision is obtained by clustering the row vectors of the document self-indexing matrix rather than the low-dimensional document vectors.
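The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the "document self-indexing matrix" is the document-by-document matrix V_k Σ_k² V_kᵀ obtained from a rank-k truncated SVD, which the abstract does not define. The names `lsa_representations`, `cosine_matrix`, and `TERM_DOC`, and the toy term-document counts, are hypothetical and chosen only for illustration.

```python
import numpy as np


def lsa_representations(term_doc, k):
    """Return (doc_vectors, self_index) for a rank-k LSA space.

    term_doc is a terms x documents matrix. The self_index matrix is an
    ASSUMED reading of the "document self-indexing matrix":
    V_k * Sigma_k^2 * V_k^T, i.e. documents indexed by documents.
    """
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    v_k = vt[:k].T                      # documents x k
    doc_vectors = v_k * s[:k]           # low-dimensional document vectors
    self_index = doc_vectors @ doc_vectors.T   # documents x documents
    return doc_vectors, self_index


def cosine_matrix(rows):
    """Pairwise cosine similarity between the rows of a matrix."""
    unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    return unit @ unit.T


# Hypothetical toy data: 7 terms x 4 documents. Documents 0-1 share three
# terms, documents 2-3 share three other terms, and the last term occurs
# in every document.
TERM_DOC = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
], dtype=float)

if __name__ == "__main__":
    for k in (1, 2):
        doc_vectors, self_index = lsa_representations(TERM_DOC, k)
        print("k =", k)
        print(np.round(cosine_matrix(doc_vectors), 3))
        print(np.round(cosine_matrix(self_index), 3))
```

In this toy corpus, keeping only the dimension with the largest singular value makes all four documents point in the same direction, which corresponds to a coarse concept level. Adding the second dimension separates documents 0 and 1 from documents 2 and 3, which corresponds to a finer level. The similarity matrices computed from either representation could then be passed to any standard clustering routine.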

Publication Title

Qinghua Daxue Xuebao/Journal of Tsinghua University

