Pooling word vector representations across models


Vector-based word representation models are typically developed from very large corpora in the hope that the representations are reliable and have wide coverage, i.e., that they ideally cover all words. However, in real-world applications we often encounter words that are not available in a given vector-based model. In this paper, we present a novel Neural Network (NN) based approach for obtaining representations for words that are missing in a target model from another model, called the source model, in which representations for these words are available, effectively pooling together the two vocabularies and their corresponding representations. Our experiments with three different types of pre-trained models (Word2vec, GloVe, and LSA) show that the representations obtained using our transformation approach can substantially and effectively extend the word coverage of existing models. The increase in the number of unique words covered by a model ranges from a few times to several times, depending on which model vocabulary is taken as the reference. The transformed word representations are also well correlated (average correlation up to 0.801 for words in the Simlex-999 dataset) with the native target model representations, indicating that the transformed vectors can effectively be used as substitutes for native word representations. Furthermore, an extrinsic evaluation based on a word-to-word similarity task using the Simlex-999 dataset leads to results close to those obtained using native target model representations.
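The core idea, learning a transformation on the vocabulary shared by both models and then applying it to words available only in the source model, can be sketched as follows. The paper uses a neural network for the mapping; the sketch below substitutes a simple linear least-squares map, a common baseline for aligning embedding spaces, and all vectors here are synthetic placeholders rather than real Word2vec/GloVe/LSA data.

```python
import numpy as np

def fit_mapping(src_vecs, tgt_vecs):
    """Fit a linear map W such that src_vecs @ W approximates tgt_vecs,
    using the words shared by the source and target models."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

def transform(word_vec, W):
    """Project a source-model vector into the target model's space."""
    return word_vec @ W

# Illustration with synthetic embeddings (hypothetical data):
rng = np.random.default_rng(0)
shared_src = rng.normal(size=(1000, 50))   # shared vocabulary, source model
A = rng.normal(size=(50, 50))              # hidden "true" relation between spaces
shared_tgt = shared_src @ A                # target-model vectors for the same words

W = fit_mapping(shared_src, shared_tgt)
missing_src = rng.normal(size=50)          # a word present only in the source model
approx_tgt = transform(missing_src, W)     # its induced target-space representation
```

In practice the shared vocabulary provides the supervised training pairs, and the learned map is then applied to every source-only word, which is what extends the target model's coverage.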

Publication Title

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)