Faculty Publications

Impact of corpus size and dimensionality of LSA spaces from Wikipedia articles on AutoTutor answer evaluation

Zhiqiang Cai, University of Memphis
Arthur C. Graesser, University of Memphis
Leah C. Windsor, University of Memphis
Qinyu Cheng, University of Memphis
David W. Shaffer, University of Wisconsin-Madison
Xiangen Hu, University of Memphis

Abstract

Latent Semantic Analysis (LSA) plays an important role in analyzing text data from educational settings. LSA represents the meaning of words and sets of words as vectors in a k-dimensional space generated from a selected corpus. While the impact of the value of k has been investigated by many researchers, the impact of document selection and corpus size has never been systematically investigated. This paper addresses that gap based on the performance of LSA in evaluating learners’ answers to AutoTutor, a conversational intelligent tutoring system. We report the impact of document source (Wikipedia vs. TASA), selection algorithm (keyword-based vs. random), corpus size (from 2,000 to 30,000 documents), and number of dimensions (from 2 to 1,000). Two AutoTutor tasks are used to evaluate the performance of different LSA spaces: a phrase-level answer assessment (responses to focal prompt questions) and a sentence-level answer assessment (responses to hints). We show that a sufficiently large (e.g., 20,000 to 30,000 documents), randomly selected Wikipedia corpus with high enough dimensionality (about 300) could provide a reasonably good space. A specifically selected domain corpus could perform significantly better with a smaller corpus (about 8,000 documents) and much lower dimensionality (around 17). The widely used TASA corpus (37,651 scientifically sampled documents) performs as well as a large, randomly selected Wikipedia corpus (20,000 to 30,000 documents) with sufficiently high dimensionality (e.g., k ≥ 300).
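To make the idea of a k-dimensional LSA space concrete, here is a minimal sketch of how such a space could be built and used to score a learner's answer against an expected answer. It is not the authors' implementation: it assumes raw term counts (LSA systems usually apply a weighting such as log-entropy first), a plain NumPy SVD, and cosine similarity as the match score. The function names and the toy corpus are hypothetical, and a four-document corpus only supports a very small k.

```python
import numpy as np

# Hypothetical toy corpus; the paper uses thousands of Wikipedia or TASA documents.
TOY_CORPUS = [
    "force equals mass times acceleration",
    "acceleration is the change of velocity over time",
    "mass resists the change of velocity",
    "the net force changes the velocity of an object",
]


def build_lsa_space(docs, k):
    """Return (word -> row index, term-vector matrix) for a k-dimensional space."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}

    # Term-by-document count matrix (assumption: raw counts, no weighting).
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w in doc.lower().split():
            counts[index[w], j] += 1.0

    # Keep the k largest singular values; each row becomes a word vector.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    return index, u[:, :k] * s[:k]


def text_vector(text, index, term_vectors):
    """Represent a set of words as the sum of its known word vectors."""
    vec = np.zeros(term_vectors.shape[1])
    for w in text.lower().split():
        if w in index:
            vec += term_vectors[index[w]]
    return vec


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


if __name__ == "__main__":
    index, term_vectors = build_lsa_space(TOY_CORPUS, k=2)
    expected = text_vector("force equals mass times acceleration", index, term_vectors)
    answer = text_vector("mass and acceleration give the force", index, term_vectors)
    print("similarity:", cosine(expected, answer))
```

In this framing, the corpus size and the value of k studied in the paper correspond to the `docs` and `k` arguments of `build_lsa_space`; changing either changes the word vectors and therefore the similarity score assigned to an answer.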

Publication Title

Proceedings of the 11th International Conference on Educational Data Mining, EDM 2018

Recommended Citation

Cai, Z., Graesser, A., Windsor, L., Cheng, Q., Shaffer, D., & Hu, X. (2018). Impact of corpus size and dimensionality of LSA spaces from Wikipedia articles on AutoTutor answer evaluation. Proceedings of the 11th International Conference on Educational Data Mining, EDM 2018. Retrieved from https://digitalcommons.memphis.edu/facpubs/8039

