Faculty Publications

On paraphrase identification corpora

Abstract

In this paper, we analyze a number of data sets proposed over the last decade or so for the task of paraphrase identification. The goal of the analysis is to identify the advantages as well as the shortcomings of these data sets. Based on the analysis, we then make recommendations on how to improve the process of creating and using such data sets to evaluate future approaches to paraphrase identification or to the more general task of semantic similarity. The recommendations are meant to improve our understanding of what a paraphrase is, offer a fairer ground for comparing approaches, increase the diversity of actual linguistic phenomena that future data sets will cover, and offer ways to improve our understanding of the contributions of various modules or approaches proposed for solving paraphrase identification or similar tasks. We also developed a data collection tool, called Data Collector, that proactively targets the collection of paraphrase instances covering linguistic phenomena important to paraphrasing.

Publication Title

Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014

Recommended Citation

Rus, V., Banjade, R., & Lintean, M. (2014). On paraphrase identification corpora. Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014, 2422-2429. Retrieved from https://digitalcommons.memphis.edu/facpubs/3037
