Description
------------

This folder contains two datasets of parallel sentences in English and
Spanish. The sentences are drawn from three domains in equal proportions:
Computer Science, Science, and Sports.

The sets were obtained in a semiautomatic way. We started from
automatically gathered parallel corpora and selected as candidates the
sentences with more than four tokens that begin with a letter. We estimated
their perplexity with respect to a language model built from Europarl in
order to select the most fluent sentences. Then, the parallel sentences
were ranked according to their similarity and perplexity. The top-n
fragments were manually revised and extracted to build these Wikipedia sets.

Contents
---------

README.txt      - This file
wk.test.tok.en  - Set1 (test) 1500 tokenised sentences in English
wk.test.tok.es  - Set1 (test) the corresponding parallel sentences in Spanish
wk.dev.tok.en   - Set2 (dev) 900 tokenised sentences in English
wk.dev.tok.es   - Set2 (dev) the corresponding parallel sentences in Spanish

Citation
---------

Please cite the following paper if you use this corpus in your work:

A Factory of Comparable Corpora from Wikipedia
Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba and Lluís Màrquez
In Proceedings of the 8th Workshop on Building and Using Comparable Corpora
(BUCC 2015), pages 3-13, July 2015, Beijing, China

@InProceedings{Barronetal:2015,
  author    = {{Barr\'on-Cede{\~n}o}, Alberto and {Espa{\~n}a-Bonet}, Cristina and {Boldoba}, Josu and {M\`arquez}, Llu\'{i}s},
  title     = "{A Factory of Comparable Corpora from Wikipedia}",
  booktitle = "{Proceedings of the 8th Workshop on Building and Using Comparable Corpora (BUCC 2015)}",
  pages     = {3--13},
  year      = {2015},
  month     = {July},
  date      = {30},
  address   = {Beijing, China},
  language  = {english}
}
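
The following is a minimal sketch, not part of the original release, of how
the parallel files could be loaded and how the candidate filter mentioned in
the description (more than four tokens, beginning with a letter) might look.
The function names read_parallel and is_candidate are hypothetical, and the
sketch assumes that the files are UTF-8 encoded, hold one sentence per line,
and that tokens are separated by whitespace.

```python
def is_candidate(sentence):
    """Sketch of the candidate filter: more than four tokens, starts with a letter.

    Assumption: the sentence is already tokenised, with whitespace between tokens.
    """
    tokens = sentence.split()
    return len(tokens) > 4 and tokens[0][0].isalpha()


def read_lines(path):
    """Read a file as a list of lines without the trailing newline (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(en_path, es_path):
    """Return a list of (English, Spanish) sentence pairs, aligned by line number."""
    english = read_lines(en_path)
    spanish = read_lines(es_path)
    if len(english) != len(spanish):
        # The two sides of a parallel set must have the same number of lines.
        raise ValueError(
            f"line count mismatch: {len(english)} vs {len(spanish)}"
        )
    return list(zip(english, spanish))
```

For instance, read_parallel("wk.test.tok.en", "wk.test.tok.es") pairs each
English sentence of Set1 with the Spanish sentence on the same line, and
raises an error if the two files differ in length.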