Description
------------

This corpus contains 30 Wikipedia article pairs in English and Spanish.
The articles are drawn in equal proportions from three domains: Computer
Science, Science, and Sports. Two volunteers, native speakers of Spanish
with a high command of English, annotated the corpus manually at sentence
level using three classes: parallel, comparable, and other. The mean
inter-annotator agreement, measured with the kappa coefficient, was
κ ≈ 0.7. A third annotator resolved the sentences on which the first two
disagreed.

Contents
---------

README.txt    - This file
annotations/  - Folder with the annotations, one file per article pair,
                named ID_es.es.ann.csv, where ID_es is the ID of the
                Spanish Wikipedia article in the pair.
documents/    - Folder with the articles in raw format, one file per
                article and language. The naming convention is
                ID_es.es.txt and ID_en.en.txt
              - documents.txt file linking the IDs of each pair, one pair
                per line. The length of the articles is also included.

Format
-------

The format of the annotations is as follows:

#line_es  #line_en  class
26        34        translated-real
11        14        comparable-real

Only parallel (translated-real) and comparable (comparable-real)
relations are included.

Citation
---------

Please cite the following paper if you use this corpus in your work:

A Factory of Comparable Corpora from Wikipedia
Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba and Lluís Màrquez
In Proceedings of the 8th Workshop on Building and Using Comparable
Corpora (BUCC 2015), pages 3-13, July 2015, Beijing, China

@InProceedings{Barronetal:2015,
  author    = {{Barr\'on-Cede{\~n}o}, Alberto and {Espa{\~n}a-Bonet}, Cristina and {Boldoba}, Josu and {M\`arquez}, Llu\'{i}s},
  title     = "{A Factory of Comparable Corpora from Wikipedia}",
  booktitle = "{Proceedings of the 8th Workshop on Building and Using Comparable Corpora (BUCC 2015)}",
  pages     = {3--13},
  year      = {2015},
  month     = {July},
  date      = {30},
  address   = {Beijing, China},
  language  = {english}
}
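
For readers who want to load the annotations, the format shown under Format can be read with a few lines of Python. This is only a minimal sketch, not part of the corpus tooling: the names Alignment and parse_annotations are hypothetical, and it assumes that fields are separated by whitespace and/or commas and that lines starting with '#' are headers. Check an actual file in annotations/ and adjust the separator if it differs.

```python
import re
from collections import Counter
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    """One annotated sentence pair (hypothetical helper type)."""
    line_es: int  # line number in the Spanish article
    line_en: int  # line number in the English article
    label: str    # "translated-real" or "comparable-real"


def parse_annotations(lines: Iterable[str]) -> List[Alignment]:
    """Parse annotation lines into Alignment records.

    Assumption: fields are separated by whitespace and/or commas,
    and lines starting with '#' are header lines to be skipped.
    """
    records = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = re.split(r"[\s,]+", line)
        if len(fields) != 3:
            raise ValueError(f"unexpected annotation line: {raw!r}")
        records.append(Alignment(int(fields[0]), int(fields[1]), fields[2]))
    return records


# The two example rows from the Format section.
sample = [
    "#line_es #line_en class",
    "26 34 translated-real",
    "11 14 comparable-real",
]
alignments = parse_annotations(sample)
label_counts = Counter(a.label for a in alignments)
```

To read a real file, pass an open file object instead of the sample list, for example parse_annotations(open(path, encoding="utf-8")), where path points to one of the ID_es.es.ann.csv files.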