Cristina España i Bonet

Resources

This page collects some of the resources gathered or developed for the SECOND VOICE, TACARDI and OPENMT2 projects that can be useful for other researchers.

Kaldi recipe for dysarthric speakers
WikiTailor software, in-domain multilingual comparable and parallel corpora extraction in TACARDI
Wikipedia test corpora (parallelism and comparability)
Embeddings with ~10⁹ words (en/es/de)
Stopword lists
EMT software, hybrid machine translation in OPENMT2

Kaldi recipe for dysarthric speakers

A Kaldi recipe for building an automatic speech recognition (ASR) system for speakers with dysarthria. The recipe works on the Torgo database, and the implemented pipeline uses several models. Find it on GitHub:

https://github.com/cristinae/ASRdys

WikiTailor software, in-domain multilingual comparable and parallel corpora extraction in TACARDI

Software for extracting corpora in any domain and language available in Wikipedia. Currently, it extracts in-domain multilingual comparable corpora of articles in any domain, and it extracts their titles to build a parallel/multilingual corpus. If you want to be a beta tester, ask for the code; it will be publicly available soon!

Wikipedia test corpora (parallelism and comparability)

The comparable corpus contains 30 Wikipedia article pairs in English and Spanish. The articles belong to three domains in equal proportions: Computer Science, Science, and Sports. Documents are annotated manually at sentence level with three possible labels: parallel, comparable, and other.

README

Documents

The parallel corpus contains 2400 sentences extracted from Wikipedia articles in English and Spanish and manually revised. As before, the articles belong to three domains in equal proportions: Computer Science, Science, and Sports.

README

WikiSets

Please cite the following paper if you use these data in your work:

A Factory of Comparable Corpora from Wikipedia

Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba and Lluís Màrquez

Proceedings of the 8th Workshop on Building and Using Comparable Corpora (BUCC), pages 3-13, Beijing, China, July 2015.

[ BibTeX ]

@InProceedings{Barronetal:2015,
       author = {{Barr\'on-Cede{\~n}o}, Alberto and {Espa{\~n}a-Bonet}, Cristina and
                 {Boldoba}, Josu and {M\`arquez}, Llu\'{i}s},
        title = "{A Factory of Comparable Corpora from Wikipedia}",
    booktitle = "{Proceedings of the 8th Workshop on Building and Using Comparable Corpora (BUCC)}",
        pages = {3--13},
         year = {2015},
        month = {July},
         date = {30},
      address = {Beijing, China},
     language = {english},
          url = {http://www.aclweb.org/anthology/W15-3402}
}

Stopword lists

Stopword list compiled for Occitan.

README

SW list

Embeddings with ~10⁹ words (en/es/de)

Embeddings obtained with Word2vec for English (2.3 Mw), Spanish (0.8 Mw) and German (0.7 Mw).

README
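As a rough illustration of how embeddings like these could be inspected, the sketch below reads vectors in the word2vec text format and compares two words by cosine similarity. It assumes the files use that text layout (a header line with vocabulary size and dimension, then one word and its vector per line); the README is the authority on the actual format. The function names and the toy data are hypothetical and are not part of the distributed resources.

```python
import io
import math


def load_word2vec_text(handle):
    """Parse embeddings in the word2vec text format into a dict.

    Assumed layout (not confirmed for these files): a header line
    "<vocab_size> <dim>", then one line per word holding the word
    followed by <dim> space-separated floats.
    """
    _vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines rather than guessing
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy data standing in for a real embeddings file (values are made up).
toy = io.StringIO("3 2\ncat 1.0 0.0\ndog 0.8 0.6\ncar 0.0 1.0\n")
emb = load_word2vec_text(toy)
print(round(cosine(emb["cat"], emb["dog"]), 2))
```

For the full-size files, a dedicated loader such as gensim's `KeyedVectors.load_word2vec_format` is likely a better fit than a plain dictionary, since the vocabularies are large.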

EMT software, hybrid machine translation in OPENMT2

Combination and decoding module for the SMatxinT translation system. Find it on GitHub:

https://github.com/cristinae/EMT

A Hybrid Machine Translation Architecture Guided by Syntax

Gorka Labaka, Cristina España-Bonet, Lluís Màrquez, Kepa Sarasola

Machine Translation Journal, Vol. 28, Issue 2, pages 91-125, October, 2014.

[ BibTeX arXiv ]

@article{labakaetal14,
       author = {Labaka, Gorka and Espa{\~n}a-Bonet, Cristina and M\`arquez, Llu\'is and Sarasola, Kepa},
        title = {A hybrid machine translation architecture guided by syntax},
      journal = {Machine Translation},
          doi = {10.1007/s10590-014-9153-0},
       volume = 28,
        issue = 2,
        pages = {91--125},
         year = {2014},
        month = {October},
         issn = {0922-6567},
          url = {http://dx.doi.org/10.1007/s10590-014-9153-0},
    publisher = {Springer Netherlands}
}
