Resources
This page collects some of the resources gathered or developed for the SECOND VOICE, TACARDI and OPENMT2 projects that may be useful to other researchers.
Kaldi recipe for dysarthric speakers
Kaldi recipe to build an ASR system for speakers with dysarthria. The recipe runs on the Torgo database, and the pipeline uses several models. Find it on GitHub:
https://github.com/cristinae/ASRdys
WikiTailor software, in-domain multilingual comparable and parallel corpora extraction in TACARDI
Software for extracting corpora in any domain and language available in Wikipedia. Currently, it extracts in-domain multilingual comparable corpora of articles in any domain, and also extracts their titles in order to build a parallel/multilingual corpus. If you want to be a beta tester, ask for the code; it will be publicly available soon!
Wikipedia test corpora (parallelism and comparability)
The comparable corpus contains 30 Wikipedia article pairs in English and Spanish. The articles belong to three domains in equal proportions: Computer Science, Science, and Sports. Documents are annotated manually at sentence level with three possible labels: parallel, comparable, and other.
The parallel corpus contains 2400 manually revised sentences extracted from Wikipedia articles in English and Spanish. As before, the articles belong to three domains in equal proportions: Computer Science, Science, and Sports.
Please cite the following paper if you use these data in your work:
Stopword lists
Stopword list compiled for Occitan.
Embeddings with ~10^9 words (en/es/de)
Embeddings obtained with Word2vec for English (2.3 Mw), Spanish (0.8 Mw) and German (0.7 Mw).
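As a rough illustration of how such embeddings might be used, the sketch below parses vectors and compares two words by cosine similarity. It assumes the files follow the plain-text word2vec format (a "count dimension" header line, then one word and its vector per line); this page does not state the distribution format, so check it first. The function names and the toy vectors are hypothetical and serve only as an example.

```python
import io
import math


def load_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vd' per line.

    Assumption: the embeddings use this text format; adapt if they are binary.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy two-dimensional vectors standing in for a real embeddings file.
sample = io.StringIO("3 2\ncat 1.0 0.0\ndog 0.8 0.6\ncar 0.0 1.0\n")
vectors = load_word2vec_text(sample)
similarity = cosine(vectors["cat"], vectors["dog"])
```

With a real file, the `io.StringIO` object would be replaced by an open file handle. For vocabularies of millions of words, a dedicated library is likely a better fit than this pure-Python loop.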
EMT software, hybrid machine translation in OPENMT2
Combination and decoding module for the SMatxinT translation system. Find it on GitHub:
https://github.com/cristinae/EMT