Kaldi recipe for dysarthric speakers

Kaldi recipe to build an ASR system for speakers with dysarthria. The recipe works on the Torgo database, and the implemented pipeline uses several models. Find it on GitHub:

https://github.com/cristinae/ASRdys

WikiTailor software, in-domain multilingual comparable and parallel corpora extraction in TACARDI

Software for extracting corpora in any domain and language available in Wikipedia. Currently, it extracts in-domain multilingual comparable corpora of articles in any domain, and extracts their titles in order to build a parallel/multilingual corpus. If you want to be a beta-tester, ask for the code; it will be publicly available soon!

Wikipedia test corpora (parallelism and comparability)

The comparable corpus contains 30 Wikipedia article pairs in English and Spanish. The articles belong to three domains in equal proportions: Computer Science, Science, and Sports. Documents are annotated manually at sentence level with three possible labels: parallel, comparable, and other.

The parallel corpus contains 2400 manually revised sentences extracted from Wikipedia articles in English and Spanish. As before, the articles belong to three domains in equal proportions: Computer Science, Science, and Sports.

Please cite the following paper if you use these data in your work:

A Factory of Comparable Corpora from Wikipedia
Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba and Lluís Màrquez
Proceedings of the 8th Workshop on Building and Using Comparable Corpora (BUCC), pages 3-13, Beijing, China, July 2015.
[ BibTeX ]

Stopword lists

Stopword list compiled for Occitan.

Embeddings with ~10^9 words (en/es/de)

Embeddings obtained with Word2vec for English (2.3 Mw), Spanish (0.8 Mw) and German (0.7 Mw).

EMT software, hybrid machine translation in OPENMT2

Combination and decoding module for the SMatxinT translation system. Find it on GitHub:

https://github.com/cristinae/EMT

A Hybrid Machine Translation Architecture Guided by Syntax
Gorka Labaka, Cristina España-Bonet, Lluís Màrquez, Kepa Sarasola
Machine Translation Journal, Vol. 28, Issue 2, pages 91-125, October 2014.
[ BibTeX arXiv ]