List of publications and related works
2020
Understanding Translationese in Multi-view Embedding Spaces
Koel Dutta Chowdhury, Cristina España-Bonet and Josef van Genabith
Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020), pages 6056-6062, December 2020.
[ Abstract | PDF | BibTeX | arXiv ]
The term translationese refers to systematic differences between translations and text originally authored in the target language of the translation (in the same genre and style). In this paper, we use departures from isomorphism between embedding-based vector spaces from translations and originally authored data to estimate phylogenetic language family relations induced from single target language translation from multiple source languages. We explore multi-view embedding spaces based on words, part-of-speech, semantic tags, and synsets, to capture lexical, morphological and semantic aspects of translationese and to investigate the impact of topic on the data. Our results show that (i) language family relationships can be inferred from the monolingual embedding data, providing evidence for shining-through (source language interference) translationese effects in the data and (ii) that, perhaps surprisingly, even delexicalised embeddings exhibit significant source language interference, indicating that the lexicalised results are due to possible differences in topic between original and translated texts.
@InProceedings{DuttaEtal:COLING:2020,
author = {Dutta Chowdhury, Koel and Espa{\~n}a-Bonet, Cristina and van Genabith, Josef},
title = "Understanding Translationese in Multi-view Embedding Spaces",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Catalonia (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.532",
pages = "6056--6062"
}
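A note on method: the departure-from-isomorphism measurements above can be illustrated with a small numerical proxy. The sketch below (an illustration under assumptions, not necessarily the paper's exact measure) aligns two row-aligned embedding matrices with orthogonal Procrustes and reports the mean cosine similarity after the best rotation; scores well below 1 signal a larger departure from isomorphism.

import numpy as np

def isomorphism_score(X, Y):
    # X, Y: (n_words, dim) matrices where row i holds the embedding of
    # the same word in each space. Returns the mean cosine similarity
    # after the best orthogonal rotation of X onto Y (1.0 = isomorphic).
    U, _, Vt = np.linalg.svd(X.T @ Y)           # orthogonal Procrustes
    Xr = X @ (U @ Vt)                           # rotate X onto Y
    num = np.sum(Xr * Y, axis=1)
    den = np.linalg.norm(Xr, axis=1) * np.linalg.norm(Y, axis=1)
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
R = np.linalg.qr(rng.normal(size=(50, 50)))[0]            # random rotation
print(isomorphism_score(X, X @ R))                        # ~1.0: isomorphic
print(isomorphism_score(X, rng.normal(size=(1000, 50))))  # ~0.0: unrelated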
Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation
Dana Ruiter, Josef van Genabith and Cristina España-Bonet
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2560-2571, November 2020.
[ Abstract | PDF | BibTeX ]
Self-supervised neural machine translation (SSNMT) jointly learns to identify and select suitable training data from comparable (rather than parallel) corpora and to translate, in a way that the two tasks support each other in a virtuous circle. In this study, we provide an in-depth analysis of the sampling choices the SSNMT model makes during training. We show how, without it having been told to do so, the model self-selects samples of increasing (i) complexity and (ii) task-relevance in combination with (iii) performing a denoising curriculum. We observe that the dynamics of the mutual-supervision signals of both system internal representation types are vital for the extraction and translation performance. We show that in terms of the Gunning-Fog readability index, SSNMT starts extracting and learning from Wikipedia data suitable for high school students and quickly moves towards content suitable for first year undergraduate students.
@InProceedings{ruiterEtAl:EMNLP:2020,
author = {Dana Ruiter and Josef van Genabith and Cristina Espa\~na-Bonet},
title = "{Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation}",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.202",
doi = "10.18653/v1/2020.emnlp-main.202",
pages = "2560--2571"
}
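Since the curriculum analysis above tracks difficulty with the Gunning-Fog readability index, a minimal reference implementation may help. The formula is the standard one (0.4 times the sum of the average sentence length and the percentage of words with three or more syllables); the syllable counter is a rough vowel-group heuristic and an assumption of this sketch.

import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels (assumption).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

# Scores roughly map to years of formal education: 10-12 corresponds to
# high-school level, 13 to first-year undergraduate level.
print(gunning_fog("The cat sat on the mat. It was quite comfortable there."))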
Statistical Machine Translation: Main Components
Cristina España-Bonet
Invited talk at the 1r Congreso Internacional de Procesamiento de Lenguaje Natural para Lenguas Indígenas, Morelia, México, 5th November 2020.
Some Aspects of Linguistic Diversity in Europe and Africa
Cristina España-Bonet
Invited talk at the SPARC International Symposium on Mahatma Gandhi and Linguistic Diversity, 23rd September 2020.
Query or Document Translation for Academic Search — What's the real Difference?
Vivien Petras, Andreas Lüschow, Roland Ramthun, Juliane Stiller, Cristina España-Bonet and Sophie Henning
Experimental IR Meets Multilinguality, Multimodality, and Interaction, 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22-25, 2020. Lecture Notes in Computer Science, Vol. 12260, pages 28-42, Springer.
[ Abstract | PDF | BibTeX ]
We compare query and document translation from and to English, French, German and Spanish for multilingual retrieval in an academic search portal: PubPsych. Both query and document translation improve the retrieval performance of the system with document translation providing better results. We show how performance inversely correlates with the amount of available original language documents. The more documents already available in a language, the fewer improvements can be observed. Retrieval performance with English as a source language does not improve with translation as most documents already contained English-language content in our text collection. The large-scale evaluation study is based on a corpus of more than 1 M metadata documents and 50 real queries in English, French, German and Spanish taken from the query log files of the portal.
@InProceedings{petrasEtAl:CLEF:2020,
author = {Vivien Petras and Andreas L\"uschow and Roland Ramthun and Juliane Stiller and Cristina Espa{\~n}a-Bonet and Sophie Henning},
title = "{Query or Document Translation for Academic Search -- What's the real Difference?}",
booktitle = {Experimental {IR} Meets Multilinguality, Multimodality, and Interaction
- 11th International Conference of the {CLEF} Association, {CLEF}
2020, Thessaloniki, Greece, September 22-25, 2020, Proceedings},
series = {Lecture Notes in Computer Science},
volume = {12260},
pages = {28--42},
publisher = {Springer},
year = {2020},
doi = {10.1007/978-3-030-58219-7\_3},
key = {CLEF 2020},
month = {September},
address = {Thessaloniki, Greece},
}
How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech
Yuri Bizzoni, Tom S Juzek, Cristina España-Bonet, Koel Dutta Chowdhury, Josef van Genabith and Elke Teich
Proceedings of the 17th International Workshop on Spoken Language Translation (IWSLT), pages 280-290, Seattle, WA, United States, July 2020.
[ Abstract | PDF | BibTeX | arXiv ]
Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis we (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs. machine) rather than to the data (written vs. spoken).
@InProceedings{BizzoniEtal:IWSLT:2020,
author = {Bizzoni, Yuri and Juzek, Tom S and Espa{\~n}a-Bonet, Cristina and Dutta Chowdhury, Koel and van Genabith, Josef and Teich, Elke},
title = "How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech",
booktitle = "Proceedings of the 17th International Conference on Spoken Language Translation",
month = jul,
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.iwslt-1.34",
pages = "280--290"
}
Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction
Cristina España-Bonet, Alberto Barrón-Cedeño, Lluís Màrquez
arXiv pre-print 2005.01177, May 2020.
[ Abstract | PDF | BibTeX | arXiv ]
We propose an automatic language-independent graph-based method to build à-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopaedia's category graph and can produce both monolingual and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph-based model outperforms a retrieval-based approach and reaches an average precision of 84% on in-domain articles. As manual evaluations are costly, we introduce the concept of "domainness" and design several automatic metrics to account for the quality of the collections. Our best metric for domainness shows a strong correlation with the human-judged precision, representing a reasonable automatic alternative to assess the quality of domain-specific corpora. We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities. WikiTailor makes obtaining multilingual in-domain data from the Wikipedia easy.
@article{EspanaBonetEtal:2020,
author = {{Espa{\~n}a-Bonet}, Cristina and {Barr\'on-Cede{\~n}o}, Alberto and {M\`arquez}, Llu\'{i}s},
title = "{Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction}",
journal = {arXiv e-prints},
keywords = {Computer Science - Computation and Language, Computer Science - Information Retrieval},
year = 2020,
month = may,
pages = {1--26},
archivePrefix = {arXiv},
eprint = {2005.01177},
primaryClass = {cs.CL}
}
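The core of the graph-based model above is a walk over Wikipedia's category graph starting from a user-defined root category. A minimal sketch of that idea follows; subcategories_of and articles_in are hypothetical lookup helpers standing in for API or dump access, and WikiTailor's actual interface differs.

from collections import deque

def collect_domain_articles(root, subcategories_of, articles_in, max_depth=3):
    # Breadth-first traversal of the (noisy) category graph, collecting
    # the articles filed under every category reached within max_depth.
    seen, articles = {root}, set()
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        articles.update(articles_in(category))
        if depth < max_depth:
            for sub in subcategories_of(category):
                if sub not in seen:
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return articles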
Multilingual and Interlingual Semantic Representations for Natural Language Processing: A Brief Introduction
Marta R. Costa-jussà, Cristina España-Bonet, Pascale Fung and Noah A. Smith
Special Issue of Computational Linguistics: Multilingual and Interlingual Semantic Representations for Natural Language Processing, pages 1-8, March 2020
[ Abstract | PDF | BibTeX ]
We introduce the Computational Linguistics special issue on Multilingual and Interlingual Semantic Representations for Natural Language Processing. We situate the special issue's five articles in the context of our fast-changing field, explaining our motivation for this project. We offer a brief summary of the work in the issue, which includes developments on lexical and sentential semantic representations, from symbolic and neural perspectives.
@article{ruizEtal:2020,
title = "Multilingual and Interlingual Semantic Representations for Natural Language Processing: A Brief Introduction",
author = "Costa-juss{\`a}, Marta and Espa{\~n}a-Bonet, Cristina and Fung, Pascale and Smith, Noah A.",
publisher = {MIT Press},
address = {Cambridge, MA, USA},
journal = {Computational Linguistics},
month = mar,
year = "2020",
doi = "10.1162/COLI_a_00373",
pages = "1--8"
}
Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yorùbá and Twi
Jesujoba O. Alabi, Kwabena Amponsah-Kaakyire, David I. Adelani and Cristina España-Bonet
Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), pages 2754-2762, Marseille, France, May 2020.
[ Abstract | PDF | BibTeX ]
The success of several architectures to learn semantic representations from unannotated text and the availability of these kinds of texts in online multilingual resources such as Wikipedia have facilitated the massive and automatic creation of resources for multiple languages. The evaluation of such resources is usually done for the high-resourced languages, where one has a smorgasbord of tasks and test sets to evaluate on. For low-resourced languages, the evaluation is more difficult and normally ignored, with the hope that the impressive capability of deep learning architectures to learn (multilingual) representations in the high-resourced setting holds in the low-resourced setting too.
In this paper we focus on two African languages, Yorùbá and Twi, and compare the word embeddings obtained in this way, with word embeddings obtained from curated corpora and a language-dependent processing. We analyse the noise in the publicly available corpora, collect high quality and noisy data for the two languages and quantify the improvements that depend not only on the amount of data but on the quality too. We also use different architectures that learn word representations both from surface forms and characters to further exploit all the available information, which proved to be important for these languages. For the evaluation, we manually translate the wordsim-353 word pairs dataset from English into Yorùbá and Twi.
We extend the analysis to contextual word embeddings and evaluate multilingual BERT on a named entity recognition task. For this, we annotate with named entities the Global Voices corpus for Yorùbá. As output of the work, we provide corpora, embeddings and the test suites for both languages.
@inproceedings{alabiEtal:2020:LREC,
title = "Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yor\`ub\'a and Twi",
author = "Jesujoba O. Alabi and Kwabena Amponsah-Kaakyire and David I. Adelani and Cristina Espa{\~n}a-Bonet",
booktitle = "Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020)",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association (ELRA)",
url = "https://www.aclweb.org/anthology/2020.lrec-1.335/",
pages = "2754--2762"
}
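The intrinsic evaluation mentioned above, on wordsim-353 translated into Yorùbá and Twi, boils down to correlating human similarity judgements with embedding cosines. A hedged sketch, assuming a tab-separated pairs file (word1, word2, score) and a word-to-vector dict:

import numpy as np
from scipy.stats import spearmanr

def evaluate_wordsim(pairs_file, vectors):
    human, model = [], []
    with open(pairs_file, encoding="utf-8") as f:
        for line in f:
            w1, w2, score = line.rstrip("\n").split("\t")
            if w1 in vectors and w2 in vectors:   # skip out-of-vocabulary pairs
                v1, v2 = vectors[w1], vectors[w2]
                cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
                human.append(float(score))
                model.append(cos)
    return spearmanr(human, model).correlation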
GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies
Marta R. Costa-jussà, Pau Li Lin and Cristina España-Bonet
Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), pages 4081-4088, Marseille, France, May 2020.
[ Abstract | PDF | BibTeX ]
We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract a corpus balanced in gender. While our toolkit is customizable to any number of languages (and to other domains than biographical entries), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. While GeBioCorpus aims at being one of the first non-synthetic gender-balanced test datasets, GeBioToolkit aims at paving the path to standardize procedures to produce gender-balanced datasets.
@inproceedings{ruizEtal:2020:LREC,
title = "GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies",
author = "Costa-juss{\`a}, Marta and Li Lin, Pau and Espa{\~n}a-Bonet, Cristina",
booktitle = "Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020)",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association (ELRA)",
url = "https://www.aclweb.org/anthology/2020.lrec-1.502/",
doi = "",
pages = "4081--4088"
}
2019
Analysing Coreference in Transformer Outputs
Ekaterina Lapshinova-Koltunski, Cristina España-Bonet and Josef van Genabith
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), pages 1-12, Hong Kong, November 2019.
[ Abstract | PDF | BibTeX ]
We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate (the possibly incorrect) coreference chains in the MT outputs and evaluate the coreference chain translations. We define an error typology that aims to go further than pronoun translation adequacy and includes types such as incorrect word selection or missing words. The features of coreference chains in automatic translations are also compared to those of the source texts and human translations. The analysis shows stronger potential translationese effects in machine translated outputs than in human translations.
@inproceedings{lapshinovaEtal:2019:DiscoMT,
title = "Analysing Coreference in Transformer Outputs",
author = "Lapshinova-Koltunski, Ekaterina and Espa{\~n}a-Bonet, Cristina and van Genabith, Josef",
booktitle = "Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)",
month = nov,
year = "2019",
address = "Hong Kong",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-6501",
doi = "10.18653/v1/D19-6501",
pages = "1--12"
}
Context-Aware Neural Machine Translation Decoding
Eva Martínez Garcia, Carles Creus and Cristina España-Bonet
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), pages 13-23, Hong Kong, November 2019.
[ Abstract | PDF | BibTeX ]
This work presents a decoding architecture that fuses the information from a neural translation model and the context semantics enclosed in a semantic space language model based on word embeddings. The method extends the beam search decoding process and therefore can be applied to any neural machine translation framework. With this, we sidestep two drawbacks of current document-level systems: (i) we do not modify the training process so there is no increment in training time, and (ii) we do not require document-level annotated data. We analyze the impact of the fusion system approach and its parameters on the final translation quality for English-Spanish. We obtain consistent and statistically significant improvements in terms of BLEU and METEOR and we observe how the fused systems are able to handle synonyms to propose more adequate translations as well as help the system to disambiguate among several translation candidates for a word.
@InProceedings{martinezEtAl:DiscoMT:2019,
title = "Context-Aware Neural Machine Translation Decoding",
author = "Mart{\'\i}nez Garcia, Eva and Creus, Carles and Espa{\~n}a-Bonet, Cristina",
booktitle = "Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-6502",
doi = "10.18653/v1/D19-6502",
pages = "13--23"
}
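The fusion above interpolates, at every beam-search step, the neural model's score for a candidate word with a semantic score derived from a word-embedding space built over the context. A toy version of that scoring rule; the weighting scheme and names are assumptions, not the paper's exact formulation.

import numpy as np

def fused_score(nmt_logprob, word_vec, context_vec, lam=0.3):
    # Shift cosine from [-1, 1] to (0, 1] so its log is defined, then
    # interpolate with the translation model's log-probability.
    cos = word_vec @ context_vec / (
        np.linalg.norm(word_vec) * np.linalg.norm(context_vec) + 1e-9)
    return nmt_logprob + lam * np.log((cos + 1.0) / 2.0 + 1e-9)

# During decoding, each beam hypothesis extended with word w would score
#   score(hyp + w) = score(hyp) + fused_score(logp(w | hyp), E[w], context)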
Self-Supervised Neural Machine Translation
Dana Ruiter, Cristina España-Bonet and Josef van Genabith
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers, pages 1828-1834, Florence, Italy, August 2019.
[ Abstract | PDF | BibTeX ]
We present a simple new method where an emergent NMT system is used for simultaneously selecting training data and learning internal NMT representations. This is done in a self-supervised way without parallel data, in such a way that both tasks enhance each other during training. The method is language independent, introduces no additional hyper-parameters, and achieves BLEU scores of 29.21 (en2fr) and 27.36 (fr2en) on newstest2014 using English and French Wikipedia data for training.
@InProceedings{ruiterEtAl:ACL:2019,
author = {Dana Ruiter and Cristina Espa\~na-Bonet and Josef van Genabith},
title = "{Self-Supervised Neural Machine Translation}",
booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers},
key = {ACL 2019},
pages = {1828--1834},
year = {2019},
month = {August},
address = {Florence, Italy},
publisher = {Association for Computational Linguistics}
}
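The data-selection half of SSNMT scores candidate sentence pairs from comparable documents with the similarity of the system's own internal representations. A simplified stand-in for that step is sketched below, keeping only mutual nearest neighbours above a threshold; the paper's agreement between two internal representation types is reduced here to a single similarity matrix.

import numpy as np

def select_pairs(src_vecs, tgt_vecs, threshold=0.5):
    # src_vecs, tgt_vecs: L2-normalised (n, dim) and (m, dim) matrices of
    # sentence representations from the emergent NMT system's encoder.
    sim = src_vecs @ tgt_vecs.T
    best_tgt = sim.argmax(axis=1)   # best target for each source sentence
    best_src = sim.argmax(axis=0)   # best source for each target sentence
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sim[i, j] > threshold:  # mutual match
            pairs.append((i, int(j), float(sim[i, j])))
    return pairs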
UdS-DFKI Participation at WMT 2019: Low-Resource (en-gu) and Coreference-Aware (en-de) Systems
Cristina España-Bonet, Dana Ruiter and Josef van Genabith
Proceedings of the Fourth Conference on Machine Translation, pages 382-389, Florence, Italy, August 2019.
[ Abstract | PDF | BibTeX ]
This paper describes the UdS-DFKI submission to the WMT2019 news translation task for Gujarati-English (low-resourced pair) and German-English (document-level evaluation). Our systems rely on the on-line extraction of parallel sentences from comparable corpora for the first scenario and on the inclusion of coreference-related information in the training data in the second one.
@InProceedings{espanaEtAl:WMT:2019,
author = {Cristina Espa\~na-Bonet and Dana Ruiter and Josef van Genabith},
title = "{UdS-DFKI Participation at WMT 2019: Low-Resource ($en$--$gu$) and Coreference-Aware ($en$--$de$) Systems}",
booktitle = {Proceedings of the Fourth Conference on Machine Translation},
key = {WMT 2019},
pages = {382--389},
year = {2019},
month = {August},
address = {Florence, Italy},
publisher = {Association for Computational Linguistics}
}
2018
Neural Machine Translation is like a Pig
Cristina España-Bonet
Invited talk at the Deep Learning BCN Symposium, Barcelona, Catalunya, 20th December 2018.
[ Abstract | Slides ]
Neural machine translation systems (NMT) are state-of-the-art for most language pairs, especially for those with a large amount of parallel data available. These systems are expensive to train both in time and resources, but as with a pig, all of their parts can be (re)used afterwards. In this talk I will sketch how and why multilingual word and sentence embeddings obtained from an NMT system can be used for other purposes such as assessing semantic cross-lingual similarities, parallel sentence extraction or cross-lingual information retrieval. Under this perspective, NMT can be seen as an auxiliary task --multilingual by definition-- to obtain multilingual representations in the same way the skip-gram and CBOW tasks were defined to obtain monolingual word embeddings. Following this analogy, I will compare differences between seq2seq and transformer architectures as two variants for the same goal.
Query Translation for Cross-lingual Search in the Academic Search Engine PubPsych (BEST PAPER AWARD)
Cristina España-Bonet, Juliane Stiller, Roland Ramthun, Josef van Genabith and Vivien Petras
Proceedings of the Metadata and Semantics Research, 12th International Research Conference (MTSR 2018), Limassol, Cyprus, October 2018.
CCIS Vol. 846 Communications in Computer and Information Science (CCIS) book series, Springer
[ Abstract | PDF | BibTeX ]
We describe a lexical resource-based process for query translation of a domain-specific and multilingual academic search engine in psychology, PubPsych. PubPsych queries are diverse in language with a high amount of informational queries and technical terminology. We present an approach for translating queries into English, German, French, and Spanish. We build a quadrilingual lexicon with aligned terms in the four languages using MeSH, Wikipedia and Apertium as our main resources. Our results show that using the quadlexicon together with some simple translation rules, we can automatically translate 85% of translatable tokens in PubPsych queries with mean adequacy over all the translatable text of 1.4 when measured on a 3-point scale [0,1,2].
@InProceedings{espanaBonetEtAl:MTSR:2018,
author = {Cristina Espa{\~n}a-Bonet and Juliane Stiller and Roland Ramthun and Josef van Genabith and Vivien Petras},
title = "{Query Translation for Cross-lingual Search in the Academic Search Engine PubPsych}",
editor="Garoufallou, Emmanouel and Sartori, Fabio and Siatri, Rania and Zervas, Marios",
booktitle="Metadata and Semantic Research",
year="2019",
publisher="Springer International Publishing",
address="Cham",
pages="37--49",
isbn="978-3-030-14401-2"
doi="10.1007/978-3-030-14401-2_4"
}
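At its core, the approach above translates queries token by token against the quadrilingual lexicon, falling back to the source token when no entry exists. A toy illustration; the lexicon layout (source word -> {language: translation}) is an assumption, and the real system adds translation rules on top.

def translate_query(query, lexicon, target_lang):
    out = []
    for token in query.lower().split():
        entry = lexicon.get(token, {})
        out.append(entry.get(target_lang, token))   # keep untranslatable tokens
    return " ".join(out)

lexicon = {"depresion": {"en": "depression", "de": "Depression"}}
print(translate_query("depresion infantil", lexicon, "en"))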
Neural Machine Translation with Context & Document Information
Cristina España-Bonet
Invited talk at the First International Workshop on Discourse Processing, Guangdong University of Foreign Studies, Guangzhou, China, 23rd October 2018.
The role of Artificial Intelligence within Natural Language
Cristina España-Bonet
Talk at the Multilingual Public Services in Europe Workshop, EC, Brussels, Belgium, 17th October 2018.
Multilingual Semantic Networks for Data-driven Interlingua Seq2Seq Systems
Cristina España-Bonet and Josef van Genabith
Proceedings of the LREC 2018 MLP-MomenT Workshop (MLP-Moment 2018), pages 8-13, Miyazaki, Japan, May 2018.
[ Abstract | PDF | Slides | BibTeX ]
Neural machine translation systems are state-of-the-art for most language pairs despite the fact that they are relatively recent and that because of this there is likely room for even further improvements. Here, we explore whether, and if so, to what extent, semantic networks can help improve NMT. In particular, we (i) study the contribution of the nodes of the semantic network, synsets, as factors in multilingual neural translation engines. We show that they improve a state-of-the-art baseline and that they facilitate the translation from languages that have not been seen at all in training (beyond zero-shot translation). Taking this idea to an extreme, we (ii) use synsets as the basic unit to encode the input and turn the source language into a data-driven interlingual language. This transformation boosts the performance of the neural system for unseen languages achieving an improvement of 4.9/6.3 and 8.2/8.7 points of BLEU/METEOR for fr2en and es2en respectively when no corpora in fr or es have been used. In (i), the enhancement comes about because cross-language synsets help to cluster words by semantics irrespective of their language and to map the unknown words of a new language into the multilingual clusters. In (ii), because with the data-driven interlingua there is no unknown language if it is covered by the semantic network. However, non-content words are not represented in the semantic network, and a higher level of abstraction is still needed in order to go a step further and train these systems with only monolingual corpora for example.
@InProceedings{espanaVanGenabith:LREC:2018,
author = {Cristina Espa\~na-Bonet and Josef van Genabith},
title = "{Multilingual Semantic Networks for Data-driven Interlingua Seq2Seq Systems}",
booktitle = {Proceedings of the LREC 2018 MLP-MomenT Workshop},
key = {MLP-MomenT 2018},
pages = {8--13},
year = {2018},
month = {May},
Address = {Miyazaki, Japan}
}
2017
Going beyond zero-shot MT: combining phonological, morphological and semantic factors. The UdS-DFKI System at IWSLT 2017
Cristina España-Bonet and Josef van Genabith
Proceedings of the 14th International Workshop on Spoken Language Translation (IWSLT), pages 15-22, Tokyo, Japan, December 2017.
[ Abstract | PDF | Poster | BibTeX ]
This paper describes the UdS-DFKI participation to the multilingual task of the IWSLT Evaluation 2017. Our approach is based on factored multilingual neural translation systems following the small data and zero-shot training conditions. Our systems are designed to fully exploit multilinguality by including factors that increase the number of common elements among languages such as phonetic coarse encodings and synsets, besides shallow part-of-speech tags, stems and lemmas. Document level information is also considered by including the topic of every document. This approach improves a baseline without any additional factor for all the language pairs and even allows beyond-zero-shot translation. That is, the translation from unseen languages is possible thanks to the common elements —especially synsets in our models— among languages.
@InProceedings{espanaVanGenabith:IWSLT:2017,
author = {Cristina Espa\~na-Bonet and Josef van Genabith},
title = "{Going beyond zero-shot MT: combining phonological, morphological and semantic factors. The UdS-DFKI System at IWSLT 2017}",
booktitle = {Proceedings of the 14th International Workshop on Spoken Language Translation (IWSLT)},
key = {IWSLT 2017},
pages = {15--22},
year = {2017},
month = {December},
Address = {Tokyo, Japan}
}
Multilingual Natural Language Processing
Cristina España-Bonet
Talk at RICOH Institute of ICT, Tokyo, Japan, 11th December 2017.
An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification
Cristina España-Bonet, Ádám Csaba Varga, Alberto Barrón-Cedeño and Josef van Genabith
IEEE Journal of Selected Topics in Signal Processing, volume 11, number 8, pages 1340-1350, IEEE, December 2017.
[ Abstract | PDF | BibTeX | HTML ]
End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with a large amount of parallel data available. Besides this palpable improvement, neural networks embrace several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words -or sentences- which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present work is two-fold. First, we systematically study the context vectors, i.e. output of the encoder, and their prowess as an interlingua representation of a sentence. Their quality and effectiveness are assessed by similarity measures across translations, semantically related, and semantically unrelated sentence pairs. Second, and as extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1=98.2% on data from a shared task when using only context vectors. F1 reaches 98.9% when complementary similarity measures are used.
@article{espana-bonetElAl:2017,
author = {Cristina Espa{\~{n}}a{-}Bonet and
{\'{A}}d{\'{a}}m Csaba Varga and
Alberto Barr{\'{o}}n{-}Cede{\~{n}}o and
Josef van Genabith},
title = {An Empirical Analysis of NMT-Derived Interlingual Embeddings and their
Use in Parallel Sentence Identification},
journal = {IEEE Journal of Selected Topics in Signal Processing},
volume = {11},
number = {8},
month = {December},
pages = {1340--1350},
year = {2017},
doi = {10.1109/JSTSP.2017.2764273}
}
Learning Bilingual Projections of Embeddings for Vocabulary Expansion in Machine Translation
Pranava Swaroop Madhyastha and Cristina España-Bonet
Proceedings of the 2nd Workshop on Representation Learning for NLP (ACL Workshop RepL4NLP-2017), pages 139-145, Vancouver, Canada, August 2017.
[ Abstract | PDF | Poster | BibTeX ]
We propose a simple log-bilinear softmax-based model to deal with vocabulary expansion in machine translation. Our model uses word embeddings trained on significantly large unlabelled monolingual corpora and learns over a fairly small, word-to-word bilingual dictionary. Given an out-of-vocabulary source word, the model generates a probabilistic list of possible translations in the target language using the trained bilingual embeddings. We integrate these translation options into a standard phrase-based statistical machine translation system and obtain consistent improvements in translation quality on the English–Spanish language pair. When tested over an out-of-domain test set, we get a significant improvement of 3.9 BLEU points.
@inProceedings{MadhyasthaEspana:2017,
author = {Pranava Swaroop Madhyastha and Cristina Espa{\~{n}}a{-}Bonet},
title = {Learning Bilingual Projections of Embeddings for Vocabulary Expansion in Machine Translation},
booktitle = {Proceedings of the 2nd Workshop on Representation Learning for NLP. ACL Workshop on Representation Learning for NLP (RepL4NLP-2017)},
pages = {139--145},
year = {2017},
month = {August},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
language = {english},
}
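The model above scores target-language candidates for an out-of-vocabulary source word through a bilinear form over pretrained monolingual embeddings. In the sketch below the projection is fitted with regularised least squares over the seed dictionary for brevity; the paper trains a log-bilinear softmax model, so treat this as an approximation.

import numpy as np

def fit_projection(src_mat, tgt_mat, reg=1e-3):
    # Least-squares W such that src_mat @ W ~ tgt_mat, where row k of each
    # matrix is one entry of the small word-to-word seed dictionary.
    d = src_mat.shape[1]
    return np.linalg.solve(src_mat.T @ src_mat + reg * np.eye(d),
                           src_mat.T @ tgt_mat)

def translation_distribution(oov_vec, W, tgt_vocab_mat):
    # Softmax over the full target vocabulary for one OOV source word.
    scores = tgt_vocab_mat @ (oov_vec @ W)
    scores -= scores.max()                     # numerical stability
    p = np.exp(scores)
    return p / p.sum()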
Lump at SemEval-2017 Task 1: Towards an Interlingua Semantic Similarity
Cristina España-Bonet and Alberto Barrón-Cedeño
Proceedings of the 11th International Workshop on Semantic Evaluation (ACL Workshop SemEval-2017), pages 144-149, Vancouver, Canada, August 2017.
[ Abstract | PDF | BibTeX | arXiv ]
This is the Lump team participation at SemEval 2017 Task 1 on Semantic Textual Similarity. Our supervised model relies on features which are multilingual or interlingual in nature. We include lexical similarities, cross-language explicit semantic analysis, internal representations of multilingual neural networks and interlingual word embeddings. Our representations allow to use large datasets in language pairs with many instances to better classify instances in smaller language pairs avoiding the necessity of translating into a single language. Hence we can deal with all the languages in the task: Arabic, English, Spanish, and Turkish.
@InProceedings{EspanaBarron:2017,
author = {{Espa{\~n}a-Bonet}, Cristina and {Barr\'on-Cede{\~n}o}, Alberto},
title = "{Lump at SemEval-2017 Task 1: Towards an Interlingua Semantic Similarity}",
booktitle = "{Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)}",
pages = {144--149},
year = {2017},
month = {August},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
language = {english},
url = {http://www.aclweb.org/anthology/S17-2019}
}
Using Word Embeddings to Enforce Document-Level Lexical Consistency in Machine Translation
Eva Martínez Garcia, Carles Creus, Cristina España-Bonet, Lluís Màrquez
The 20th Annual Conference of the European Association for Machine Translation, Prague, Czech Republic. The Prague Bulletin of Mathematical Linguistics, Vol. 108, pages 85-96, June 2017.
[ Abstract | PDF | BibTeX | arXiv ]
We integrate new mechanisms in a document-level machine translation decoder to improve the lexical consistency of document translations. First, we develop a document-level feature designed to score the lexical consistency of a translation. This feature, which applies to words that have been translated into different forms within the document, uses word embeddings to measure the adequacy of each word translation given its context. Second, we extend the decoder with a new stochastic mechanism that, at translation time, allows to introduce changes in the translation oriented to improve its lexical consistency. We evaluate our system on English-Spanish document translation, and we conduct automatic and manual assessments of its quality. The automatic evaluation metrics, applied mainly at sentence level, do not reflect significant variations. On the contrary, the manual evaluation shows that the system dealing with lexical consistency is preferred over both a standard sentence-level and a standard document-level phrase-based MT systems.
@Article{eamt_martinezetal:2017,
author = {{Mart\'inez}, Eva and {Creus}, Carles and {Espa{\~n}a-Bonet}, Cristina and {M\`arquez}, Llu\'{i}s},
title = {Using Word Embeddings to Enforce Document-Level Lexical Consistency in Machine Translation},
journal = {The 20th Annual Conference of the European Association for Machine Translation.
The Prague Bulletin of Mathematical Linguistics},
pages = {85--96},
volume = {108},
year = {2017},
month = {June},
language = {english}
}
2016
Automatic Speech Recognition with Deep Neural Networks for Impaired Speech
Cristina España-Bonet and José A. R. Fonollosa
Chapter in Advances in Speech and Language Technologies for Iberian Languages, part of the series Lecture Notes in Artificial Intelligence. In A. Abad et al. (Eds.). IberSPEECH 2016, LNAI 10077, Chapter 10, pages 97-107, October 2016.
[ Abstract | PDF | BibTeX | arXiv ]
Automatic Speech Recognition has reached almost human performance in some controlled scenarios. However, recognition of impaired speech is a difficult task for two main reasons: data is (i) scarce and (ii) heterogeneous. In this work we train different architectures on a database of dysarthric speech. A comparison between architectures shows that, even with a small database, hybrid DNN-HMM models outperform classical GMM-HMM according to word error rate measures. A DNN is able to improve the recognition word error rate by 13% for subjects with dysarthria with respect to the best classical architecture. This improvement is higher than the one given by other deep neural networks such as CNNs, TDNNs and LSTMs. All the experiments have been done with the Kaldi toolkit for speech recognition for which we have adapted several recipes to deal with dysarthric speech and work on the TORGO database. These recipes are publicly available.
@inBook{EspanaFonollosa:2016,
author = {Espa\~{n}a-Bonet, Cristina and Fonollosa, Jos\'{e} A. R.},
title = {Automatic Speech Recognition with Deep Neural Networks for Impaired Speech},
booktitle = {Advances in Speech and Language Technologies for Iberian Languages},
series = {Lecture Notes in Artificial Intelligence},
month = {October},
year = {2016},
publisher = {Springer International Publishing AG},
editor = {Abad, A. and Ortega, A. and Teixeira, A.J.d.S. and Garcia Mateo, C. and Mart\'{i}nez Hinarejos, C.D.
and Perdig\~{a}o, F. and Batista, F. and Mamede, N.},
pages = {97--107},
chapter = 10,
isbn = {978-3-319-49169-1},
doi = {10.1007/978-3-319-49169-1\_10},
url = {http://www.springer.com/us/book/9783319491684}
}
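The comparison above is reported in word error rate. For completeness, the standard definition: the word-level Levenshtein distance between hypothesis and reference, divided by the reference length.

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # Levenshtein distance over words by dynamic programming.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 = 0.33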
The TALP-UPC Spanish-English WMT Biomedical Task: Bilingual Embeddings and Char-based Neural Language Model Rescoring in a Phrase-based System
Marta Ruiz Costa-jussà, Cristina España-Bonet, Pranava Madhyastha, Carlos Escolano and José A. R. Fonollosa
Proceedings of the First Conference on Machine Translation (WMT 2016), pages 463-468, Berlin, Germany, August 2016.
[ Abstract | PDF | BibTeX | arXiv ]
This paper describes the TALP-UPC system in the Spanish-English WMT 2016 biomedical shared task. Our system is a standard phrase-based system enhanced with vocabulary expansion using bilingual word embeddings and a character-based neural language model with rescoring. The former focuses on resolving out-of-vocabulary words, while the latter enhances the fluency of the system. The two modules progressively improve the final translation as measured by a combination of several lexical metrics.
@InProceedings{costajussaEtal:WMT:2016,
author = {Costa-juss\`{a}, Marta R. and Espa\~{n}a-Bonet, Cristina and Madhyastha, Pranava and
Escolano, Carlos and Fonollosa, Jos\'{e} A. R.},
title = {The TALP--UPC Spanish--English WMT Biomedical Task: Bilingual Embeddings and
Char-based Neural Language Model Rescoring in a Phrase-based System},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
year = {2016},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {463--468},
url = {http://www.aclweb.org/anthology/W/W16/W16-2336}
}
Resolving Out-of-Vocabulary Words with Bilingual Embeddings in Machine Translation
Pranava Madhyastha and Cristina España-Bonet
CoRR abs/1608.01910, August 2016.
[ Abstract | PDF | BibTeX | arXiv ]
Out-of-vocabulary words account for a large proportion of errors in machine translation systems, especially when the system is used on a different domain than the one where it was trained. In order to alleviate the problem, we propose to use a log-bilinear softmax-based model for vocabulary expansion, such that given an out-of-vocabulary source word, the model generates a probabilistic list of possible translations in the target language. Our model uses only word embeddings trained on significantly large unlabelled monolingual corpora and trains over a fairly small, word-to-word bilingual dictionary. We input this probabilistic list into a standard phrase-based statistical machine translation system and obtain consistent improvements in translation quality on the English-Spanish language pair. Especially, we get an improvement of 3.9 BLEU points when tested over an out-of-domain test set.
@article{MadhyasthaEspana:2016,
author = {Pranava Swaroop Madhyastha and Cristina Espa{\~{n}}a{-}Bonet},
title = {Resolving Out-of-Vocabulary Words with Bilingual Embeddings in Machine Translation},
journal = {CoRR},
volume = {abs/1608.01910},
year = {2016},
url = {http://arxiv.org/abs/1608.01910}
}
Hybrid Machine Translation Overview
Cristina España-Bonet, Marta Ruiz Costa-jussà
Chapter in Hybrid Approaches to Machine Translation, part of the series Theory and Applications of Natural Language Processing, pages 1-24, Springer, 2016.
[ Abstract | PDF | BibTeX ]
This survey chapter provides an overview of the recent research in hybrid Machine Translation (MT). The main MT paradigms are sketched and their integration at different levels of depth is described starting with system combination techniques and followed by integration strategies led by rule-based and statistical systems. System combination does not involve any hybrid architecture since it combines translation outputs. It can be done with different granularities that include sentence, sub-sentential and graph-levels. When considering a deeper integration, architectures guided by the rule-based approach introduce statistics to enrich resources, modules or the backbone of the system. Architectures guided by the statistical approach include rules in pre-/post-processing or at an inner level which means including rules or dictionaries in the core system. This chapter overviewing hybrid MT puts in context, introduces, and motivates the subsequent chapters that constitute this book.
@Inbook{EspanaBonetEtal:2016,
author={Espa{\~{n}}a-Bonet, Cristina and Costa-juss{\`a}, Marta R.},
editor={Costa-juss{\`a}, Marta R. and Rapp, Reinhard and Lambert, Patrik
and Eberle, Kurt and Banchs, Rafael E. and Babych, Bogdan},
title="{Hybrid Machine Translation Overview}",
bookTitle="{Hybrid Approaches to Machine Translation}",
year={2016},
publisher={Springer International Publishing},
pages={1--24},
isbn={978-3-319-21311-8},
doi={10.1007/978-3-319-21311-8_1},
url={http://dx.doi.org/10.1007/978-3-319-21311-8_1}
}
TweetMT: A Parallel Microblog Corpus
Iñaki San Vicente, Iñaki Alegria, Cristina España-Bonet, Pablo Gamallo, Hugo Gonçalo Oliveira, Eva Martínez Garcia, Antonio Toral, Arkaitz Zubiaga and Nora Aranberri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 2936-2941, Portoroz, Slovenia, May 2016.
[ Abstract | PDF | BibTeX | arXiv ]
We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus, and present the results of the shared task in which it was tested.
@InProceedings{LRECSanVicente:2016,
author = {{San Vicente}, I\~naki and {Alegr\'ia}, I{\~n}aki and {Espa{\~n}a-Bonet}, Cristina and {Gamallo}, Pablo and
{Gon\c{c}alo Oliveira}, Hugo and {Mart\'inez Garc\'ia}, Eva and
{Toral}, Antonio and {Zubiaga}, Arkaitz and {Aranberri}, Nora},
title = {TweetMT: A Parallel Microblog Corpus},
booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
pages = {2936--2941},
year = {2016},
month = {may},
date = {23--28},
location = {Portoroz, Slovenia},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-9-1},
language = {english}
}
Resolving Out-of-Vocabulary Words with Bilingual Word Embeddings in Machine Translation
Cristina España-Bonet
Invited talk at Saarland University, DFKI, Saarbrücken, April 29th, 2016.
[ Abstract | Slides ]
Data-driven machine translation systems are able to translate words that have been seen in the training parallel corpora. However, translating out-of-vocabulary words (OOV) is still a major challenge, even for the best performing systems. In this talk, I will show a method that takes advantage of distributional semantic representations of words —previously estimated on large monolingual corpora—, to obtain a probabilistic distribution of translation options for a given OOV. The monolingual embeddings are projected into a bilingual low-dimensional space by learning a log-linear model over a small parallel dictionary. Within the translation setting, the probabilistic distribution interacts with other components (e.g., a language model), which allows for selecting the best translation option among all the possibilities, even if a word has not been seen in the parallel corpus. Our model achieves significant improvements in terms of translation quality, especially for out-of-domain data, in which out-of-vocabulary content words are expected. I will show here how and when our method boosts the performance of a translation system, and present our recent participation with this approach in the Biomedical Translation Task in WMT16.
2015
WikiParable - Data Categorisation Platform (Version 1.0)
Cristina España-Bonet
Technical Report, Universitat Politècnica de Catalunya, Computer Science Department, November 2015.
[ Abstract | PDF | BibTeX | arXiv ]
This document describes WikiParable, an on-line platform designed for data categorisation. Its purpose is twofold and the tool can be used both to annotate data and to evaluate automatic categorisations. As a main use case and aim of the implementation, the interface has been used within the TACARDI project to annotate Wikipedia articles in different domains and languages.
@TechReport{WikiParableV1.0,
author = {{Espa{\~n}a-Bonet}, Cristina},
title = {WikiParable -- Data Categorisation Platform (Version 1.0)},
year = {2015},
month = {November},
date = {16},
institution = {Universitat Polit\`ecnica de Catalunya, Computer Science Department},
url = {http://hdl.handle.net/2117/79539},
language = {english}
}
Journey through Natural Language Processing
Cristina España-Bonet
Poster at Google NLP PhD Summit 2015, Zurich, Switzerland, September 2015.
[ Abstract | PDF | BibTeX | arXiv ]
Summary of some of the work I have been involved in over the last three years.
@Misc{CEBjourney,
author = {{Espa{\~n}a-Bonet}, Cristina},
title = {Journey through Natural Language Processing},
howpublished = {Poster},
year = {2015},
month = {September},
date = {23},
address = {Zurich, Switzerland},
language = {english}
}
Overview of TweetMT: A Shared Task on Machine Translation of Tweets at SEPLN 2015
Iñaki Alegria, Nora Aranberri, Cristina España-Bonet, Pablo Gamallo, Hugo Gonçalo Oliveira, Eva Martínez Garcia, Iñaki San Vicente, Antonio Toral, Arkaitz Zubiaga
Proceedings of the Tweet Translation Workshop, at "XXXI Congreso de la Sociedad Española de Procesamiento de lenguaje natural" and CEUR Workshop Proceedings, volume 1445, pages 8-19, Alacant, Spain, September 2015.
[ Abstract | PDF | BibTeX | arXiv | Slides ]
This article presents an overview of the shared task that took place as part of the TweetMT workshop held at SEPLN 2015. The task consisted in translating collections of tweets from and to several languages. The article outlines the data collection and annotation process, the development and evaluation of the shared task, as well as the results achieved by the participants.
@InProceedings{tweetMT_overview,
author = {{Alegr\'ia}, I{\~n}aki and {Aranberri}, Nora and {Espa{\~n}a-Bonet}, Cristina and {Gamallo}, Pablo and
{Gon\c{c}alo Oliveira}, Hugo and {Mart\'inez Garc\'ia}, Eva and {San Vicente}, I\~naki and
{Toral}, Antonio and {Zubiaga}, Arkaitz},
title = {Overview of TweetMT: A Shared Task on Machine Translation of Tweets at SEPLN 2015},
booktitle = {Proceedings of the Tweet Translation Workshop, at "XXXI Congreso de la Sociedad Espa{\~n}ola de
Procesamiento de lenguaje natural" and CEUR Workshop Proceedings.},
pages = {8--19},
volume = {1445},
year = {2015},
month = {September},
date = {15},
address = {Alacant, Spain},
language = {english}
}
The UPC TweetMT participation: Translating Formal Tweets using Context Information
Eva Martínez Garcia, Cristina España-Bonet, Lluís Màrquez
Proceedings of the Tweet Translation Workshop, at "XXXI Congreso de la Sociedad Española de Procesamiento de lenguaje natural" and CEUR Workshop Proceedings, volume 1445, pages 25-32, Alacant, Spain, September 2015.
[ Abstract | PDF | BibTeX | arXiv ]
In this paper, we describe the UPC systems participating in the TweetMT shared task. We developed two main systems that were applied to the Spanish–Catalan language pair: a state-of-the-art phrase-based statistical machine translation system and a context-aware system. In the second approach, we define "context" for a tweet as the tweets of a user produced in the same day, and also, we study the impact of this kind of information in the final translations when using a document-level decoder. A variant of this approach considers also semantic information from bilingual embeddings.
@InProceedings{tweetMT_martinezetal15,
author = {{Mart\'inez}, Eva and {Espa{\~n}a-Bonet}, Cristina and {M\`arquez}, Llu\'{i}s},
title = {The UPC TweetMT participation: Translating Formal Tweets using Context Information},
booktitle = {Proceedings of the Tweet Translation Workshop, at "XXXI Congreso de la Sociedad Espa{\~n}ola de
Procesamiento de lenguaje natural" and CEUR Workshop Proceedings.},
pages = {25--32},
volume = {1445},
year = {2015},
month = {September},
date = {15},
address = {Alacant, Spain},
language = {english}
}
A Factory of Comparable Corpora from Wikipedia
Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba, Lluís Màrquez
Proceedings of the 8th Workshop on Building and Using Comparable Corpora (BUCC), pages 3-13, Beijing, China, July 2015.
[ Abstract | PDF | BibTeX | arXiv ]
Multiple approaches to grab comparable data from the Web have been developed up to date. Nevertheless, coming out with a high-quality comparable corpus of a specific topic is not straightforward. We present a model for the automatic extraction of comparable texts in multiple languages and on specific topics from Wikipedia. In order to prove the value of the model, we automatically extract parallel sentences from the comparable collections and use them to train statistical machine translation engines for specific domains. Our experiments on the English-Spanish pair in the domains of Computer Science, Science, and Sports show that our in-domain translator performs significantly better than a generic one when translating in-domain Wikipedia articles. Moreover, we show that these corpora can help when translating out-of-domain texts.
@InProceedings{Barronetal:2015,
author = {{Barr\'on-Cede{\~n}o}, Alberto and {Espa{\~n}a-Bonet}, Cristina and
{Boldoba}, Josu and {M\`arquez}, Llu\'{i}s},
title = "{A Factory of Comparable Corpora from Wikipedia}",
booktitle = "{Proceedings of the 8th Workshop on Building and Using Comparable Corpora (BUCC)}",
pages = {3--13},
year = {2015},
month = {July},
date = {30},
address = {Beijing, China},
language = {english},
url = {http://www.aclweb.org/anthology/W15-3402}
}
Document-Level Machine Translation with Word Vector Models
Eva Martínez Garcia, Cristina España-Bonet, Lluís Màrquez
Proceedings of the 18th Annual Conference of the European Association for Machine Translation (EAMT), pages 59-66, Antalya, Turkey, May 2015.
[ Abstract | PDF | BibTeX | arXiv ]
In this paper we apply distributional semantic information to document-level machine translation. We train monolingual and bilingual word vector models on large corpora and we evaluate them first in a cross-lingual lexical substitution task and then on the final translation task. For translation, we incorporate the semantic information in a statistical document-level decoder (Docent), by enforcing translation choices that are semantically similar to the context. As expected, the bilingual word vector models are more appropriate for the purpose of translation. The final document-level translator incorporating the semantic model outperforms the basic Docent (without semantics) and also performs slightly over a standard sentence-level SMT system in terms of ULC (the average of a set of standard automatic evaluation metrics for MT). Finally, we also present some manual analysis of the translations of some concrete documents.
@InProceedings{eamt15_martinezetal15,
author = {{Mart\'inez}, E. and {Espa{\~n}a-Bonet}, C. and {M\`arquez}, L.},
title = {Document-Level Machine Translation with Word Vector Models},
booktitle = {Proceedings of the 18th Annual Conference of the European Association for Machine Translation (EAMT)},
pages = {59--66},
year = {2015},
month = {May},
date = {13},
address = {Antalya, Turkey},
language = {english}
}
A broad stroke on Machine Translation Evaluation
Cristina España-Bonet
Invited talk at the Faculty of Informatics (UPV/EHU) Donosti, March 13, 2015.
[ Abstract | Slides ]
This broad stroke on Machine Translation Evaluation overviews current approaches and methodologies. MT evaluation is put in context and we argue why it must be considered a delicate topic. The most common manual and automatic evaluation measures are described and new approaches sketched. Finally, several tools for MT evaluation are introduced, paying special attention to the Asiya Toolkit.
2014
Word's Vector Representations meet Machine Translation
Eva Martínez Garcia, Cristina España-Bonet, Jörg Tiedemann, Lluís Màrquez
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 132-134, October 25, 2014, Doha, Qatar.
[ Abstract | PDF | BibTeX | arXiv ]
Distributed vector representations of words are useful in various NLP tasks. We briefly review the CBOW approach and propose a bilingual application of this architecture with the aim to improve consistency and coherence of Machine Translation. The primary goal of the bilingual extension is to handle ambiguous words for which the different senses are conflated in the monolingual setup.
@InProceedings{sst8_martinezetal14,
author = {{Mart\'inez}, E. and {Espa{\~n}a-Bonet}, C. and {Tiedemann}, J. and {M\`arquez}, L.},
title = {Word's Vector Representations meet Machine Translation},
booktitle = {Proceedings of the eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8)},
pages = {132--134},
year = {2014},
month = {October},
date = {25},
address = {Doha, Qatar},
language = {english}
}
A hybrid machine translation architecture guided by syntax
Gorka Labaka, Cristina España-Bonet, Lluís Màrquez, Kepa Sarasola
Machine Translation Journal, Vol. 28, Issue 2, pages 91-125, October, 2014.
[ Abstract | PDF | BibTeX | arXiv ]
This article presents a hybrid architecture which combines rule-based machine translation (RBMT) with phrase-based statistical machine translation (SMT). The hybrid translation system is guided by the rule-based engine. Before the transfer step, a varied set of partial candidate translations is calculated with the SMT system and used to enrich the tree-based representation with more translation alternatives. The final translation is constructed by choosing the most probable combination among the available fragments using monotone statistical decoding following the order provided by the rule-based system. We apply the hybrid model to a pair of distantly related languages, Spanish and Basque, and perform extensive experimentation on two different corpora. According to our empirical evaluation, the hybrid approach outperforms the best individual system across a varied set of automatic translation evaluation metrics. Following some output analysis to better understand the behaviour of the hybrid system, we explore the possibility of adding alternative parse trees and extra features to the hybrid decoder. Finally, we present a twofold manual evaluation of the translation systems studied in this paper, consisting of (i) a pairwise output comparison and (ii) an individual task-oriented evaluation using HTER. Interestingly, the manual evaluation shows some contradictory results with respect to the automatic evaluation; humans tend to prefer the translations from the RBMT system over the statistical and hybrid translations.
@article{labakaetal14,
author = {Labaka, Gorka and Espa{\~n}a-Bonet, Cristina and M\`arquez, Llu\'is and Sarasola, Kepa},
title = {A hybrid machine translation architecture guided by syntax},
journal = {Machine Translation},
doi = {10.1007/s10590-014-9153-0},
volume = 28,
issue = 2,
pages = {91--125},
year = {2014},
month = {October},
issn = {0922-6567},
url = {http://dx.doi.org/10.1007/s10590-014-9153-0},
publisher = {Springer Netherlands}
}
Document-Level Machine Translation as a Re-translation Process
Eva Martínez Garcia, Cristina España-Bonet, Lluís Màrquez
Procesamiento del Lenguaje Natural, 53, 103-110. September, 2014
[
Abstract
PDF
BibTeX
arXiv
]
Most current Machine Translation systems are designed to translate a document sentence by sentence, ignoring discourse information and producing incoherencies in the final translations. In this paper we present document-level-oriented post-processes to improve the coherence and consistency of translations. Incoherencies are detected and new partial translations are proposed. The work focuses on two phenomena: words with inconsistent translations throughout a text, and gender and number agreement among words. Since we deal with specific phenomena, automatic evaluation does not reflect significant variations in the translations; however, improvements are observed through a manual evaluation.
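A minimal sketch of the first phenomenon, lexical consistency checking, might look as follows; the word-aligned input format and the 80% threshold are assumptions made for illustration, not the paper's actual post-process:

from collections import Counter, defaultdict

# Hypothetical word-aligned output for one document:
# (source word, chosen translation) pairs over all sentences.
aligned = [
    ("banco", "bank"), ("banco", "bank"), ("banco", "bench"),
    ("coche", "car"), ("coche", "car"),
]

translations = defaultdict(Counter)
for src, tgt in aligned:
    translations[src][tgt] += 1

# Flag source words whose dominant translation covers less than 80% of the
# occurrences; these are candidates for proposing a new partial translation.
for src, counts in translations.items():
    tgt, freq = counts.most_common(1)[0]
    if freq / sum(counts.values()) < 0.8:
        print(f"inconsistent: {src} -> {dict(counts)}; suggest '{tgt}'")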
@article{martinez14,
author = {{Mart\'inez}, E. and {Espa{\~n}a-Bonet}, C. and {M\`arquez}, L.},
title = {Document-Level Machine Translation as a Re-translation Process},
journal = {Procesamiento del Lenguaje Natural},
volume = 53,
pages = {103--110},
year = {2014},
month = {September}
}
Statistical Machine Translation and Automatic Evaluation
Cristina España-Bonet and Meritxell Gonzàlez
Tutorial at the 9th edition of the Language Resources and Evaluation Conference, Reykjavik, May 2014.
[
Abstract
Slides Part I
Slides Part II
BibTeX
]
The tutorial is divided into two main parts. The main objective of the first part is to get to know the fundamentals behind the three modules of a statistical system: the language model, the translation model and the decoding or search for the best translation.
The presentation, although theoretical, focuses on understanding how standard software such as SRILM [Stolcke, 2002] and Moses [Koehn et al., 2007] works and the logic behind it, so that the available extensions and modifications are easy to understand.
The second part of the tutorial covers how these systems, and machine translation systems in general, are evaluated automatically. Machine translation evaluation is a delicate topic. Here we put the evaluation into context, describe the standard metrics in detail and give an overview of other existing possibilities and paradigms such as linguistically motivated measures and confidence estimation.
Both parts end with a video showing how to build a phrase-based statistical machine translation system in practice (Part I) and how to evaluate translation systems in depth (Part II).
Webpage: http://slifer.lsi.upc.edu/lrec-mttutorial
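For reference, the three modules of the first part correspond to the factors of the textbook noisy-channel formulation of SMT (standard material, not specific to this tutorial):

\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

where P(e) is the language model, P(f | e) the translation model, and the arg max over candidate translations e of the source sentence f is carried out by the decoder.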
@Unpublished{tutorialLREC14,
author = {{Espa{\~n}a-Bonet}, C. and {Gonz\`alez}, M.},
title = {Statistical Machine Translation and Automatic Evaluation},
note = {Tutorial at the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
url = {http://slifer.lsi.upc.edu/lrec-mttutorial},
year = {2014},
month = {may},
date = {26--31},
address = {Reykjavik, Iceland},
language = {english}}
2013
Wikicardi: Hacia la extracción de oraciones paralelas de Wikipedia
Josu Boldoba, Alberto Barrón-Cedeño, Cristina España-Bonet
Research Report LSI-14-3-R
[
Abstract
PDF
BibTeX
arXiv
]
One of the goals of the Tacardi project (TIN2012-38523-C02-00) is to extract parallel sentences from comparable corpora in order to enrich and adapt machine translation systems. In this research we use a subset of Wikipedia as a comparable corpus. This report describes our progress on the extraction of parallel fragments from Wikipedia. First, we discuss how we have defined the three domains of interest (science, computer science and sports) within the encyclopedia, and how we have extracted the texts and other data needed to characterise the articles in the different languages. We then briefly discuss the models we will use to identify parallel sentences and give only a sample of preliminary results. The data obtained so far suggest that it will be possible to extract parallel sentences from the domains of interest in the short term, although we do not yet have an estimate of their volume.
@TechReport{boldobaLSI143R,
author = {{Boldoba}, J. and {Barr\'on-Cede{\~n}o}, A. and {Espa{\~n}a-Bonet}, C.},
title = {Wikicardi: Hacia la extracci\'on de oraciones paralelas de Wikipedia},
institution = {LSI, UPC},
year = {2014},
month = {January},
type = {Research Report},
number = {LSI-14-3-R}
}
Experiments on Document Level Machine Translation
Eva Martínez Garcia, Lluís Màrquez, Cristina España-Bonet
Research Report LSI-14-11-R
[
Abstract
PDF
BibTeX
arXiv
]
@TechReport{martinezLSI1411R,
author = {{Mart\'inez}, E. and {M\`arquez}, L. and {Espa{\~n}a-Bonet}, C.},
title = {Experiments on Document Level Machine Translation},
institution = {LSI, UPC},
year = {2014},
month = {January},
type = {Research Report},
number = {LSI-14-11-R}
}
MT Techniques in a Retrieval System of Semantically Enriched Patents
Meritxell Gonzàlez, Maria Mateva, Ramona Enache, Cristina España-Bonet, Lluís Màrquez, Borislav Popov, Aarne Ranta
Proceedings of the Machine Translation Summit XIV, Nice, France, September 2-6, 2013.
[
Abstract
PDF
BibTeX
arXiv
]
This paper focuses on how automatic translation techniques integrated in a patent retrieval system increase its capabilities and make possible extended features and functionalities. We describe 1) a novel methodology for natural language to SPARQL translation based on grammar–ontology interoperability automation and a query grammar for the patents domain; 2) a strategy for statistical translation of patents that allows transferring semantic annotations to the target language; 3) a built-in knowledge representation infrastructure that uses multilingual semantic annotations; and 4) an online application that offers a multilingual search interface over structured knowledge databases (domain ontologies) and multilingual documents (biomedical patents) that have been automatically translated.
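To make the retrieval flow concrete, here is a minimal Python sketch that queries a SPARQL endpoint with the SPARQLWrapper library; the endpoint URL, ontology namespace and property names are hypothetical stand-ins, not the actual repository schema of the prototype:

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and vocabulary; the real system stores its semantic
# annotations in a repository with a patent-specific ontology.
sparql = SPARQLWrapper("http://example.org/patents/sparql")
sparql.setQuery("""
    PREFIX pat: <http://example.org/patent-ontology#>
    SELECT ?patent ?title WHERE {
        ?patent a pat:Patent ;
                pat:title ?title ;
                pat:mentionsCompound pat:Aspirin .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["patent"]["value"], "-", row["title"]["value"])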
@InProceedings{MTSpropotype,
author = {{Gonz\`alez}, M. and {Mateva}, M. and {Enache}, R. and {Espa{\~n}a-Bonet}, C. and {M\`arquez}, L. and {Popov}, B. and {Ranta}, A.},
title = {MT Techniques in a Retrieval System of Semantically Enriched Patents},
booktitle = {Proceedings of the Machine Translation Summit XIV},
pages = {-},
year = {2013},
month = {sep},
date = {2},
address = {Nice, France},
language = {english}
}
2012
Deep evaluation of hybrid architectures: Use of different metrics in MERT weight optimization
Cristina España-Bonet, Gorka Labaka, Arantza Díaz de Ilarraza, Lluís Màrquez, Kepa Sarasola
Proceedings of the Free/Open-Source Rule-Based Machine Translation Workshop, Gothenburg 14-15 June, 2012.
[
Abstract
PDF
Slides
BibTeX
arXiv
]
The process of developing hybrid MT systems is usually guided by
an evaluation method used to compare different combinations of basic subsystems. This work presents a deep
evaluation experiment of a hybrid architecture, which combines rule-based and statistical
translation approaches. Differences between the results obtained from automatic and human
evaluations corroborate the inappropriateness of pure lexical automatic evaluation metrics
to compare the outputs of systems that use very different translation approaches. An examination
of sentences with controversial results suggested that linguistic well-formedness
should be considered in the evaluation of output translations. Following this idea, we have
experimented with a new simple automatic evaluation metric, which combines lexical and
PoS information. This measure showed higher agreement with human assessments than
BLEU in a previous study (Labaka et al., 2011). In this paper we have extended its usage throughout
the system development cycle, focusing on its ability to improve parameter optimization.
Results are not totally conclusive. Manual evaluation reflects a slight improvement,
compared to BLEU, when using the proposed measure in system optimization. However,
the improvement is too small to draw any clear conclusion. We believe that we should
first focus on integrating more linguistically representative features in the development of the hybrid system, and then go deeper into the development of automatic evaluation metrics.
@InProceedings{SMatxinTeval2,
author = {{Espa{\~n}a-Bonet}, C. and {Labaka}, G. and {D\'iaz de Ilarraza}, A. and {M\`arquez}, L.
and {Sarasola}, K.},
title = {Deep evaluation of hybrid architectures: Use of different metrics in MERT weight optimization},
booktitle = {Proceedings of the Free/Open-Source Rule-Based Machine Translation Workshop},
pages = {65-76},
year = {2012},
month = {jun},
date = {14--15},
address = {Gothenburg},
language = {english}
}
A Hybrid System for Patent Translation
Ramona Enache, Cristina España-Bonet, Aarne Ranta, Lluís Màrquez
Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT12), Trento, Italy, May 28-30, 2012.
[
Abstract
PDF
BibTeX
arXiv
]
This work presents a hybrid MT (HMT) system for patent translation. The system exploits the high coverage of SMT and the high precision of an RBMT system based on GF to deal with specific issues of the language. The translator is specifically developed to translate patents and is evaluated on the English-French language pair. Although the issues tackled by the grammar are not yet numerous, both manual and automatic evaluations consistently show a preference for the hybrid system over the two individual translators.
@InProceedings{enacheEtal12,
author = {{Enache}, R. and {Espa{\~n}a-Bonet}, C. and {Ranta}, A. and {M\`arquez}, L.},
title = {A Hybrid System for Patent Translation},
booktitle = {Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT12)},
pages = {269--276},
year = {2012},
month = {may},
date = {28--30},
address = {Trento, Italy},
language = {english}
}
Context-Aware Machine Translation for Software Localization
Víctor Muntés, Patricia Paladini, Cristina España-Bonet, Lluís Màrquez
Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT12), Trento, Italy, May 28-30, 2012.
[
Abstract
PDF
BibTeX
arXiv
]
Software localization requires translating short text strings appearing in user interfaces (UIs) into several languages. These strings are usually unrelated to the other strings in the UI. Due to the lack of semantic context, many ambiguity problems cannot be solved during translation. However, UIs are composed of several visual components to which text strings are associated. Although this association might be very valuable for word disambiguation, it has not been exploited. In this paper, we present the problem of lack of context awareness in UI localization, providing real examples and identifying the main research challenges.
@InProceedings{muntesEtal12,
author = {{Munt\'es}, V. and {Paladini}, P. and {Espa{\~n}a-Bonet}, C. and {M\`arquez}, L.},
title = {Context-Aware Machine Translation for Software Localization},
booktitle = {Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT12)},
pages = {77--80},
year = {2012},
month = {may},
date = {28},
address = {Trento, Italy},
language = {english}
}
Full Machine Translation for Factoid Question Answering
Cristina España-Bonet, Pere R. Comas
Proceedings of the EACL Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT), Avignon, France, April 23, 2012.
[
Abstract
PDF
Slides
BibTeX
arXiv
]
In this paper we present an SMT-based approach to Question Answering (QA). QA
is the task of extracting exact answers in
response to natural language questions. In
our approach, the answer is a translation of
the question obtained with an SMT system.
We use the n-best translations of a given
question to find similar sentences in the
document collection that contain the real
answer. Although it is not the first time that
SMT inspires a QA system, it is the first
approach that uses a full Machine Translation system for generating answers. Our
approach is validated with the datasets of the
TREC QA evaluation.
@InProceedings{espanaComas12,
author = {{Espa{\~n}a-Bonet}, C. and {Comas}, P.R.},
title = {Full Machine Translation for Factoid Question Answering},
booktitle = {Proceedings of the EACL Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT)},
pages = {20--29},
year = {2012},
month = {apr},
date = {23},
address = {Avignon, France},
language = {english}
}
The Patents Retrieval Prototype in the MOLTO project
Milen Chechev, Meritxell Gonzàlez, Lluís Màrquez, Cristina España-Bonet
Proceedings of the World Wide Web 2012, Lyon, France, April 16, 2012.
[
Abstract
PDF
BibTeX
arXiv
]
This paper describes the patents retrieval prototype developed within the MOLTO project. The prototype aims to
provide a multilingual natural language interface for querying the content of patent documents. The developed system
is focused on the biomedical and pharmaceutical domain
and includes the translation of the patent claims and abstracts into English, French and German. Aiming at the
best retrieval results of the patent information and text
content, patent documents are preprocessed and semantically annotated. Then, the annotations are stored and
indexed in an OWLIM semantic repository, which contains a patent-specific ontology and others from the domain. The prototype, accessible online at http://molto-patents.ontotext.com, presents a multilingual natural language interface to query the retrieval system. In MOLTO, the multilingualism of the queries is addressed by means of the GF Tool, which provides an easy way to build and maintain controlled-language grammars for interlingual translation in limited domains. The abstract representation obtained from GF is used to retrieve both the matched RDF instances and the list of patents semantically related to the user's search criteria. The online interface allows browsing the retrieved patents and shows on the text the semantic annotations that explain why a particular patent matched the user's criteria.
@InProceedings{www12patents,
author = {{Chechev}, M. and {Gonz\`alez}, M. and {M\`arquez}, L. and {Espa{\~n}a-Bonet}, C.},
title = {The Patents Retrieval Prototype in the MOLTO project},
booktitle = {Proceedings of the World Wide Web 2012},
pages = {4-8},
year = {2012},
month = {apr},
date = {16},
address = {Lyon, France},
language = {english}
}
2011
Deep evaluation of hybrid architectures: simple metrics correlated with human judgments
Gorka Labaka, Arantza Díaz de Ilarraza, Cristina España-Bonet, Lluís Màrquez, Kepa Sarasola
Proceedings of the International Workshop on Using Linguistic Information for Hybrid Machine Translation (LIHMT), Barcelona, November 18th, 2011.
[
Abstract
PDF
Slides
BibTeX
arXiv
]
The process of developing hybrid MT systems is guided by the evaluation method used to compare different combinations of basic subsystems. This work presents a deep evaluation experiment of a hybrid architecture that tries to get the best of both worlds, rule-based and statistical. In a first evaluation, human assessments were used to compare only the single statistical system and the hybrid one; the rule-based system was not compared by hand because the results of automatic evaluation showed a clear disadvantage. However, a second and wider evaluation experiment surprisingly showed that, according to human evaluation, the best system was the rule-based one, the one that achieved the worst results under automatic evaluation. An examination of sentences with controversial results suggested that linguistic well-formedness of the output should be considered in evaluation. After experimenting with 6 possible metrics, we conclude that a simple arithmetic mean of BLEU and BLEU calculated on the parts of speech of words is clearly a more human-conformant metric than lexical metrics alone.
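Written out in our own notation, the metric singled out by the abstract is simply

\mathrm{Metric} = \tfrac{1}{2}\left(\mathrm{BLEU}_{\mathrm{word}} + \mathrm{BLEU}_{\mathrm{PoS}}\right),

where BLEU_PoS is standard BLEU computed over the part-of-speech sequences of output and reference instead of over the word forms.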
@InProceedings{SMatxinTeval,
author = {{Labaka}, G. and {D\'iaz de Ilarraza}, A. and {Espa{\~n}a-Bonet}, C. and {M\`arquez}, L.
and {Sarasola}, K.},
title = {Deep evaluation of hybrid architectures: simple metrics correlated with human judgments},
booktitle = {Proceedings of the International Workshop on Using Linguistic Information for
Hybrid Machine Translation},
pages = {50-57},
year = {2011},
month = {nov},
date = {18},
address = {Barcelona},
language = {english}
}
Descobrim l'Univers
Cristina España-Bonet
Invited talk at Tertúlies de Literatura Científica, UVic, Vic, October 25th 2011.
[
Abstract
Dossier 1
Dossier 2
Slides
Link video
arXiv
]
En "Descobrim l'Univers" s'han atacat tres aspectes relacionats amb la cosmologia: l'inici de l'Univers, alguns objectes i fenòmens astrofísics, i l'expansió i dimensionalitat de l'Univers. La xerrada es centrarà principalment en aquest últim punt. Partirem de les explicacions amb què us heu pogut familiaritzar amb el dossier subministrat i avançarem cap a entendre el nostre Univers actual, un univers que de manera sorprenent està expandint-se, cada cop es fa més gran, i, a més, ho fa de manera accelerada. L'importància d'aquest descobriment es veu reafirmada pel fet que el premi Nobel de física d'enguany s'ha concedit a tres investigadors que lideren els projectes que ho van anunciar.
Es pot trobar informació addicional al web de les jornades http://tlc.uvic.cat/2011/10/28/activitat-25102011-dra-cristina-espana-upc/.
Patent translation within the MOLTO project
Cristina España-Bonet, Ramona Enache, Adam Slaski, Aarne Ranta,
Lluís Màrquez, Meritxell Gonzàlez
Proceedings of the 4th Workshop on Patent Translation, MT Summit XIII, Xiamen, China, September 23, 2011.
[
Abstract
PDF
Slides
BibTeX
arXiv
]
MOLTO is an FP7 European project whose goal is to translate texts between multiple languages in real time and with high quality. Patent translation is a case study where research focuses on simultaneously obtaining large coverage without losing translation quality. This is achieved by hybridising a grammar-based multilingual translation system, GF, with a specialised statistical machine translation system. Moreover, both individual systems by themselves already represent a step forward in the translation of patents in the biomedical domain, for which the systems have been trained.
@InProceedings{patentsMOLTO11,
author = {{Espa{\~n}a-Bonet}, C. and {Enache}, R. and {Slaski}, A. and {Ranta}, A.
and {M\`arquez}, L. and {Gonz\`alez}, M.},
title = {Patent translation within the MOLTO project},
booktitle = {Proceedings of the 4th Workshop on Patent Translation, MT Summit XIII},
pages = {70-78},
year = {2011},
month = {sep},
date = {23},
address = {Xiamen, China},
language = {english}
}
Hybrid Machine Translation Guided by a Rule-Based System
Cristina España-Bonet, Gorka Labaka, Arantza Díaz de Ilarraza, Lluís Màrquez, Kepa Sarasola
Proceedings of the 13th Machine Translation Summit, Xiamen, China, September 19-23, 2011.
[
Abstract
PDF
Slides
BibTeX
arXiv
]
This paper presents a machine translation architecture which hybridizes Matxin, a rule-based system, with regular phrase-based Statistical
Machine Translation. In short, the hybrid translation process is guided by the rule-based engine and,
before transference, a set of partial candidate translations provided by SMT subsystems is used to
enrich the tree-based representation. The final hybrid translation is created by choosing the most
probable combination among the available fragments with a statistical decoder in a monotonic way.
We have applied the hybrid model to a pair of distant languages, Spanish and Basque, and according
to our evaluation (both automatic and manual) the hybrid approach significantly outperforms the best SMT system on out-of-domain data.
@InProceedings{SMatxinT1,
author = {{Espa{\~n}a-Bonet}, C. and {Labaka}, G. and {D\'iaz de Ilarraza}, A. and {M\`arquez}, L.
and {Sarasola}, K.},
title = {Hybrid Machine Translation Guided by a Rule-Based System},
booktitle = {Proceedings of the 13th Machine Translation Summit},
pages = {554-561},
year = {2011},
month = {sep},
date = {19-23},
address = {Xiamen, China},
language = {english}
}
Introduction to SMT and its standard tools
Cristina España-Bonet
GF Summer School, Barcelona, August 2011.
[
Abstract
Slides
]
This tutorial is intended to provide an introduction to Statistical Machine Translation. The statistical paradigm is one of the predominant ones within machine translation. This is possibly due to the simplicity of building a basic system with free software, the large community behind it and, of course, the good results that it achieves.
The main objective of the session is to get to know the fundamentals behind the three modules of a statistical system: the language model, the translation model and the decoding or search for the best translation. The presentation, although theoretical, is focused on understanding how software such as SRILM and Moses works and the logic behind it, so that it is easy to understand the available extensions and modifications.
We also devote a small portion of time to seeing how these systems, and machine translation systems in general, are evaluated automatically. Machine translation evaluation is a delicate topic. Here we put the evaluation into context, describe the standard metrics in detail and give an overview of the existing possibilities.
Finally, in a second part, the standard software is introduced and, if there is time, a toy SMT system is built. Otherwise, the main steps for building it are given.
2010
El Projecte MOLTO: Multi Lingual On-Line Translation
Cristina España-Bonet
Invited talk at the workshop La Indústria de la Traducció entre Llengües Romàniques, UPV, València, September 2010.
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
The final goal of MOLTO is to develop a set of tools for translating texts between multiple languages in real time and with high quality. In these tools each language is conceived as an independent module and can therefore be added directly on top of the base system. Within the project, prototypes will be built to cover most of the 23 official EU languages.
As its main technique, MOLTO uses domain-specific semantic grammars and ontology-based interlinguas. These components are implemented in Grammatical Framework (GF), a grammar formalism in which several languages are related through a common abstract syntax. GF has been applied in several small and medium-size domains, typically covering up to ten languages, but MOLTO will scale this up in terms of productivity and applicability.
Part of the scaling effort will go into increasing the size of the domains and the number of languages. It is also important to make the technology accessible to domain experts without GF experience and to minimise the effort needed to build a translator. Ideally, this can be achieved simply by extending a lexicon and writing a set of example sentences.
The most research-intensive parts of MOLTO are the interoperability between ontology standards (OWL) and GF grammars, and the extension of rule-based translation with statistical methods. The OWL-GF interoperability will enable multilingual natural-language interaction with machine-readable knowledge. Statistical methods will add robustness to the system, and new methods will have to be developed to combine GF grammars with statistical translation to the benefit of both.
After the three years of the project, MOLTO technology will be released as open-source libraries that can be plugged into standard translation tools and web pages and thus integrated into standard workflows. Along the way, web demonstrators will be built and the methodology will be applied to three case studies: mathematics exercises in 15 languages, patent data in at least 3 languages, and museum object descriptions in 15 languages.
Additional information can be found on the official website http://www.molto-project.eu/.
Robust Estimation of Feature Weights in Statistical Machine Translation
Cristina España-Bonet, Lluís Màrquez
Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT), Saint-Raphaël, France, May 2010.
[
Abstract
Postscript
PDF
Poster
BibTeX
arXiv
]
Weights of the various components in a
standard Statistical Machine Translation model are usually estimated via Minimum Error Rate Training. With this, one finds their optimum value on a development set with the expectation that these optimal weights generalise well to other test sets. However, this is not always the case when domains differ. This work uses a perceptron algorithm to learn more robust weights to be used on out-of-domain corpora without the need for specialised data. For an Arabic-to-English translation system, the generalisation of weights represents an improvement of more than 2 points of BLEU with respect to the MERT baseline using the same information.
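A minimal structured-perceptron sketch of this idea; the features, the BLEU-based oracle and the learning rate below are toy placeholders rather than the exact setup of the paper:

import numpy as np

def feature_vector(hyp):
    # Stand-ins for the usual log-linear components (language model score,
    # translation model scores, word penalty, ...).
    return np.array(hyp["features"])

def perceptron_weights(nbest_lists, epochs=10, lr=0.1):
    w = np.zeros(3)
    for _ in range(epochs):
        for nbest in nbest_lists:
            # Move weights towards the oracle (highest-BLEU) hypothesis and
            # away from the current model-best hypothesis.
            model_best = max(nbest, key=lambda h: w @ feature_vector(h))
            oracle = max(nbest, key=lambda h: h["bleu"])
            w += lr * (feature_vector(oracle) - feature_vector(model_best))
    return w

# Toy n-best list for one sentence: two hypotheses with three features each.
nbest = [{"features": [0.2, -1.0, 0.5], "bleu": 0.3},
         {"features": [0.1, -0.5, 0.7], "bleu": 0.6}]
print(perceptron_weights([nbest]))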
@InProceedings{espanaMarquez,
author = {{Espa{\~n}a-Bonet}, C. and {M\`arquez}, L.},
title = {Robust Estimation of Feature Weights in Statistical Machine Translation},
booktitle = {Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT'10)},
year = {2010},
month = {may},
date = {27-28},
address = {Saint-Rapha\"{e}l, France},
language = {english}
}
Language Technology Challenges of a 'small' Language (Catalan)
M. Melero, G. Boleda, M. Cuadros, C. España-Bonet, L. Padró, M. Quixal, C. Rodríguez, R. Saurí
Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC), Valletta, Malta, May 2010.
[
Abstract
Postscript
PDF
Poster
BibTeX
arXiv
]
In this paper, we present a brief snapshot of the state of affairs in computational processing of Catalan and the initiatives that are starting to take place in an effort to bring the field a step forward, by making a better and more efficient use of the already existing resources and tools, by bridging the gap between research and market, and by establishing periodical meeting points for the community. In particular, we present the results of the First Workshop on the Computational Processing of Catalan, which succeeded in putting together a fair representation of the research in the area, and received attention from both the industry and the administration. Aside from facilitating communication among researchers and between developers and users, the Workshop provided the organizers with valuable information about existing resources, tools, developers and providers. This information has allowed us to go a step further by setting up a "harvesting" procedure which will hopefully build the seed of a portal-catalogue-observatory of language resources and technologies in Catalan.
@InProceedings{MELERO10.628,
author = {Maite Melero and Gemma Boleda and Montse Cuadros and Cristina Espa{\~n}a-Bonet and Llu\'is Padr\'o and Mart\'i Quixal and Carlos Rodr\'iguez and Roser Saur\'i},
title = {Language Technology Challenges of a 'Small' Language (Catalan)},
booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
year = {2010},
month = {may},
date = {19-21},
address = {Valletta, Malta},
editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis,
Mike Rosner, Daniel Tapias},
publisher = {European Language Resources Association (ELRA)},
isbn = {2-9517408-6-7},
language = {english}
}
Statistical Machine Translation - A practical tutorial
Cristina España-Bonet
Tutorial at MOLTO kick-off meeting, Barcelona, March 2010.
[
Abstract
PDF (to show)
PDF (to print)
arXiv
]
Tutorial for beginners in SMT. It is intended to show the fundamentals in less than 90 minutes and includes some guidelines for constructing an SMT baseline.
Robust Estimation of Feature Weights in SMT
Cristina España-Bonet, Lluís Màrquez
Talk at OpenMT2 kick-off meeting, Ulia, Donostia, January 2010.
[
Abstract
Postscript
PDF
arXiv
]
Weights of the various components in a standard Statistical Machine Translation model are usually estimated via Minimum Error Rate Training. With this, one finds their optimum value on a development set with the expectation that these optimal weights generalise well to other test sets. However, this is not always the case when domains differ. Our work uses a perceptron algorithm to learn more robust weights to be used on out-of-domain corpora without the need for specialised data. For an Arabic-to-English translation system, the generalisation of the weights represents an
improvement of more than 2 points of BLEU with respect to the MERT baseline using exactly the same information.
2009
Discriminative Phrase-Based Models for Arabic Machine Translation
Cristina España-Bonet, Jesús Giménez, Lluís Màrquez
ACM Transactions on Asian Language Information Processing Journal (TALIP), vol. 8, No. 4, pag. 1-20. December, 2009.
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
A design for an Arabic-to-English translation system is presented. The core of the system implements a standard Phrase-Based Statistical Machine Translation architecture, but it is extended by incorporating a local discriminative phrase selection model to address the semantic ambiguity of Arabic. Local classifiers are trained using linguistic information and context to translate a phrase, and this significantly increases the accuracy in phrase selection with respect to the most frequent translation traditionally considered. These classifiers are integrated into the translation system so that the global task benefits from the discriminative learning. As a result, we obtain significant improvements in the full translation task at the lexical, syntactic and semantic levels as measured by a heterogeneous set of automatic evaluation metrics.
@article{talip09,
author = {{Espa{\~n}a-Bonet}, C. and {Gim\'enez}, J. and {M\`arquez}, L.},
title = {Discriminative Phrase-Based Models for Arabic Machine Translation},
journal = {ACM Transactions on Asian Language Information Processing (TALIP)},
volume = {8},
number = {4},
pages = {1--20},
articleno = {15},
month = {December},
year = {2009},
doi = {http://doi.acm.org/10.1145/1644879.1644882},
publisher = {ACM}
}
CoCo, a web interface for corpora compilation
C. España-Bonet, M. Vila, H. Rodríguez, M.A. Martí
Procesamiento del Lenguaje Natural, 43, 367-368. September, 2009.
[
Abstract
Postscript
PDF
Poster
BibTeX
arXiv
]
CoCo is a collaborative web interface for the compilation of linguistic resources. In this demo we are presenting one of its possible applications: paraphrase acquisition.
@ARTICLE{seplncoco2009,
author = {{Espa{\~n}a-Bonet}, C. and {Vila}, M. and {Rodr\'iguez}, H. and {Mart\'i}, M.A.},
title = {CoCo, a web interface for corpora compilation},
journal = {Procesamiento del Lenguaje Natural},
volume = {43},
pages = {367--368},
year = {2009},
month = {September}
}
Conclusiones de la primera Jornada del Procesamiento Computacional del Catalán
G. Boleda, M. Cuadros, C. España-Bonet, M. Melero, L. Padró, M. Quixal, C. Rodríguez
Procesamiento del Lenguaje Natural, 43, 387-388. September, 2009.
[
Abstract
Postscript
PDF
Poster
BibTeX
arXiv
]
Starting from the observation that the Catalan Natural Language and Speech Processing research community needed greater cohesion, a workshop (Jornada del Processament Computacional del Català, JPCC) was organised and held at the Palau Robert in Barcelona in March 2009. The goals of the workshop were (1) to improve communication and collaboration among the different research groups, companies and institutions that develop computational tools and resources for Catalan, (2) to find ways of exploiting the existing resources efficiently and (3) to give visibility to research on the computational processing of Catalan.
@ARTICLE{seplnjpc2009,
author = {{Boleda}, G. and {Cuadros}, M. and {Espa{\~n}a-Bonet}, C. and {Melero}, M. and {Padr\'o}, L. and {Quixal}, M. and {Rodr\'iguez}, C.},
title = {Conclusiones de la primera Jornada del Procesamiento Computacional del Catal\'an},
journal = {Procesamiento del Lenguaje Natural},
volume = 43,
pages = {387-388},
year = 2009,
month = {September}
}
Sobre la I Jornada del Processament Computacional del català
G. Boleda, M. Cuadros, C. España-Bonet, M. Melero, L. Padró, M. Quixal, C. Rodríguez
Llengua i Ús, vol 45, 23-32, 2009.
[
Abstract
Postscript
PDF
BibTeX
arXiv
]
Computational language processing encompasses any activity related to the creation, management and use of language technology and linguistic resources. Scientifically, this activity is central to disciplines such as corpus linguistics, language engineering and written or spoken natural language processing. In everyday life, such processing is embedded in an ever wider range of increasingly common applications: automatic call-centre systems, machine translation, etc.
The vast majority of these applications require language-specific tools and resources. For languages with a large market, such as English or Spanish, the offer of products and services based on language technology is varied and commonplace. For languages such as Catalan, it is harder to find products and services that ship with this technology "out of the box".
In order to reflect the current state of language technologies applied to Catalan, to put the members of this community in contact and to promote initiatives that strengthen them, the first Jornada del Processament Computacional del Català was held at the Palau Robert in Barcelona in March 2009. The workshop was intended both as a meeting point and as a showcase for the research groups in the area, and as a starting point for the debate on how to organise the community in order to promote the use and development of Catalan both in language technology and in the products and services that depend on it. This article summarises the content and conclusions of the workshop.
@ARTICLE{llenguaius09,
author = {{Boleda}, G. and {Cuadros}, M. and {Espa{\~n}a-Bonet}, C. and {Melero}, M. and {Padr\'o}, L. and {Quixal}, M. and {Rodr\'iguez}, C.},
title = "Sobre la I Jornada del Processament Computacional del catal\`a",
journal = "Llengua i \'Us",
volume = {45},
pages = {23--32},
year = {2009}
}
El català i les tecnologies de la llengua
G. Boleda, M. Cuadros, C. España-Bonet, M. Melero, L. Padró, M. Quixal, C. Rodríguez
Llengua, Societat i Comunicació, vol 7, 20-26, 2009.
[
Abstract
Postscript
PDF
BibTeX
arXiv
]
(See Introduction)
@ARTICLE{lsc09,
author = {{Boleda}, G. and {Cuadros}, M. and {Espa{\~n}a-Bonet}, C. and {Melero}, M. and {Padr\'o}, L. and {Quixal}, M. and {Rodr\'iguez}, C.},
title = "El catal\`a i les tecnologies de la llengua",
journal = "Llengua, Societat i Comunicaci\'o",
volume = 7,
pages = "20--26",
year = 2009,
month = {July}
}
Type Ia SNe along redshift: the R(SiII) ratio and the expansion velocities in intermediate z supernovae
G. Altavilla, P. Ruiz-Lapuente, A. Balastegui, J. Mendez, M. Irwin, C. España-Bonet, R.S. Ellis, G. Folatelli, A. Goobar, W. Hillebrandt, R.M. McMahon, S. Nobili, V. Stanishev, N.A. Walton
The Astrophysical Journal, vol 695, 135-148, 2009
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
We present a study of intermediate-z SNe Ia using the empirical physical diagrams which permit the investigation of those SNe explosions. This information can be very useful to reduce systematic uncertainties in the Hubble diagram of SNe Ia up to high z. The study of the expansion velocities and the measurement of the ratio R(SiII) allow subtyping of SNe Ia as done in nearby samples. The evolution of this ratio as seen in the diagram R(SiII)-(t), together with R(SiII)_max versus (B-V)_0, indicates consistency of the properties at intermediate z compared with the nearby SNe Ia. At intermediate z, expansion velocities of Ca II and Si II are found to be similar to those of the nearby sample. This is found in a sample of six SNe Ia in the range 0.033≤z≤0.329 discovered within the International Time Programme of SNe Ia for Cosmology and Physics in the spring run of 2002. The programme ran under "Omega and Lambda from Supernovae and the Physics of Supernova Explosions" at the telescopes of the European Northern Observatory (ENO) at La Palma (Canary Islands, Spain). Two SNe Ia at intermediate z were of the cool FAINT type, one being an SN1986G-like object highly reddened. The R(SiII) ratio, as well as subclassification of the SNe Ia beyond templates, helps to place SNe Ia in their sequence of brightness and to distinguish between reddened and intrinsically red supernovae. This test can be done with very high z SNe Ia and it will help to reduce systematic uncertainties due to extinction by dust. It should allow mapping the high-z sample into the nearby one.
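For context, the ratio is commonly defined in the SN Ia literature (this definition is not restated in the abstract) as the relative depth of two Si II absorption features:

\mathcal{R}(\mathrm{Si\,II}) = \frac{d(\lambda 5972)}{d(\lambda 6355)}

where d(λ) denotes the fractional depth of the absorption feature near that wavelength.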
@ARTICLE{midzsne2009,
author = {{Altavilla}, G. and {Ruiz-Lapuente}, P. and {Balastegui}, A. and {Mendez}, J. and {Irwin}, M. and
{Espa{\~n}a-Bonet}, C. and {Ellis}, R.~S. and {Folatelli}, G. and {Goobar}, A. and {Hillebrandt}, W.
and {McMahon}, R.~M. and {Nobili}, S. and {Stanishev}, V. and {Walton}, N.~A.},
title = "{Type Ia SNe along redshift: the R(Si II) ratio and the expansion velocities in intermediate z supernovae}",
journal = {Astrophysical Journal},
eprint = {arXiv:astro-ph/0610143},
year = 2009,
month = apr,
volume = 695,
pages = {135-148},
doi = {10.1088/0004-637X/695/1/135},
}
Discriminative learning within Arabic Statistical Machine Translation
Cristina España-Bonet, Jesús Giménez, Lluís Màrquez
Research Report LSI-09-3-R
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
Written Arabic is especially ambiguous due to the lack of diacritisation of texts, and this makes translation harder for automatic systems that do not take into account the context of phrases. Here, we use a standard Phrase-Based Statistical Machine Translation architecture to build an Arabic-to-English translation system, but we extend it by incorporating a local discriminative phrase selection model which addresses this semantic ambiguity. Local classifiers are trained using both linguistic information and context to translate a phrase, and this significantly increases the accuracy in phrase selection with respect to the most frequent translation traditionally considered. These classifiers are integrated into the translation system so that the global task benefits from the discriminative learning. As a result, we obtain improvements in the full translation of Arabic documents at the lexical, syntactic and semantic levels as measured by a heterogeneous set of automatic metrics.
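A toy version of such a local classifier with scikit-learn; the context features, transliterations and labels are invented for illustration and far simpler than those used in the report:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each instance: context features of one occurrence of an ambiguous source
# phrase; the label is the target phrase chosen in the reference translation.
X = [{"prev": "fi", "next": "albnk", "pos": "NN"},
     {"prev": "ila", "next": "alnhr", "pos": "NN"},
     {"prev": "fi", "next": "albnk", "pos": "NN"}]
y = ["bank", "shore", "bank"]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(X, y)

# The class probabilities can then enter the phrase table as an additional
# feature of the log-linear model, as described above.
print(clf.predict([{"prev": "fi", "next": "albnk", "pos": "NN"}]))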
@TechReport{cespanaLSI093R,
author = {{Espa{\~n}a-Bonet}, C. and {Gim\'enez}, J. and {M\`arquez}, L.},
title = {Discriminative learning within Arabic Statistical Machine Translation},
institution = {LSI, UPC},
year = {2009},
month = {January},
type = {Research Report},
number = {LSI-09-3-R}
}
2008
The UPC-LSI Discriminative Phrase Selection System: NIST MT Evaluation 2008
Cristina España-Bonet, Jesús Giménez, Lluís Màrquez
Proceedings of the 2008 NIST Open Machine Translation Evaluation Workshop
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
This document describes the system developed by the Empirical MT Group at the Technical University of Catalonia, LSI Department, for the Arabic-to-English task at the 2008 NIST MT Evaluation Campaign. Our system explores the application of discriminative learning to the problem of phrase selection in Statistical Machine Translation. Instead of relying on Maximum Likelihood estimates for the construction of translation models, we use local classifiers which
are able to take further advantage of contextual information. Local predictions are softly integrated into a global log-linear phrase-based statistical MT system as an additional feature. Automatic evaluation results according to a heterogeneous set of metrics operating at different linguistic levels are
presented. These show a low level of agreement between metrics. Improvements over the baseline are either nonexistent or not significant, except for the case of semantic metrics based on discourse representations and several syntactic metrics based on constituent and dependency parsing.
@InProceedings{nist08upc,
author = {{Espa{\~n}a-Bonet}, C. and {Gim\'enez}, J. and {M\`arquez}, L.},
title = {The UPC-LSI Discriminative Phrase Selection System: NIST MT Evaluation 2008},
year = {2008},
organization = {NIST Open Machine Translation Evaluation Workshop}
}
A proposal for an Arabic-to-English SMT
Cristina España-Bonet
Master Thesis, Universitat de Barcelona and Universitat Politècnica de Catalunya (Artificial Intelligence Program)
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
Snippet of the Introduction:
The aim of this work is to apply MT techniques to translate from Arabic to English in the context of the 2008 NIST Machine Translation Open evaluation. For the core of the system we choose an SMT architecture. With a standard SMT system we check the improvements given by adding linguistic information, that is,
maximise the probability not only of the sequence of words, but of its lemma, part-of-speech and chunk as well. We increase the amount of linguistic knowledge but we also increase the sparsity in the corpus because the combination of features increases the vocabulary. We explore several approaches to these combinations.
As a second method, we use machine learning (ML) techniques to select the most adequate translation phrases and combine them with the output of the SMT system. We treat the translation task as a classification problem and use the linguistic information and the context of each word as features to train the classifiers. This methodology is used in Word Sense Disambiguation and should help to select the
correct translation of a phrase according to its context. We analyse the results of this subtask and quantify the impact in the results. The output of this phase is inserted into the SMT system by enlarging the translation table with every sense of a phrase and with the inclusion of a new probability score, which accounts for the result of the classifier. We compare the results with and without this additional
information. This combination of SMT and ML, MLT, is our final proposal for the
Arabic-to-English SMT system.
@MastersThesis{crisSMTdea,
author = {{Espa{\~n}a-Bonet}, C.},
title = {A proposal for an Arabic-to-English SMT},
school = {Universitat de Barcelona and Universitat Polit\`ecnica de Catalunya},
year = 2008,
month = {February}
}
Exploring the evolution of dark energy and its equation of state
Cristina España-Bonet
Ph.D. Thesis, Universitat de Barcelona (Astronomy and Astrophysics Program)
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
Abstract: To be included
@PhdThesis{crisTesi,
author = {{Espa{\~n}a-Bonet}, C.},
title = {Exploring the evolution of dark energy and its equation of state},
school = {Departament d'Astronomia i Meteorologia, Universitat de Barcelona},
year = 2008,
month = {February}
}
Tracing the equation of state and the density of cosmological constant along z
Cristina España-Bonet, Pilar Ruiz-Lapuente
Journal of Cosmology and Astro-Particle Physics, vol 02, pag 18+, 2008
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
We investigate the equation of state w(z) in a non-parametric form using the latest compilations of the luminosity distance from SNe Ia at high z. We combine the inverse problem approach with a Monte Carlo method to scan the space of priors. In the light of the latest high redshift supernova data sets, we reconstruct w(z). A comparison between a sample including the latest results at z>1 and a sample without those results shows the improvement achieved through observations of very high z supernovae. We present the prospects for measuring the variation of dark energy density along z by this method.
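For reference, the standard flat-universe relations underlying this inversion (textbook formulas, included here to fix notation) are:

m(z) = M + 5\log_{10}\!\big(d_L(z)/10\,\mathrm{pc}\big), \qquad
d_L(z) = (1+z)\int_0^z \frac{c\,\mathrm{d}z'}{H(z')},

H^2(z) = H_0^2\left[\Omega_m (1+z)^3 + \Omega_{\mathrm{DE}}\, \exp\!\left(3\int_0^z \frac{1+w(z')}{1+z'}\,\mathrm{d}z'\right)\right],

so that the reconstructed w(z) is whatever dark-energy history makes H(z), and hence the SNe Ia magnitudes m(z), match the data.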
@ARTICLE{2008JCAP...02..018E,
author = {{Espa{\~n}a-Bonet}, C. and {Ruiz-Lapuente}, P.},
title = "{Tracing the equation of state and the density of the cosmological constant along z}",
journal = {Journal of Cosmology and Astro-Particle Physics},
archivePrefix = "arXiv",
eprint = {0805.1929},
year = 2008,
month = feb,
volume = 2,
pages = {18-+},
doi = {10.1088/1475-7516/2008/02/018},
adsurl = {http://adsabs.harvard.edu/abs/2008JCAP...02..018E},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
2006
Type Ia SNe along redshift: the R(Si II) ratio and the expansion velocities in intermediate z supernovae
G. Altavilla, P. Ruiz-Lapuente, A. Balastegui, J. Mendez, M. Irwin, C. España-Bonet, K. Schamaneche, C. Balland, R.S. Ellis, S. Fabbro, G. Folatelli, A. Goobar, W. Hillebrandt, R.M. McMahon, M. Mouchet, A. Mourao, S. Nobili, R. Pain, V. Stanishev, N.A. Walton
Submitted to The Astrophysical Journal
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
We study intermediate-z SNe Ia using the empirical physical diagrams which enable learning about those SNe explosions. This information can be very useful to reduce systematic uncertainties in the Hubble diagram of SNe Ia up to high z. The study of the expansion velocities and the measurement of the ratio R(SiII) allow subtyping those SNe Ia as done for nearby samples. The evolution of this ratio as seen in the diagram R(SiII)-(t), together with R(SiII)_max versus (B-V)_0, indicates consistency of the properties at intermediate z compared with local SNe. At intermediate z, the expansion velocities of Ca II and Si II are similar to those of the nearby counterparts. This is found in a sample of 6 SNe Ia in the range 0.033≤z≤0.329 discovered within the International Time Programme (ITP) of Cosmology and Physics with SNe Ia during the spring of 2002. Those supernovae were identified using the 4.2m William Herschel Telescope. Two SNe Ia at intermediate z were of the cool FAINT type, one being an SN1986G-like object highly reddened. The R(SiII) ratio, as well as subclassification of the SNe Ia beyond templates, helps to place SNe Ia in their sequence of brightness and to distinguish between reddened and intrinsically red supernovae. This test can be done with very high z SNe Ia and it will help to reduce systematic uncertainties due to extinction by dust. It should allow mapping the high-z sample into the nearby one.
@ARTICLE{2006astro.ph.10143A,
author = {{Altavilla}, G. and {Ruiz-Lapuente}, P. and {Balastegui}, A. and {Mendez}, J. and {Irwin}, M. and
{Espa{\~n}a-Bonet}, C. and {Schamaneche}, K. and {Balland}, C. and {Ellis}, R.~S. and {Fabbro}, S. and
{Folatelli}, G. and {Goobar}, A. and {Hillebrandt}, W. and {McMahon}, R.~M. and {Mouchet}, M. and
{Mourao}, A. and {Nobili}, S. and {Pain}, R. and {Stanishev}, V. and {Walton}, N.~A.},
title = "{Type Ia SNe along redshift: the R(Si II) ratio and the expansion velocities in intermediate z supernovae}",
journal = {ArXiv Astrophysics e-prints},
eprint = {arXiv:astro-ph/0610143},
keywords = {Astrophysics},
year = 2006,
month = oct,
adsurl = {http://adsabs.harvard.edu/abs/2006astro.ph.10143A},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
2004
Dark Energy as an Inverse Problem
Cristina España-Bonet, Pilar Ruiz-Lapuente
Poster at JENAM The many scales in the Universe, IAA, Granada, September 2004
[
Abstract
Postscript
JPG
Slides
BibTeX
arXiv
]
In order to improve the information on dark energy, it is not only important to have a large number of good-quality data, but also to know where these data are most profitable and then to exploit all the statistical methods to extract the information. We apply here the Inverse Problem Theory to determine the parameters appearing in the equation of state and the functional form itself. Using this method, we also determine which would be the best distribution of high-redshift data to study the equation of state of dark energy, i.e., which distribution yields the best quality of the inversion. Supernova magnitudes are used alone and together with other sources such as radio galaxies and compact radio sources.
Viabilitat d'una Constant Cosmològica variable. Contrast amb SNeIa.
Cristina España-Bonet
Master Thesis (DEA), Universitat de Barcelona (Astronomy and Astrophysics Program)
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
This work analyses in detail the behaviour of the cosmological constant from the point of view of quantum field theory. Once its evolution at low energies is obtained, it allows us to compare the predictions of the model with other forms of dark energy, using SNe Ia at high redshift as the tool for this comparison.
From the change of their magnitude with redshift, we verify that this family of models is perfectly compatible with the observations and, therefore, the possibility that the cosmological constant evolves with time cannot be ruled out. The consideration of different scenarios makes it possible to fit parameters such as the mass of the light neutrinos (mν ~ 0.01 eV), to test the compatibility of the standard model of particle physics with astrophysical data, and to determine parameters related to the physics that may operate at the Planck epoch. Beyond the results obtained with current data, several simulations of future data sets, such as those of the SNAP project, help assess the testability of the model. Thus, the projects planned to obtain high-redshift SNe Ia for the study of dark energy will be sufficient, in most cases, to confirm or rule out the evolution of the cosmological constant.
@MastersThesis{crisAstroDEA,
author = {{Espa{\~n}a-Bonet}, C.},
title = {Viabilitat d'una Constant Cosmol\`ogica variable. Contrast amb SNeIa.},
school = {Dept. Astronomia i Meteorologia, Universitat de Barcelona},
year = {2004},
month = {September}
}
Testing the running of the cosmological constant with Type Ia Supernovae at high z
Cristina España-Bonet, Pilar Ruiz-Lapuente, Ilya L. Shapiro, Joan Solà
Journal of Cosmology and Astro-Particle Physics, vol 02, pag 6+, 2004
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
Within the Quantum Field Theory context the idea of a cosmological constant (CC) evolving with time looks quite natural, as it just reflects the change of the vacuum energy with the typical energy of the universe. In the particular frame of Ref. [30], a running CC at low energies may arise from generic quantum effects near the Planck scale, M_P, provided there is a smooth decoupling of all massive particles below M_P. In this work we further develop the cosmological consequences of a running CC by addressing the accelerated evolution of the universe within that model. The rate of change of the CC stays slow, without fine-tuning, and is comparable to H^2 M_P^2. It can be described by a single parameter, ν, that can be determined from already planned experiments using SNe Ia at high z. The range of allowed values for ν follows mainly from nucleosynthesis restrictions. Present samples of SNe Ia cannot yet distinguish between a constant CC and a running one. The numerical simulations presented in this work show that SNAP can probe the predicted variation of the CC, either ruling out this idea or confirming the evolution expected hereafter.
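A hedged reconstruction of the running law for this family of models (the form commonly quoted in the literature on this framework; the exact coefficients and conventions should be checked against the paper itself):

\frac{\mathrm{d}\rho_\Lambda}{\mathrm{d}\ln H} = \frac{3\nu}{4\pi}\, M_P^2\, H^2 \quad\Longrightarrow\quad \rho_\Lambda(H) = \rho_\Lambda^0 + \frac{3\nu}{8\pi}\, M_P^2 \left(H^2 - H_0^2\right).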
@ARTICLE{2004JCAP...02..006E,
author = {{Espa{\~n}a-Bonet}, C. and {Ruiz-Lapuente}, P. and {Shapiro}, I.~L. and {Sol{\`a}}, J.},
title = "{Testing the running of the cosmological constant with type Ia supernovae at high z}",
journal = {Journal of Cosmology and Astro-Particle Physics},
eprint = {arXiv:hep-ph/0311171},
year = 2004,
month = feb,
volume = 2,
pages = {6-+},
adsurl = {http://adsabs.harvard.edu/abs/2004JCAP...02..006E},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
2003
Variable Cosmological Constant as a Planck scale effect
Ilya L. Shapiro, Joan Solà, Cristina España-Bonet, Pilar Ruiz-Lapuente
Physics Letters B, 574, pag 149-155, 2003
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
We construct a semiclassical FLRW cosmological model assuming a running cosmological constant (CC). It turns out that the CC becomes variable at arbitrarily low energies due to the remnant quantum effects of the heaviest particles, e.g. the Planck scale physics. These effects are universal in the sense that they lead to a low-energy structure common to a large class of high-energy theories. Remarkably, the uncertainty concerning the unknown high-energy dynamics is accumulated into a single parameter ν, such that the model has an essential predictive power. Future Type Ia supernovae experiments (like SNAP) can verify whether this framework is correct. For the flat FLRW case and a moderate value ν~0.01, we predict an increase of 10-20% in the value of ΩΛ at redshifts z=1-1.5, perfectly reachable by SNAP.
@ARTICLE{2003PhLB..574..149S,
author = {{Shapiro}, I.~L. and {Sol{\`a}}, J. and {Espa{\~n}a-Bonet}, C. and {Ruiz-Lapuente}, P.},
title = "{Variable cosmological constant as a Planck scale effect}",
journal = {Physics Letters B},
eprint = {arXiv:astro-ph/0303306},
year = 2003,
month = nov,
volume = 574,
pages = {149-155},
doi = {10.1016/S0370-2693(03)01376-5},
adsurl = {http://adsabs.harvard.edu/abs/2003PhLB..574..149S},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
Supernovae and Cosmology
Cristina España-Bonet
Talk given at Dpt. Estructura i Constituents de la Matèria (Universitat de Barcelona), Dpt. Física i Enginyeria Nuclear (Universitat Politècnica de Catalunya) and Institut de Física d'Altes Energies.
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
Type Ia supernovae (SNe Ia) are the only standard candles known at high redshift, which has turned them into one of the main tools of cosmology. This talk explains how the determination of their luminosity distance makes it possible to discriminate between different cosmological models (with particular attention to how the now widely accepted conclusion that we live in an accelerating universe dominated by the cosmological constant was reached) and reviews the current state of this research.
2002
Present-day running of the cosmological constant
Cristina España-Bonet, Pilar Ruiz-Lapuente
Poster at the Winter School Dark matter and dark energy in the Universe, IAC, Tenerife, November 2002
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
A particle physics account of the cosmological constant is given through two
different approaches. Both of them use the equations of quantum field theory
and so the cosmological "constant" has its own renormalization group
equation (RGE). The obtained running is then introduced into the theoretical
expression for the magnitude-redshift relation, so that a minimization of
the residuals with the observational data from supernovae (SN) allows us to
fit some parameters. Among the latter are the lightest neutrino masses, for
which the best value is mν = 0.004-0.005 eV (with the possible
presence of a sterile light field). Future applications of the type of
analysis presented here are finally pointed out.
Present-day running of the cosmological constant
Cristina España-Bonet, Pilar Ruiz-Lapuente
Proceedings from On the nature of dark energy, IAP, Paris, July 2002
[
Abstract
Postscript
PDF
Slides
BibTeX
arXiv
]
A particle physics account of the cosmological constant is given through two
different approaches. Both of them use the equations of quantum field theory
and so the cosmological "constant" has its own renormalization group
equation (RGE). The obtained running is then introduced into the theoretical
expression for the magnitude-redshift relation, so that a minimization of
the residuals with the observational data from supernovae (SN) allows us to
fit some parameters. Among the latter are the lightest neutrino masses, for
which the best value is mν = 0.004-0.005 eV (with the possible
presence of a sterile light field). Future applications of the type of
analysis presented here are finally pointed out.