Developing Multilingual Web-scale Language Technologies IST-2001-34460


The project will be funded by the EU 5th Framework IST Programme (subject to successful contract negotiations). The project duration is 3 years with a likely start date of 1st March 2002.

Project Summary

MEANING will be concerned with automatically collecting and analysing language data from the WWW on a large scale, and building more comprehensive multilingual lexical knowledge bases to support improved word sense disambiguation (WSD).

Current web access applications are based on words; MEANING will open the way for access to the Multilingual Web based on concepts, providing applications with capabilities that significantly exceed those currently available. MEANING will facilitate the development of concept-based open domain Internet applications (such as Question Answering, Cross-lingual Information Retrieval, Summarisation, Text Categorisation, Event Tracking, Information Extraction and Machine Translation). Furthermore, MEANING will supply a common conceptual structure to Internet documents, thus facilitating knowledge management of web content.

Progress is being made in Human Language Technology (HLT), but there is still a long way to go towards Natural Language Understanding (NLU). An important step towards this goal is the development of technologies and resources that deal with concepts rather than words. MEANING will develop concept-based technologies and resources through large-scale knowledge processing over the web, robust and fast machine learning algorithms, very large lexical resources and novel strategies for combining them. Small-scale, isolated experiments with limited infrastructure (such as Internet access, processing power, and storage space) have no chance of bridging the gap to understanding. Advances in this area can only be expected in the context of large-scale, long-term research projects.

MEANING will treat the web as a (huge) corpus to learn from, since even the largest conventional corpora available (e.g. the Reuters corpus, the British National Corpus) are not large enough to yield reliable and sufficiently detailed information about language behaviour. Moreover, most European languages do not have large or diverse enough corpora available.

Even now, building large and rich knowledge bases takes a great deal of expensive manual effort; this has severely hampered HLT application development. For example, dozens of person-years have been invested in the development of wordnets for various languages, but the data in these resources is still not sufficiently rich to support advanced concept-based HLT applications directly. Furthermore, resources produced by introspection usually fail to register what really occurs in texts. Applications will not scale up to working in the open domain without more detailed and rich general-purpose and domain-specific linguistic knowledge. To be able to build the next generation of intelligent open domain HLT application systems, we need to solve two complementary intermediate tasks: Word Sense Disambiguation (WSD) and large-scale enrichment of Lexical Knowledge Bases. However, progress is difficult due to the following paradox:

In order to enrich Lexical Knowledge Bases, we need to acquire information from corpora that have been accurately tagged with word senses.

In order to achieve accurate WSD, we need far more linguistic and semantic knowledge than is available in current lexical knowledge bases.
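
The circular dependency above can be illustrated with a toy sketch. It is not the MEANING method: it assumes a simple word-overlap (Lesk-style) disambiguator and an alternating tag-then-enrich loop, and the sense labels, seed words, and contexts are all invented for illustration. It only shows how a slightly richer knowledge base lets more text be sense-tagged, which in turn yields more knowledge.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Hypothetical toy types: a knowledge base maps each sense to related words,
# and a context is the set of words found around one occurrence of "bank".
Lkb = Dict[str, Set[str]]
Context = FrozenSet[str]


def disambiguate(context: Context, lkb: Lkb) -> Optional[str]:
    """Return the sense whose related words overlap the context most.

    Returns None when no sense overlaps at all or when the best score is tied.
    """
    scores = {sense: len(related & context) for sense, related in lkb.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


def enrich(lkb: Lkb, tagged: List[Tuple[Context, str]]) -> Lkb:
    """Return a copy of the knowledge base extended with tagged context words."""
    enriched = {sense: set(related) for sense, related in lkb.items()}
    for context, sense in tagged:
        enriched[sense] |= context
    return enriched


def bootstrap(contexts: List[Context], lkb: Lkb, rounds: int) -> Tuple[Lkb, List[int]]:
    """Alternate sense tagging and enrichment; record how many contexts get tagged."""
    tagged_per_round = []
    for _ in range(rounds):
        tagged = []
        for context in contexts:
            sense = disambiguate(context, lkb)
            if sense is not None:
                tagged.append((context, sense))
        tagged_per_round.append(len(tagged))
        lkb = enrich(lkb, tagged)
    return lkb, tagged_per_round


# Invented seed knowledge and contexts for the ambiguous word "bank".
seed_lkb = {"bank/finance": {"money"}, "bank/river": {"water"}}
contexts = [
    frozenset({"money", "loan"}),
    frozenset({"loan", "interest"}),
    frozenset({"water", "shore"}),
    frozenset({"shore", "fishing"}),
]

final_lkb, tagged_per_round = bootstrap(contexts, seed_lkb, rounds=2)
print(tagged_per_round)  # [2, 4]
```

With the seed knowledge only two of the four contexts can be tagged; after one enrichment step all four can. A loop this naive would also propagate any tagging error into the knowledge base, which is one reason the task is hard in practice.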

The major objective of MEANING is to develop innovative technology to solve this problem. MEANING will use state-of-the-art NLP techniques pioneered by the consortium to enhance EuroWordNet with mainly language-independent lexico-semantic (concept) information. We will use a combination of Machine Learning and Knowledge-Based techniques in order to enrich the structure of the wordnets in different domains (subsets of the web) in five European languages: English, Italian, Spanish, Catalan and Basque. The core technology used by MEANING will include tools to perform language identification, morphological analysis, part-of-speech tagging, named-entity recognition and classification, sentence boundary detection, shallow parsing and text categorisation. MEANING will produce:

MEANING will also develop a Multilingual Central Repository to maintain compatibility between wordnets of different languages and versions, past and new. The knowledge acquired from each language will be consistently uploaded to the Multilingual Central Repository and ported over to the other wordnets involved in the project. MEANING will also produce semantically annotated corpora for each wordnet word sense, that is, a Multilingual Web corpus annotated with concept and domain labels.

All of these tools and data will be readily usable by users of different wordnets (including EuroWordNet and future versions of WordNet financed by the NSF), using automatic tools for mapping the concepts between the different versions. Enriching EuroWordNet with mostly language-independent information will allow us to port newly acquired semantic information from one language to the others. This will be possible because a large portion of EuroWordNet's conceptual structure is language independent.
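
The porting idea can be sketched in a few lines, under the assumption that each wordnet links its words to shared, language-independent concept identifiers. The class name, the identifier `c-bank-1`, and the `economy` domain label are hypothetical and are not taken from the project; the sketch only shows why knowledge attached to a concept becomes available to every language that shares it.

```python
from collections import defaultdict
from typing import Dict, Set


class CentralRepository:
    """Toy repository keyed by language-independent concept identifiers."""

    def __init__(self) -> None:
        # language code -> concept id -> words expressing that concept
        self.wordnets: Dict[str, Dict[str, Set[str]]] = {}
        # concept id -> knowledge labels acquired from any language
        self.knowledge: Dict[str, Set[str]] = defaultdict(set)

    def add_wordnet(self, lang: str, synsets: Dict[str, Set[str]]) -> None:
        self.wordnets[lang] = synsets

    def upload(self, lang: str, word: str, concept_id: str, label: str) -> None:
        """Attach a label learned for one language's word to the shared concept."""
        if word not in self.wordnets[lang].get(concept_id, set()):
            raise KeyError(f"{word!r} is not linked to {concept_id!r} in {lang!r}")
        self.knowledge[concept_id].add(label)

    def port(self, lang: str) -> Dict[str, Set[str]]:
        """Return word -> labels for one language, derived via shared concepts."""
        ported: Dict[str, Set[str]] = {}
        for concept_id, words in self.wordnets[lang].items():
            labels = self.knowledge.get(concept_id, set())
            for word in words:
                ported.setdefault(word, set()).update(labels)
        return ported


repo = CentralRepository()
# Hypothetical concept id shared by three of the project's languages.
repo.add_wordnet("en", {"c-bank-1": {"bank"}})
repo.add_wordnet("es", {"c-bank-1": {"banco"}})
repo.add_wordnet("it", {"c-bank-1": {"banca"}})

# A domain label acquired from English data is uploaded once...
repo.upload("en", "bank", "c-bank-1", "economy")

# ...and becomes available to the other wordnets through the shared concept.
print(repo.port("es"))  # {'banco': {'economy'}}
print(repo.port("it"))  # {'banca': {'economy'}}
```

Storing knowledge on the concept rather than on the word is the design choice that makes porting a lookup instead of a translation step; it only works for the portion of the conceptual structure that the languages actually share.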

Research in MEANING will also cover new methods for terminology acquisition, keyword identification, topic detection, domain classification, text classification and wordnet adaptation (including identification of new senses and clustering of concept sets).

The results provided by MEANING will be directly usable by any multilingual Internet application. MEANING will release a Showcase for evaluating the products of the project. The Showcase will include test beds and demonstrations of the enhanced wordnets in WSD, concept-based Cross-lingual Information Retrieval and multilingual Question Answering systems, which will aim to show improvement over a state-of-the-art, traditional word-based baseline system.

Internal documents

Short Presentation of the MEANING project

Technical Annex of the MEANING project

Some Background References

Agirre, E. and D. Martinez (2001) Learning class-to-class selectional preferences. In Proceedings of the Workshop on Computational Natural Language Learning (CoNLL-2001) at ACL/EACL'01. Toulouse, France.

Escudero, G., L. Marquez and G. Rigau (2000) An empirical study of the domain dependence of supervised word sense disambiguation systems. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC'00). Hong Kong, China.

Magnini, B. and C. Strapparava (2000) Experiments in word domain disambiguation for parallel texts. In Proceedings of the ACL Workshop on Word Senses and Multilinguality. Hong Kong, China.

McCarthy, D., J. Carroll and J. Preiss (2001) Disambiguating noun and verb senses using automatically acquired selectional preferences. In Proceedings of the SENSEVAL-2 Workshop at ACL/EACL'01. Toulouse, France.

Vossen, P. (1999) EuroWordNet General Document. EuroWordNet LE2-4003, LE4-8328.