Project Summary
MEANING will automatically collect and analyse language data from the WWW
on a large scale, and build more comprehensive multilingual lexical
knowledge bases to support improved word sense disambiguation (WSD).
Current web access applications are based on words; MEANING will open
the way for access to the Multilingual Web based on concepts, providing
applications with capabilities that significantly exceed those currently
available. MEANING will facilitate development of concept-based open domain
Internet applications (such as Question Answering, Cross-Lingual Information
Retrieval, Summarisation, Text Categorisation, Event Tracking, Information
Extraction, Machine Translation, etc.). Furthermore, MEANING will supply
a common conceptual structure to Internet documents, thus facilitating
knowledge management of web content.
Progress is being made in Human Language Technology (HLT), but there
is still a long way to go towards Natural Language Understanding (NLU). An important
step towards this goal is the development of technologies and resources
that deal with concepts rather than words. MEANING will develop concept-based
technologies and resources through large-scale knowledge processing over
the web, robust and fast machine learning algorithms, very large lexical
resources and novel strategies for combining them. Small-scale, isolated
experiments with limited infrastructure (such as Internet access, processing
power, and storage space) have no chance of bridging the gap to understanding.
Advances in this area can only be expected in the context of large-scale
long-term research projects.
MEANING will treat the web as a (huge) corpus to learn information from,
since even the largest conventional corpora available (e.g. the Reuters
corpus, the British National Corpus) are not large enough to yield
reliable information in sufficient detail about language behaviour.
Moreover, most European languages do not have large or diverse enough corpora
available.
Even now, building large and rich knowledge bases takes a great deal
of expensive manual effort; this has severely hampered HLT application
development. For example, dozens of person-years have been invested in
the development of wordnets for various languages, but the data in these
resources is still not sufficiently rich to support advanced concept-based
HLT applications directly. Furthermore, resources produced by introspection
usually fail to register what really occurs in texts. Applications will
not scale up to working in the open domain without more detailed and rich
general-purpose and also domain-specific linguistic knowledge. To be able
to build the next generation of intelligent open domain HLT application
systems we need to solve two complementary intermediate tasks: Word Sense
Disambiguation (WSD) and large-scale enrichment of Lexical Knowledge Bases.
However, progress is difficult due to the following paradox:
- In order to enrich Lexical Knowledge Bases, we need to acquire information
from corpora that have been accurately tagged with word senses.
- In order to achieve accurate WSD, we need far more linguistic and semantic
knowledge than is available in current lexical knowledge bases.
The major objective of MEANING is to develop innovative technology to solve this
problem. MEANING will use state-of-the-art NLP techniques pioneered by
the consortium to enhance EuroWordNet with mainly language-independent
lexico-semantic (concept) information. We will use a combination of Machine
Learning and Knowledge-Based techniques in order to enrich the structure
of the wordnets in different domains (subsets of the web) in five European
languages: English, Italian, Spanish, Catalan and Basque. The core technology
used by MEANING will include tools to perform language identification,
morphological analysis, part-of-speech tagging, named-entity recognition
and classification, sentence boundary detection, shallow parsing and text
categorization. MEANING will produce:
- A Tool Set for automatically obtaining large collections of concept-based
data sets from the web. This Tool Set will use the semantic knowledge
of EuroWordNet to collect large numbers of examples from the web for each
particular word sense.
- A Tool Set for automatically enriching EuroWordNet. The knowledge acquired
using these tools will support the interface between the syntactic and
the semantic layers. This Tool Set will include a set of specific tools
for acquiring information including domain terminology, new senses, clusters
of related senses, topic signatures, Diathesis Alternations, Subcategorization
Frames (including prepositional constraints), Selectional Preferences (e.g.
typical objects and subjects), and specific lexico-semantic relations
(e.g. purpose and location).
- A Tool Set for accurately selecting the senses of open-class words
in the languages involved in the project. This WSD system will rely on
robust, advanced Machine Learning algorithms able to model the behaviour
of each word sense from labelled and unlabelled text.
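As an illustration only, one way the first Tool Set's sense-specific example
collection could work is to build a separate web query for each word sense
from words related to that sense. The sketch below is hypothetical: the toy
sense inventory, the sense identifiers, the build_queries function and the
quote/OR/minus query syntax are assumptions made for this example and are
not taken from EuroWordNet or from the MEANING tools.

```python
from typing import Dict, List

# Toy sense inventory (hypothetical data, not extracted from EuroWordNet).
# Each sense lists a few related words, e.g. synonyms or hypernyms.
SENSES: Dict[str, Dict[str, object]] = {
    "bank#1": {"lemma": "bank",
               "relatives": ["financial institution", "deposit", "loan"]},
    "bank#2": {"lemma": "bank",
               "relatives": ["river", "slope", "shore"]},
}


def build_queries(senses: Dict[str, Dict[str, object]]) -> Dict[str, str]:
    """Build one query string per sense.

    A query requires the lemma, asks for at least one relative of the
    target sense, and excludes relatives of competing senses of the
    same lemma. The query syntax is an assumed, generic one.
    """
    queries: Dict[str, str] = {}
    for sense_id, entry in senses.items():
        # Relatives of the other senses that share this lemma.
        competing: List[str] = [
            relative
            for other_id, other in senses.items()
            if other_id != sense_id and other["lemma"] == entry["lemma"]
            for relative in other["relatives"]
        ]
        include = " OR ".join(f'"{r}"' for r in entry["relatives"])
        exclude = " ".join(f'-"{r}"' for r in competing)
        queries[sense_id] = f'"{entry["lemma"]}" ({include}) {exclude}'.strip()
    return queries


if __name__ == "__main__":
    for sense_id, query in build_queries(SENSES).items():
        print(sense_id, "=>", query)
```

Text returned for each query could then be stored as candidate examples of
the corresponding sense. Such examples would be noisy, so they would still
need filtering before being used as training data.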
MEANING will also develop a Multilingual Central Repository to maintain
compatibility between wordnets of different languages and versions, past
and new. The acquired knowledge from each language will be consistently
uploaded to the Multilingual Central Repository and ported over to the
other wordnets involved in the project. MEANING will also produce semantically
annotated examples for each wordnet word sense, that is, a Multilingual Web
corpus annotated with concept and domain labels.
All of these tools and data will be readily usable by users of different
wordnets (including EuroWordNet and future versions of the WordNet financed
by the NSF), using automatic tools for mapping the concepts between the
different versions. Enriching EuroWordNet with mostly language-independent
information will allow us to port newly acquired semantic information from
one language to the others. This will be possible because a large portion
of EuroWordNet's conceptual structure is language independent.
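The porting idea can be sketched as follows, under stated assumptions: if
acquired knowledge is stored on language-independent concept identifiers,
it can be projected onto the words that express those concepts in any other
language. The concept identifiers, the relation name, the tiny lexicons and
the port_relations function below are hypothetical and serve only to
illustrate the principle.

```python
from typing import Dict, List, Tuple

# Hypothetical concept identifiers shared by all wordnets, with the
# words that express each concept in two languages (toy data).
LEXICALISATIONS: Dict[str, Dict[str, List[str]]] = {
    "en": {"c_drink": ["drink"], "c_beverage": ["beverage"]},
    "es": {"c_drink": ["beber"], "c_beverage": ["bebida"]},
}

# A relation assumed to have been acquired from English text and then
# stored at the concept level rather than at the word level.
ACQUIRED: List[Tuple[str, str, str]] = [
    ("c_drink", "typical_object", "c_beverage"),
]


def port_relations(
    relations: List[Tuple[str, str, str]],
    lexicon: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """Project concept-level relations onto the words of one language."""
    ported: List[Tuple[str, str, str]] = []
    for source, relation, target in relations:
        # Concepts with no word in this language yield no relation.
        for source_word in lexicon.get(source, []):
            for target_word in lexicon.get(target, []):
                ported.append((source_word, relation, target_word))
    return ported


if __name__ == "__main__":
    print(port_relations(ACQUIRED, LEXICALISATIONS["es"]))
```

Concepts that have no lexicalisation in the target language simply produce
no word-level relation, which is why only the language-independent part of
the structure can be ported in this way.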
Research in MEANING will also cover new methods for terminology acquisition,
keyword identification, topic detection, domain classification, text classification
and wordnet adaptation (including identification of new senses and clustering
of concept sets).
The results provided by MEANING will be directly usable by any multilingual
Internet application. MEANING will release a Showcase for evaluating the
products of the project. The Showcase will include test beds and demonstrations
of the enhanced wordnets in WSD, concept-based Cross-Lingual Information
Retrieval and multilingual Question Answering systems, which will aim to
show improvement over a state-of-the-art traditional word-based baseline
system.