The data for the QAst pilot track consists of
two different resources, one for dealing with the lecture scenario and
the other for dealing with the meeting scenario:
Lecture scenario: the CHIL corpus will be
used. It consists
of around 25 hours (around 1 hour per lecture), both manually
and automatically transcribed (LIMSI has volunteered to
produce the ASR transcriptions). In addition, the set of
lattices and confidences for each lecture will probably be provided (also
produced by LIMSI). The domain of the lectures is
"speech and language processing". The language is
European English, mostly spoken by non-native speakers.
Meeting scenario: the AMI corpus will be
used. It consists of
around 100 hours (168 of the 170 meetings in the full set), both
manually and
automatically transcribed (the AMI RT 2006 ASR will be
used). The domain of these meetings is "design of television
remote control". The language is European English, mostly
spoken by non-native speakers.
WARNING: Although the public release of the AMI corpus
contains hand-annotated named entities, they cannot be
used in any way during the QAst evaluation.
The data will be prepared and made available by ELDA.
Instructions to get the data
If you are interested in participating in the QAst tasks,
please download this document and follow the instructions
it contains in order to obtain the appropriate data
from ELDA.