
Current Question Answering (QA) technology focuses mainly on mining written text sources to extract answers to questions, both from open-domain and from restricted-domain document collections. However, most human interaction occurs through speech, e.g. meetings, seminars, lectures and telephone conversations. All these scenarios provide large amounts of information that could be mined by QA systems. Exploiting speech sources therefore brings QA a step closer to many real-world applications.

In addition, speech transcriptions differ from classical written text in many respects, which makes QA on speech transcriptions an interesting research area. The most common differences are:
  1. The repetition of words (e.g., "/I don't know where where the people will be/").
  2. The use of onomatopoeias.
  3. The lack of punctuation marks.
  4. The lack of capitalization.
  5. The presence of word errors introduced by automatic speech recognizers (ASR). Typical errors stem from words missing from the language model (e.g., proper names in general) or poorly represented in the acoustic model. In general, these errors are substitutions of one word sequence for another (e.g., "feature" -> "feather", "Barcelona" -> "bars alone"), never the typographical errors found in written text.
These differences are the reasons why extracting answers from transcribed speech requires more flexible QA architectures than those typically used for written text.
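Some of these differences can be smoothed over with simple preprocessing before matching questions against transcripts. The sketch below is purely illustrative (it is not part of any official QAST baseline): it lowercases and strips punctuation so written text matches transcript conventions, and collapses the immediate word repetitions typical of spontaneous speech.

```python
import re

def normalize_to_transcript_style(text: str) -> str:
    """Lowercase and strip punctuation so written text matches the
    conventions of an ASR transcript (differences 3 and 4 above)."""
    text = text.lower()
    return re.sub(r"[^\w\s']", "", text).strip()

def collapse_repetitions(words: list[str]) -> list[str]:
    """Remove immediate word repetitions typical of spontaneous
    speech (difference 1 above)."""
    out: list[str] = []
    for w in words:
        if not out or out[-1] != w:
            out.append(w)
    return out

transcript = "i don't know where where the people will be"
print(collapse_repetitions(transcript.split()))
print(normalize_to_transcript_style("Where did Jacques Chirac go?"))
```

Real systems would of course need far more robust normalization, but this illustrates why QA pipelines built for clean written text need an extra adaptation layer for speech.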

General goal:

The aim of this third year of QAST is to provide a framework in which QA systems can be evaluated in a realistic scenario, where answers to oral and written questions (factual and definitional) in English, French and Spanish have to be extracted from speech transcriptions (both manual and automatic) in the respective language.
The particular scenario consists of answering oral and written questions about speech presentations: European Parliament sessions (Spanish or English) and Broadcast News (French).
Relevant points will be:
  1. Comparing system performance on spontaneous oral questions versus written questions, with answers extracted from both types of speech transcription (manual and automatic) and for both types of questions (factual and definitional).
  2. Measuring the performance loss of each system due to state-of-the-art ASR technology.
  3. Measuring the performance loss of each system as the quality of the ASR output degrades.
  4. More generally, motivating and driving the design of novel, robust QA architectures for automatic speech transcriptions.
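Points 2 and 3 amount to computing the same evaluation metric under each transcription condition and comparing. A minimal sketch using mean reciprocal rank (MRR), a ranking metric used in earlier QAST editions; the result lists here are invented purely for illustration:

```python
def mean_reciprocal_rank(results):
    """results: one list per question, each a list of booleans in
    system rank order, True marking a correct answer. Each question
    contributes 1/rank of its first correct answer (0 if none)."""
    score = 0.0
    for flags in results:
        for rank, correct in enumerate(flags, start=1):
            if correct:
                score += 1.0 / rank
                break
    return score / len(results)

# Hypothetical outcomes for three questions under two conditions.
manual = [[True], [False, True], [True]]       # MRR = (1 + 0.5 + 1) / 3
automatic = [[False, True], [False], [True]]   # MRR = (0.5 + 0 + 1) / 3
loss = mean_reciprocal_rank(manual) - mean_reciprocal_rank(automatic)
print(f"MRR loss due to ASR: {loss:.3f}")
```

The same comparison can be repeated across ASR outputs of decreasing quality to measure how gracefully a system degrades.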
The proposed evaluation of QA on automatic speech transcriptions is best understood from the perspective of the target application: searching audio streams with natural-language questions. In this application, the input is a spontaneous oral question or a written question, which is matched against the automatic transcriptions generated behind the scenes for all available audio streams. However, even though the QA system searches automatic transcriptions, the output presented to the user is a pair of start/end pointers into the audio stream where the exact answer is located.

Consider the following example: one audio stream contains the information "Jacques Chirac went to Berlin" and the user wants to know where the French president has been: "Where did Jacques Chirac go?". If perfect transcriptions of the audio stream were available, this example would have an obvious solution and the whole problem would be no different from regular QA on written text. However, consider the case when the automatic transcription of the above stream contains two errors: "went" is transcribed as "ate" and "Berlin" as "Barcelona". Hence the automatic transcription of the full stream is: "Jacques Chirac ate to Barcelona". In this case, the correct answer to be extracted is "Barcelona", because this is the text that points to the correct answer in the audio stream.

The above example illustrates the two principles that guide this track:
  1. The questions must be generated considering the exact information in the audio stream, regardless of how this information is transcribed, because the transcription process is transparent to the user. In other words, in the above example, the question should focus on where the president went, rather than on what he ate, which is the ASR error.
  2. The answer to be extracted (hence the answer to be annotated in the automatic transcription) is the pair <start-time, end-time> corresponding to the minimal sequence of consecutive words that includes the correct exact answer in the audio stream. In the above example, the answer to be extracted from the automatic transcription is "Barcelona", because this text gives the start/end pointers to the correct answer in the audio stream.
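The second principle can be sketched with word-level timestamps, which ASR systems typically emit alongside the transcription. The data structure and lookup below are a hypothetical illustration, not a prescribed track format: given the erroneous transcription from the example (with invented times), locating the answer text "barcelona" yields the <start-time, end-time> pointers into the audio stream.

```python
from dataclasses import dataclass

@dataclass
class TimedWord:
    word: str     # token from the automatic transcription
    start: float  # seconds into the audio stream
    end: float

def answer_span(transcript, answer_words):
    """Return the <start-time, end-time> pair for the first occurrence
    of answer_words (a list of tokens) in the timed transcript, or
    None if the answer does not occur."""
    n = len(answer_words)
    for i in range(len(transcript) - n + 1):
        if [t.word for t in transcript[i:i + n]] == answer_words:
            return transcript[i].start, transcript[i + n - 1].end
    return None

# The erroneous transcription from the example, with invented times.
stream = [TimedWord("jacques", 0.0, 0.4), TimedWord("chirac", 0.4, 0.9),
          TimedWord("ate", 0.9, 1.2), TimedWord("to", 1.2, 1.3),
          TimedWord("barcelona", 1.3, 1.9)]
print(answer_span(stream, ["barcelona"]))  # (1.3, 1.9)
```

Note that searching for the "true" answer "berlin" would fail here, which is exactly why the annotated answer must be the text as it appears in the automatic transcription.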