Current factual Question Answering (QA)
technology focuses mainly on mining written text
sources to extract answers to questions, both from
open-domain and restricted-domain document
collections. However, most human interaction occurs through
spontaneous speech, e.g. meetings, seminars, lectures,
and telephone conversations. All these scenarios provide large
amounts of information that could be mined by factual QA
systems. As a consequence, exploiting spontaneous speech sources brings
factual QA a step closer to many real-world applications.
In addition, spontaneous speech transcriptions differ from classical written text in many aspects, which makes QA on spontaneous speech transcriptions an interesting research area. The most common differences are:
The repetition of words (e.g., "I don't know where where the people will be").
The use of onomatopoeias.
The lack of punctuation marks.
The lack of capitalization.
The presence of word errors due to the use of automatic speech recognizers (ASR). Typical errors are caused by words missing from the language models (e.g., proper names in general) and by gaps in the acoustic model. In general, these errors are substitutions of word sequences with other ones (e.g., "feature" -> "feather", "Barcelona" -> "bars alone"), but never typographical errors.
These differences are the reasons why extracting answers from transcribed spontaneous speech requires more flexible factual QA architectures than those typically used for written text.
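As an illustration of why such flexibility is needed, the sketch below shows one of the simplest preprocessing steps a QA system for speech transcriptions might apply: collapsing the immediate word repetitions that are common disfluencies in spontaneous speech. The function name and example sentence are illustrative, not part of any particular system.

```python
def collapse_repetitions(transcript: str) -> str:
    """Collapse immediate word repetitions such as
    'where where' -> 'where', a common disfluency in
    spontaneous speech transcriptions."""
    cleaned = []
    for token in transcript.split():
        # Keep a token only if it differs from the previous one.
        if not cleaned or token != cleaned[-1]:
            cleaned.append(token)
    return " ".join(cleaned)

# A manual transcription with a repetition, and with the
# lack of punctuation and capitalization described above.
raw = "i don't know where where the people will be"
print(collapse_repetitions(raw))
# prints: i don't know where the people will be
```

Note that this handles only one of the listed differences; ASR substitution errors, in particular, cannot be repaired by surface normalization and are what motivates more robust QA architectures.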
Existing evaluation frameworks in the state
of the art do not evaluate factual QA systems on oral
transcriptions. The aim of this experimental pilot track is to
provide a framework in which factual QA systems can be
evaluated when the answers to factual questions written in English have
to be extracted from spontaneous speech transcriptions (both manual
and automatic) coming from different human
interaction scenarios. Relevant points will be:
Comparing the performances of the systems dealing with
both types of transcriptions.
Measuring the loss of each system due to the state of the
art ASR technology.
In general, motivating and driving the design of novel and
robust factual QA architectures for automatic speech
transcriptions.
The proposed evaluation of QA on automatic speech transcriptions can be best understood from the perspective of the target application: searching audio streams with natural language questions. In this application, the input is a written question that is matched against the automatic transcriptions generated behind the scenes for all the audio streams available. However, even though the QA system searches automatic transcriptions, the output made available to the user is start/end pointers in the audio stream where the exact answer is located.
Consider the following example: one audio stream contains the information "Jacques Chirac went to Berlin" and the user wants to know where the French president has been: "Where did Jacques Chirac go?". If perfect transcriptions of the audio stream were available, this example would have an obvious solution and the whole problem would be no different from regular QA on written text. However, consider the case when the automatic transcription of the above stream contains two errors: "went" is transcribed as "ate" and "Berlin" as "Barcelona". Hence the automatic transcription of the full stream is: "Jacques Chirac ate to Barcelona". In this case, the correct answer to be extracted is "Barcelona", because this is the text that points to the correct answer in the audio stream.
The above example illustrates the two principles that guide this evaluation:
The questions must be generated considering the exact information in the audio stream regardless of how this information is transcribed, because the transcription process is transparent to the user. In other words, in the above example, the question should focus on where the president went, rather than on what he ate, which was the ASR error.
The answer to be extracted (hence the answer to be annotated in the automatic transcription) is the minimal sequence of words that includes the correct exact answer in the audio stream. In the above example, the answer to be extracted from the automatic transcription is "Barcelona", because this text gives the start/end pointers to the correct answer in the audio stream.
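The mapping from an extracted answer back to audio pointers can be sketched as follows, assuming the ASR emits each word with its start/end time in seconds (all words and timings below are illustrative, based on the example above):

```python
# An automatic transcription as (word, start_sec, end_sec) tuples.
transcription = [
    ("jacques", 0.0, 0.4),
    ("chirac", 0.4, 0.9),
    ("ate", 0.9, 1.1),        # ASR error for "went"
    ("to", 1.1, 1.2),
    ("barcelona", 1.2, 1.9),  # ASR error for "Berlin"
]

def locate_answer(transcription, answer_words):
    """Return the (start, end) audio pointers of the first
    occurrence of answer_words in the transcription, or None."""
    words = [w for w, _, _ in transcription]
    n = len(answer_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == answer_words:
            return transcription[i][1], transcription[i + n - 1][2]
    return None

print(locate_answer(transcription, ["barcelona"]))
# prints: (1.2, 1.9)
```

This is why the answer annotated in the automatic transcription is "Barcelona" and not "Berlin": only the former occurs in the transcribed word sequence and therefore carries time pointers into the audio stream.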
This pilot track could be extended in
several directions in the coming years, bearing
in mind the interests of the
participating research groups. Possible directions are:
Adding oral questions. Most potential users of QA systems expect to ask questions orally, given that this is the most natural way of human interaction. Dealing with oral questions requires different procedures than those used for written questions, mainly due to the differences between spontaneous speech transcriptions and classical written text explained above. The evaluation of robust approaches to dealing with oral questions will provide a reference point for this topic in the state of the art.
Extending questions to languages other than English.
Adding dialog-driven question answering. Often,
the answer to a query is not satisfactory to the
user. This can occur with queries that are
difficult to interpret or difficult to
answer in the given document collection. A possible
way to proceed is to use dialog techniques that
give users the possibility to iteratively refine their
questions by taking into account the answers provided
by the QA system. However, when the input documents
are spontaneous speech transcriptions and oral
questions can occur, the possibility of
misunderstanding increases, and dialog techniques
may be insufficient within QA systems. Evaluating how
well these techniques perform in the context of dialog-driven QA for
speech transcriptions is beyond the state of the art, and it would
provide a reference point for the research on robust techniques for
this task.
Extending questions to types other than factual questions (definitional questions, opinion questions, etc.).