Abstract: A novel non-generative probabilistic framework is proposed for extractive lecture speech summarization. The rhetorical structure hidden in lecture speech is one of its most underutilized characteristics. A Rhetorical-State Support Vector Machine (RSSVM) is proposed for automatically decoding the rhetorical structure of speech and summarizing the speech data. RSSVM achieves a 57.2% ROUGE-L F-measure, a 5.6% absolute increase in lecture speech summarization performance over a baseline system that uses no rhetorical information. It also outperforms extractive summarization that directly uses the rhetorical structure produced by Rhetorical-State Hidden Markov Models.
INTRODUCTION
Automatic speech summarization is a core technology in spoken document understanding and organization systems. It is the process of digesting source speech data and producing understandable segments (in speech or transcription form) that convey the most informative or relevant information as a substitute for the original speech. Speech summarization remains a young and under-explored field compared with text summarization. Unlike written documents, spoken documents produced by an automatic speech recognition system pose a major challenge: the lack of easily discernible structure. Fonts, sentence and paragraph boundaries, titles and subtitles, and similar cues help convey the underlying meaning of written documents, but such structural clues do not exist in speech data.
Acoustic and linguistic features have been used to build extractive speech summarizers (Chen et al., 2006; Maskey and Hirschberg, 2005; Inoue et al., 2004; Murray et al., 2005). However, those summarizers all ignore the rhetorical structure present in spoken documents and the corresponding speech data. Several researchers (Hori et al., 2002; Maskey and Hirschberg, 2003; Hirohata et al., 2005; Zhang et al., 2007) have suggested that rhetorical structure does exist in spoken documents and that an efficient representation of this information can help the summarization task.
Fung et al. (2008) proposed Rhetorical-State Hidden Markov Models (RSHMM) for representing the rhetorical structure of lecture speech, together with an RSHMM-enhanced probabilistic framework for extractive summarization. This study further develops that probabilistic framework to improve summarization performance.
EXTRACTING RHETORICAL CHARACTERISTICS OF LECTURE SPEECH
Acoustic and linguistic features: Acoustic/prosodic features for speech summarization are usually extracted from the audio data. Researchers commonly use acoustic/prosodic variation (changes in pitch, intensity and speaking rate) and pause duration to tag the important content of speech (Hirschberg, 2002). This study also investigates these features for their efficiency in predicting summary sentences on lecture speech data.
The acoustic feature set contains twelve features: DurationI, SpeakingRate, F0I, F0II, F0III, F0IV, F0V, EI, EII, EIII, EIV and EV. These features are listed in Table 1.
Table 1: Acoustic/prosodic features and feature descriptions
DurationI is calculated from the manual transcriptions that are time-aligned with the audio. SpeakingRate is obtained by phonetic forced alignment with HTK. The F0 and energy features are extracted from the audio data using Praat (Boersma and Weenink, 1996).
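The paper does not spell out the exact definitions of the five F0 (and energy) statistics, so the sketch below assumes a common choice: minimum, maximum, mean, range and slope of the per-sentence contour. The function name and the mapping of F0I–F0V to these statistics are illustrative assumptions.

```python
# Sketch: per-sentence pitch statistics from an F0 contour.
# Mapping F0I..F0V to (min, max, mean, range, slope) is an assumption;
# the paper does not give the exact definitions.

def f0_features(contour):
    """Return (min, max, mean, range, slope) over the voiced frames."""
    voiced = [f for f in contour if f > 0]          # drop unvoiced (F0 = 0) frames
    f_min, f_max = min(voiced), max(voiced)
    f_mean = sum(voiced) / len(voiced)
    # least-squares slope of F0 against the frame index
    n = len(voiced)
    x_mean = (n - 1) / 2
    num = sum((x - x_mean) * (f - f_mean) for x, f in enumerate(voiced))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den if den else 0.0
    return f_min, f_max, f_mean, f_max - f_min, slope
```

The same five statistics can be computed over a frame-level energy contour to obtain EI–EV.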
The lexical feature set contains eight features: LenI, LenII, LenIII, NEI, NEII, NEIII, TFIDF and Cosine. These features are described as follows:
• Len I: the number of words in the sentence
• Len II and Len III: the Len I values of the previous and next sentences
• NE I: the number of named entities in the sentence; NE II and NE III: the NE I values of the previous and next sentences
• TFIDF: tf*idf, with tf and idf defined as in Eq. 1 and 2
• Cosine: the cosine similarity between two sentence vectors
tf_i = n_i / Σ_k n_k        (1)

where n_i is the number of occurrences of the considered word and the denominator is the total number of occurrences of all words in the document.
idf_i = log( |D| / |{s : t_i ∈ s}| )        (2)

where |D| is the total number of sentences in the document and the denominator is the number of sentences in which the word t_i appears.
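The three lexical scores above can be sketched directly from Eq. 1 and 2; this is a minimal illustration assuming pre-tokenized sentences (lists of words) and assuming the queried word occurs in at least one sentence.

```python
import math

def tf(word, doc_words):
    # Eq. 1: occurrences of the word over all word occurrences in the document
    return doc_words.count(word) / len(doc_words)

def idf(word, sentences):
    # Eq. 2: log of the sentence count over the number of sentences
    # containing the word (assumes the word occurs somewhere)
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / containing)

def cosine(u, v):
    # cosine similarity between two sentence vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In practice the sentence vectors fed to `cosine` would hold the tf*idf weights of the vocabulary terms.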
All lexical features are extracted from the manual or ASR transcriptions. For calculating the length features, the lecture speech transcriptions are segmented into Chinese words.
Rhetorical-state hidden Markov models: Fung et al. (2008) proposed a supervised learning method, the Rhetorical-State Hidden Markov Model (RSHMM), for extracting the rhetorical characteristics of lecture speech.
Each sentence s_j of a given document D is annotated with the rhetorical unit i* that approximately maximizes

i* = argmax_{1 ≤ i ≤ R} P( r(s_j) = i | D )        (3)

where r(·) is a mapping function from a sentence to its rhetorical unit and R is the total number of rhetorical units in a single document.
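The per-document decoding behind Eq. 3 can be sketched as a Viterbi search over rhetorical states. The transition, emission and initial probabilities below are illustrative stand-ins, not the trained RSHMM parameters.

```python
# Sketch: Viterbi decoding of a rhetorical-unit sequence for the
# sentences of one document. All probability tables are toy values.
import math

def decode(emissions, transitions, initial):
    """emissions[t][i]: P(sentence t | unit i). Returns the best unit sequence."""
    n_states = len(initial)
    # log-domain scores for the first sentence
    score = [math.log(initial[i]) + math.log(emissions[0][i]) for i in range(n_states)]
    back = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_states):
            best = max(range(n_states), key=lambda i: score[i] + math.log(transitions[i][j]))
            new_score.append(score[best] + math.log(transitions[best][j]) + math.log(emissions[t][j]))
            ptr.append(best)
        score, back = new_score, back + [ptr]
    # trace back the best state sequence
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

Left-to-right transition structure (a lecture rarely returns to an earlier rhetorical unit) would be encoded by zeroing the backward transitions.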
EXTRACTIVE SUMMARIZATION USING RHETORICAL STRUCTURE
A common summarization approach, extractive summarization, is adopted: a summary is composed by selecting salient sentences or segments from a given speech. This section briefly describes the RSHMM-enhanced SVM for extractive lecture speech summarization (Fung et al., 2008) and then proposes a novel non-generative probabilistic framework, the Rhetorical-State SVM, to further improve the summarization process.
RSHMM-enhanced SVM: Under the probabilistic framework, the extractive summarization task amounts to estimating

P( c(s_j) = 1 | s_j, r(s_j) = i* )        (4)

where c(·) is the salient-sentence classification function and i* is obtained by Eq. 3. A sentence s_j is then predicted to be a summary sentence or not by comparing this probability against a threshold θ:

s_j ∈ summary  ⇔  P( c(s_j) = 1 | s_j, r(s_j) = i* ) > θ        (5)
The summarizer is modeled by an SVM classifier with a Radial Basis Function (RBF) kernel, as implemented in LIBSVM (Chang and Lin, 2001):

K(x_i, x_j) = exp( -γ ||x_i - x_j||² )        (6)

One SVM classifier is built for each rhetorical unit in the RSHMM network.
Rhetorical-state SVM: To make further use of rhetorical information, a novel probabilistic framework, the Rhetorical-State SVM (RSSVM), is designed for the summarization task. By conditional probability theory, the summary posterior of a sentence can be decomposed over the rhetorical units:

P( c(s_j) = 1 | s_j ) = Σ_{i=1..R} P( c(s_j) = 1, r(s_j) = i | s_j )        (7)

Considering that the RSHMM network contains R rhetorical units in total, one RSSVM is built as an (R+1)-class SVM classifier with an RBF kernel for the whole RSHMM network: R classes correspond to summary sentences in each of the rhetorical units, and the additional class corresponds to non-summary sentences. Finally, those sentences that satisfy the following criterion are predicted to be summary sentences:

Σ_{i=1..R} P( c(s_j) = 1, r(s_j) = i | s_j ) > θ        (8)
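The RSSVM decision rule reduces to a simple marginalization over the (R+1)-class posterior. In this sketch the posterior vector is given directly; in the actual system it would come from the multi-class SVM, and the class layout (index 0 = non-summary) is an assumption.

```python
# Sketch of the RSSVM decision (Eq. 7-8): classes 1..R mean
# "summary sentence in rhetorical unit i", class 0 means "non-summary".

def is_summary(posterior, threshold=0.5):
    """posterior: [P(non-summary), P(summary, unit 1), ..., P(summary, unit R)].
    Marginalize the summary probability over rhetorical units and threshold it."""
    return sum(posterior[1:]) > threshold
```

Unlike the RSHMM-enhanced SVM, no hard rhetorical-unit decision i* is needed before classification; the unit assignment is summed out.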
EXPERIMENTAL SETUP
The lecture speech corpus contains the wave files of 111 presentations recorded at the NCMMSC2005 and NCMMSC2007 conferences, together with the slides (Microsoft PowerPoint) and manual transcriptions. Each presentation lasts about 15 min on average and was automatically divided into, on average, 220 segment units. The ASR system runs in multiple passes and performs unsupervised acoustic model adaptation as well as unsupervised language model adaptation (Chan et al., 2007), achieving 70.3% recognition accuracy.
EXPERIMENT RESULTS
In the summarization experiments, 70 presentations are taken from the lecture speech corpus: 60 presentations (3,220 sentences) are used as training data and the remaining 10 presentations (458 sentences) as test data. Importantly, the sentence boundaries of all transcriptions are re-segmented manually; one sentence may contain several segment units, so each presentation contains about 50 sentences. The reference summaries are compiled based on two assumptions: first, that a good summary should consist of salient sentences from each of the rhetorical units (e.g., the title, introduction, background, methodology, experiments and conclusion sections of a conference presentation); and second, that slide (PowerPoint) sentences are good summaries of lecture speech presentations.
The extractive summarization experiments are performed on the (R = 3) RSHMM network. A binary SVM classifier without rhetorical information is also built as the baseline system. The summarizers' performance is evaluated with the ROUGE metric (Recall-Oriented Understudy for Gisting Evaluation), which measures overlapping units between automatic summaries and reference summaries. ROUGE-L (summary-level longest common subsequence) precision, recall and F-measure are used (Lin, 2004). The results are shown in Table 2.
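The LCS-based scoring behind ROUGE-L can be sketched as follows, shown here for a single candidate/reference pair following Lin (2004); the summary-level variant used in the paper additionally unions LCS matches over reference sentences.

```python
# Sketch: ROUGE-L precision, recall and F-measure via the
# longest common subsequence of two token lists.

def lcs_len(a, b):
    # standard dynamic-programming LCS length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.0):
    """Return (precision, recall, F) of the candidate against one reference."""
    l = lcs_len(candidate, reference)
    p, r = l / len(candidate), l / len(reference)
    f = (1 + beta ** 2) * p * r / (r + beta ** 2 * p) if p + r else 0.0
    return p, r, f
```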
Table 2: Evaluation by ROUGE-L F-measure of summarization performance on the manual sentence segmentation transcriptions using the three-rhetorical-unit RSHMM network. Baseline: the single SVM without rhetorical information; RSHMM: RSHMM-enhanced SVM; Ac: acoustic features; Le: lexical features
This study finds that the Rhetorical-State SVM summarizer consistently outperforms both the baseline system and the RSHMM-enhanced SVM summarizer. In addition, the RSHMM-enhanced summarizer consistently outperforms the baseline system, which agrees with the conclusion of the previous work (Fung et al., 2008). The best performance is achieved by the Rhetorical-State SVM summarizer using both acoustic and linguistic features: a ROUGE-L F-measure of 0.572, 5.6% (absolute) higher than the best baseline performance and better than the RSHMM-enhanced SVM summarizer.
Table 2 also shows that the RSSVM summarizer achieves good performance, a ROUGE-L F-measure of 0.561, using only lexical features extracted from automatic transcriptions with 70.3% ASR accuracy. This suggests that the RSSVM summarizer can reduce the effect of recognition errors on extractive summarization compared with the RSHMM-enhanced SVM summarizer.
CONCLUSION
This study has presented a novel probabilistic framework, the Rhetorical-State SVM, for extractive summarization of lecture speech. RSSVM can automatically decode the underlying rhetorical information in lecture speech and summarize the speech data. Within this framework, the RSSVM summarizer produced a ROUGE-L F-measure of 57.2%, a 5.6% absolute increase in lecture speech summarization performance over the baseline system that uses no rhetorical information. The RSSVM summarizer also outperforms the RSHMM-enhanced SVM, which is built directly on the rhetorical structure extracted by RSHMM. Moreover, using only lexical features extracted from automatic transcriptions, the RSSVM still achieved a ROUGE-L F-measure of 0.561. This finding suggests that the RSSVM summarizer can reduce the effect of recognition errors on extractive summarization compared with the RSHMM-enhanced SVM summarizer.
ACKNOWLEDGMENT
This study is supported by the State Key Program of National Natural Science Foundation of China (U0935003), the Natural Science Foundation of Guangdong Province of China (Grant No. S2012040007560) and the Foundation of Guangdong Educational Committee (Grant No. 2012KJCX0099).