The vast amount of information readily available online leads
to the problem of information overload, due to the large quantity of irrelevant
and redundant information contained in these documents (Kwee
and Tsai, 2009). This information can be in the form of blogs (Tsai
and Chan, 2007), social networks (Tsai et al.,
2009), mobile information content (Tsai et al.,
2010a), databases (Ong et al., 2009) and
even Web services (Yee et al., 2009). To effectively
alleviate this problem, Novelty Mining (NM), or novelty detection, has been
proposed to retrieve novel yet relevant information, based on the specific topic
defined by a user. In past studies (Tsai and Chan, 2010a;
Zhang et al., 2002), novelty was defined as the
opposite of redundancy. Given a set of relevant documents, any document which
is very similar to any of its history documents is regarded as redundant. Therefore,
most subsequent contributions have focused on sentence-level novelty mining
(Allan et al., 2003; Kwee
et al., 2009; Tsai and Chan, 2010b; Zhang
et al., 2010). This study surveys intelligent novelty mining techniques
for overcoming the information overload problem.
NOVELTY MINING TECHNIQUES
Novelty mining detects novel information in relevant documents on a given topic.
The major components in novelty mining are (1) novelty scoring, (2) novelty
decision making and (3) performance evaluation, as shown in Fig.
1. A novelty score is a calculated value that determines the novelty of
a document and depends largely on the novelty metric that is selected. A novelty
metric is a function which defines a distance between points in space.
||Fig. 1: Novelty mining components
Many metrics used in Information Retrieval (IR) have been adopted for novelty
mining based on the assumption that novelty is the opposite of similarity or
redundancy (Tsai and Chan, 2010a). The novelty of any
document is quantitatively measured by the novelty metric based on its history
documents (the documents that the user has read) and represented by the novelty
score. The final decision on whether a document is novel depends on whether
the novelty score falls above or below a novelty decision point, the
boundary value or threshold that determines this decision.
Novelty scoring: To compare a relevant document with its history
documents, several different geometric distance measures, or metrics, can be
used, such as the Manhattan distance and the cosine distance. Depending on whether
the ordering of the documents is taken into consideration, metrics can be either
symmetric or asymmetric. Symmetric metrics, such as cosine and Jaccard, yield the
same result regardless of the ordering of the two documents, whereas the results of
asymmetric metrics, such as new word count and overlap, depend on the ordering
of the two documents.
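To make the distinction concrete, the following sketch implements a cosine similarity (symmetric) and a new-word-count score (asymmetric) over simple bag-of-words representations; the naive tokenizer and function names are our own simplifications, not the exact formulations used in the cited studies.

```python
from collections import Counter
import math

def tokenize(text):
    # Naive whitespace tokenizer; real systems add stemming and stopword removal.
    return text.lower().split()

def cosine_similarity(a, b):
    """Symmetric metric: the result is the same if a and b are swapped."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values()))
    norm *= math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def new_word_count(current, history):
    """Asymmetric metric: counts words of `current` unseen in `history`."""
    return len(set(tokenize(current)) - set(tokenize(history)))

history = "the company released its quarterly earnings report"
current = "the quarterly report shows record earnings and new products"
print(cosine_similarity(current, history))  # unchanged if arguments swap
print(new_word_count(current, history))     # changes if arguments swap
```

Swapping the two arguments of cosine_similarity leaves the result unchanged, whereas swapping the arguments of new_word_count generally changes it, which is exactly the symmetric/asymmetric distinction.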
To measure the degree of novelty directly, the similarity score is
converted to a novelty score simply as (1 - similarity score). There are two
standard themes of comparison techniques in previous works. The first is one-to-one
comparison, where the current document is compared with each of the previous
documents; the maximum of the resulting redundancy (or similarity) scores is
then compared against a decision point a to decide whether the
current document is redundant (Zhang and Tsai, 2009b).
If the maximum redundancy score exceeds a, the current document is detected
as redundant. The other theme is all-to-one comparison, where the current document
is compared to the pool of all the previous documents, in order to generate
the redundancy score. The simple similarity method and the overlap method are
based on the one-to-one comparison paradigm, while the all-to-one paradigm is
adopted for the simple pool method and the interpolated aggregate smoothing method.
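As an illustration of the one-to-one paradigm, the sketch below (reusing the cosine_similarity helper above; the threshold value is illustrative, not one prescribed by the cited studies) flags a document as redundant when its maximum similarity to any history document exceeds the decision point a:

```python
def novelty_score(current, history_docs):
    """Novelty as (1 - maximum similarity) to any history document."""
    if not history_docs:
        return 1.0
    return 1.0 - max(cosine_similarity(current, h) for h in history_docs)

def is_redundant(current, history_docs, decision_point=0.8):
    """One-to-one comparison: redundant if the maximum similarity over the
    history documents exceeds the decision point a (0.8 is illustrative)."""
    if not history_docs:
        return False
    return max(cosine_similarity(current, h) for h in history_docs) > decision_point
```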
Novelty decision setting: After obtaining the novelty score of the incoming document, the system makes a final decision on whether the document is novel based on the novelty decision point: if the novelty score of the document is above the decision point a, the document is considered novel.
Performance evaluation: The F-score is a popular evaluation measure
that is used for evaluating the results of novelty mining (Zhang
et al., 2010), as well as information retrieval for measuring search,
blog classification and query classification performance. The F-score (also
F1 score or F-measure) combines precision and recall: precision P is the number
of correct results divided by the number of all returned results and recall R
is the number of correct results divided by the number of results that should
have been returned. The F1 score is the harmonic mean of the two, F1 = 2PR/(P + R).
It can be interpreted as a weighted average of precision and recall, reaching
its best value at 1 and its worst at 0; the higher the F-score, the more
accurate the system. Thus, the F-score should be maximized.
Precision, recall and F-score are used to evaluate how well the system performs
in detecting novel documents. On the other hand, to find out how well the system
can detect non-novel or redundant documents, Redundancy-Precision (RP), Redundancy-Recall
(RR) and redundancy F-score (RF) can be used instead (Tsai
and Zhang, 2010).
Based on all the topics' P, R, F, RP, RR and RF, the average performance can be obtained by calculating the arithmetic mean of these per-topic scores.
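A minimal sketch of these computations (with returned and ground-truth novel items represented as sets; the helper names are ours):

```python
def precision_recall_f1(returned, truth):
    """P, R and F1 from sets of returned and ground-truth novel items."""
    correct = len(returned & truth)
    p = correct / len(returned) if returned else 0.0
    r = correct / len(truth) if truth else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

def macro_average(per_topic_scores):
    # Average performance: the arithmetic mean of the per-topic scores.
    return sum(per_topic_scores) / len(per_topic_scores)
```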
Sentence-level novelty mining: Many studies on sentence-level
novelty mining originated from the TREC Novelty Tracks (Harman,
2002; Soboroff and Harman, 2003; Soboroff,
2004). The novelty track was introduced in the eleventh Text REtrieval Conference
(TREC) in 2002. This track was designed to investigate systems' abilities in
locating new and relevant information from a given document set which is categorized
into topics. Note: although the TREC Novelty Track uses the term novelty detection,
the use of novelty mining (Kwee and Tsai, 2009; Zhang
and Tsai, 2009a; Zhang et al., 2010) is preferred
in order to avoid confusion with novelty detection in outlier detection (Hoffmann,
2007; Markou and Singh, 2003), which is based on
one-class classification and is very different from novelty detection in text.
The novelty track document set was developed using the AQUAINT collection.
The task of novelty mining was clearly defined as follows: given a topic and
an ordered set of documents segmented into sentences, return sentences that
are both relevant to the topic and novel given what has already been seen (Soboroff,
2004). To fulfill the task, there are essentially two steps. The first is
identifying relevant sentences, which is essentially a passage retrieval task.
The second step is detecting those relevant sentences that contain enough novel information.
The following four tasks, used in both 2003 and 2004, allowed the participants
to test their approaches to novelty mining:
||Task 1: Given the set of documents for the topic, identify
all relevant and novel sentences
||Task 2: Given the relevant sentences in all documents, identify
all novel sentences
||Task 3: Given the relevant and novel sentences in the first 5 documents
ONLY, find the relevant and novel sentences in the remaining documents
||Task 4: Given all relevant sentences from all documents and the
novel sentences from the first 5 documents, find the novel sentences in
the remaining documents
As the main objective is to identify all novel sentences, this survey focuses
on Task 2. F-score (F) was the primary performance measure used to
evaluate the systems' novelty mining capabilities in this task.
||Table 1: Statistics of benchmark sentence-level data
Novelty benchmark datasets
TREC sentence-level novelty data: Currently, the publicly available datasets
for sentence-level novelty mining come from the TREC Novelty Tracks 2002 to 2004:
the TREC 2002, 2003 and 2004 Novelty Track data. The TREC 2002 Novelty Track
data is usually excluded because it contains too few redundant sentences: in
23 out of the 50 topics, all relevant sentences are marked as novel.
Therefore, a baseline algorithm that marks every sentence as novel will perform
almost perfectly on this dataset. Moreover, such a dataset is unrepresentative
of the real world, where sentences and documents contain many redundancies.
TREC 2003 and 2004 Novelty Track data are developed from the AQUAINT collection.
The news providers of the document set are Xinhua English (XIE), New York Times
(NYT) and Associated Press Worldstream (APW). Both relevant and novel sentences
from the National Institute of Standards and Technology (NIST) TREC document
set are selected by TREC's assessors. The statistics of these two datasets are
summarized in Table 1, where the novelty percentage is the
percentage of novel sentences in the dataset.
APWSJ document-level novelty data: For document-level novelty mining,
the APWSJ novelty data (Zhang et al., 2002) is
the public dataset that consists of news articles from Associated Press (AP)
and Wall Street Journal (WSJ). There are 50 topics from Q101 to Q150 in APWSJ.
Previous studies (Zhang and Tsai, 2009a) excluded the
5 topics (Q131, Q142, Q145, Q147, Q150) which lack human redundancy assessments
(all documents were novel). The assessors provide two degrees of judgement
on non-novel documents: absolute redundant and somewhat redundant. Most experiments
adopt the strict definition used by Zhang et al.
(2002) where only absolute redundant documents are regarded as non-novel.
There are 11896 documents on 50 topics. After sentence segmentation, these documents
have 319616 sentences in all. The APWSJ data contains a total of 10839 (91.1%)
novel documents and 1057 (8.9%) non-novel documents.
TREC 2003/4 document-level novelty data: For document-level novelty
mining, another two datasets were created from the sentence-level TREC 2004/2003
Novelty Tracks, namely document-level TREC 2004 and document-level TREC 2003
(Tang et al., 2010; Tsai and Zhang, 2010).
||Table 2: Statistics of BizBlogs document-level data
In order to obtain the documents, the sentences were first combined into documents
according to their document id.
Because we already have the ground truth for the novelty of each TREC sentence, we can easily calculate the NovelRate for each document, which is the actual percentage of novel sentences in that document. If we set a low NovelRate decision point, most documents in this dataset are considered to be novel. This means that this is a dataset with a high percentage of novel documents. When setting different NovelRate decision points, we can observe performances on datasets with different percentages of novel documents.
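For illustration, a minimal sketch of this NovelRate labelling (the function names and data layout are our own assumptions):

```python
def novel_rate(sentence_labels):
    """Fraction of a document's sentences whose ground truth is novel (1)."""
    return sum(sentence_labels) / len(sentence_labels)

def label_documents(docs, decision_point):
    """A document counts as novel when its NovelRate exceeds the decision
    point; `docs` maps document ids to lists of 0/1 sentence labels.
    Lowering the decision point raises the percentage of novel documents."""
    return {doc_id: novel_rate(labels) > decision_point
            for doc_id, labels in docs.items()}
```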
BizBlogs document-level novelty data: The novelty BizBlogs dataset (Liang
et al., 2009), a novelty annotated version of the BizBlogs07 (Chen
et al., 2007) dataset of business blogs, can also be used in novelty
mining experiments. The dataset contains 1269 blog posts in four categories:
product, company, marketing and finance (Chen et al.,
2007). The product category deals with description or review of specific
company products, as well as other product-related news. The company category
deals with news or other information about corporations, organizations or businesses.
The marketing category includes blogs talking about the marketing, sales and
advertising strategies of a company. The finance category relates to financing,
funding and loans, financial statements, cash flow and credit information (Chen
et al., 2007).
Table 2 summarizes the percentage of novel blogs for the four categories as well as the overall blogs. As seen from the table, the majority of the blogs were assessed as novel, which implies that there is little overlap in the content of the blogs for this particular dataset.
INTELLIGENT NOVELTY MINING TECHNIQUES
Considering the diverse and changing scenarios of the real world, this section
describes techniques for intelligent novelty mining by bridging the gap between
the existing novelty mining methods and user performance requirements. Intelligent
novelty mining addresses the domain-specific problem of mining novel information
from text data with specific regard to the user context and aims to balance
the technical significance and business concerns to create techniques that are
useful in real-world scenarios. These techniques aim to adapt to the users'
desired level of novel information and human interaction. By addressing the
issues of intelligent novelty mining, the techniques are useful from both the
technical and business perspectives.
Metrics and models for novelty scoring: In past work, a thorough
comparative study was performed on different types of novelty metrics: symmetric
(i.e., cosine and Jaccard similarities) and asymmetric (i.e., new word
count and overlap) (Tsai et al., 2010b, c).
Complementary behavior was observed in the symmetric and asymmetric novelty
metrics and a new framework of novelty measurement, a mixture of both types
of novelty metrics, was proposed. The experimental results showed the superior
performance of this new framework under different performance requirements and
for data with different percentages of novel sentences (Tsai
et al., 2010c). This method effectively avoided the significant performance
drop that can occur when using individual metrics in either high-recall or
high-precision novelty mining. Furthermore, this new framework can be conveniently
applied to novelty mining algorithms with a range of individual novelty
metrics. Moreover, because it does not require any prior information from the data,
it is very suitable for real-world applications.
In addition, another novelty model, document-to-sentence
(D2S), was proposed for document-level novelty mining (Tsai
and Zhang, 2010). This model can make document-level novelty mining more
effective by adopting sentence-level techniques. Experiments on
document-level APWSJ data show that D2S can significantly improve the document-level
novelty mining performance in terms of redundancy-precision and redundancy-recall,
achieving higher redundancy-precision and redundancy-recall than the cosine
benchmark. Through experiments on document-level TREC 2004 and TREC 2003 datasets,
they also found that it was significantly better to use D2S when the percentage
of novel documents is high.
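The following is a simplified sketch in the spirit of D2S, reusing the sentence-level novelty_score helper above; the aggregation rule and threshold are illustrative assumptions rather than the published algorithm. A document's novelty is taken as the fraction of its sentences that are novel with respect to all previously seen sentences:

```python
def d2s_novelty(document_sentences, history_sentences, sent_threshold=0.6):
    """Document-to-sentence style scoring: the document's novelty is the
    fraction of its sentences judged novel against all sentences seen so
    far. The final document decision compares this value against a
    document-level decision point."""
    novel = sum(
        1 for sent in document_sentences
        if novelty_score(sent, history_sentences) > sent_threshold
    )
    return novel / len(document_sentences)
```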
Other studies utilized models for novelty scoring with named entity extraction
and Part-of-Speech (POS) tagging (Ng et al., 2007;
Zhang and Tsai, 2009b). Ng et
al. (2007) determined the novelty score of each sentence by using two
different metrics, Unique comparison and Importance value. Unique comparison
calculated the number of matched entities, whereas Importance value took into
account the total weight of matched words that coexisted in both the test and
history sentences. The results were promising when compared to the benchmark
scores for TREC 2004. Zhang and Tsai (2009b) proposed
a new mixed method that treated the sentence novelty score as the novelty score
between entities of sentences and the novelty score between other significant
words, which can improve the novel sentence detection performance.
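A sketch in the spirit of such entity-based scoring is given below, reusing the cosine_similarity helper from the earlier sketch. The toy entity extractor and the linear blend of entity-level and word-level novelty are illustrative assumptions, not the exact method of Zhang and Tsai (2009b):

```python
def extract_entities(text):
    # Stand-in entity extractor: treats capitalized tokens as entities.
    # A real system would use a trained named-entity recognizer.
    return {tok.strip(".,") for tok in text.split() if tok[:1].isupper()}

def mixed_novelty(current, history, entity_weight=0.5):
    """Blend an entity-level novelty score with a word-level one.

    The linear blend and its 0.5 weight are illustrative choices.
    """
    cur_ents = extract_entities(current)
    new_ents = cur_ents - extract_entities(history)
    entity_score = len(new_ents) / len(cur_ents) if cur_ents else 0.0
    word_score = 1.0 - cosine_similarity(current, history)
    return entity_weight * entity_score + (1 - entity_weight) * word_score
```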
Adaptive novelty decision: Tang and Tsai (2009)
addressed the important problem of setting an adaptive decision point by utilizing
the user's feedback over time, which has rarely been addressed in novelty mining.
An algorithm was proposed which can be tuned according to different performance
requirements, varying from high-precision to high-recall.
In the experimental study, the novelty mining algorithm was tested on both document-level and sentence-level novelty mining data. With complete user feedback, the experimental results showed the promising performance of the adaptive decision setting for novelty decision making for a real-time novelty mining algorithm. In order to test the adaptive decision setting on a more practical level, the algorithm was tested with partial feedback. The experimental results indicate that the adaptive decision setting can work robustly with partial feedback from the user. The adaptive decision setting can therefore be employed for realistic situations of varying performance requirements (high-precision/recall) as well as varying degrees of user feedback.
In another set of experiments, the novelty mining algorithm employing the adaptive
decision setting was tested with complete user feedback on data with
low, medium and high novelty ratios (percentage
of novel sentences/documents) (Tang et al., 2010).
The experimental results show that the adaptive decision setting is very effective
in finding the best decision point in the novelty mining algorithm. Therefore,
the adaptive decision setting has solved the important problem of adapting to
the different performance requirements of different topics and users, which is
a significant issue for intelligent novelty mining.
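One simple way such an adaptive decision point could be derived from accumulated user feedback is a grid search for the threshold that maximizes F1 on the feedback collected so far; this sketch is our illustration, not the published algorithm:

```python
def best_threshold(scored_feedback, candidates=None):
    """Pick the decision point that maximizes F1 on feedback so far.

    `scored_feedback` is a list of (novelty_score, is_novel) pairs
    collected from the user's feedback over time.
    """
    if candidates is None:
        candidates = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00

    def f1_at(t):
        returned = {i for i, (s, _) in enumerate(scored_feedback) if s > t}
        truth = {i for i, (_, lab) in enumerate(scored_feedback) if lab}
        correct = len(returned & truth)
        p = correct / len(returned) if returned else 0.0
        r = correct / len(truth) if truth else 0.0
        return 2 * p * r / (p + r) if (p + r) else 0.0

    return max(candidates, key=f1_at)

# Example with hypothetical feedback pairs: (score, user said novel?)
feedback = [(0.9, True), (0.7, True), (0.4, False), (0.2, False)]
print(best_threshold(feedback))  # a point separating novel from non-novel
```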
Users' context for novelty feedback: Recently, there has been increased interest in exploiting contextual information for data mining. Contextual aspects may intervene at several steps of the novelty mining process and it is important to define how these aspects interact within the process and the forms this interaction can take. During the novelty mining task, context may be used as a set of constraints that enables the development of more efficient algorithms. Context information and domain knowledge may also intervene in the post-processing step to help explain the results.
An important issue in context-aware novelty mining is performance evaluation.
It is commonly accepted that the traditional evaluation methodologies used in
TREC may not always be suitable for considering the contextual dimensions in
the novelty mining process. Indeed, laboratory-based or system oriented evaluation
is challenged by the presence of contextual dimensions such as user interaction,
profile or environment, which significantly impact the relevance judgments
or usefulness ratings made by the end user. Therefore, we need to understand
how to overcome the challenge of user-oriented evaluation and to design novel
evaluation methodologies and criteria for contextual information retrieval evaluation.
For example, in the TREC Novelty Track, the F score was used as the primary
evaluation measure (Soboroff, 2004), which actually assumes
that the user wants to keep a balance between returning only the most novel
information (i.e., the high-precision case) and not missing any novel
information (i.e., the high-recall case).
A user's context in novelty mining is defined as his/her specific requirements, which can be one or more of the following:
Level of novelty: Different users have different definitions of novel information. For example, one user might regard a sentence with 50% novel information as novel, while another might only regard a sentence with 80% novel information as novel. The threshold on novelty scores should be higher for the user with the stricter definition of a novel sentence. As novelty mining is an accumulating system, more training information becomes available for threshold setting, based on the user's feedback given over time.
Performance requirement: The user's performance requirements can tell
us the specific concerns about the novelty mining output. For example, when
the user does not want to miss any novel information, a high-recall system which
only filters out the most redundant sentences is desired. On the other hand,
when the user wants to read the sentences with the most significant novelty
in a short time, a high-precision system which only retrieves highly novel sentences
is preferred. It has been acknowledged that there is a trade-off between recall
and precision; thus, users need to decide on which performance measure to optimize.
The trade-off between recall and precision can be adjusted by the novelty threshold.
A high F-score is set as the primary performance requirement in previous studies
(Soboroff, 2004), based on the assumption that the user
wants to keep a balance between high precision and high recall. This assumption
may not hold for most users, but it can serve as a default when the user's
preference is not known in advance; one standard way to encode such a preference
is sketched after this list.
Specific domain interest: Last but not least, understanding the user's context is very helpful for the system to set the mining target. For example, a user who is interested in financial news may be sensitive to numbers, currency symbols, etc. To make use of the user's contextual information, we have to answer several questions: what contextual information can be acquired in novelty mining, how to annotate the context automatically and how to design the user interface for capturing the user's context.
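As noted under the performance requirement above, a standard way to encode a precision/recall preference (an assumption on our part; the cited studies do not necessarily use this exact form) is the weighted F-measure F_beta, where beta > 1 favours recall and beta < 1 favours precision:

```python
def f_beta(p, r, beta=1.0):
    """Weighted F-measure: beta > 1 emphasizes recall (miss less novel
    information), beta < 1 emphasizes precision (return only highly
    novel sentences); beta = 1 recovers the balanced F1 used in TREC."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)
```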
Tang et al. (2009) addressed the problem of accessing
the user's context in novelty mining via novelty feedback, an important problem
that has not been addressed by traditional novelty mining algorithms, which
regard novelty as the opposite of similarity. The technique, contextual information
annotation of named entities algorithm, was proposed to solve the issue of detecting
novelty in highly similar sentences and was developed to interact with the user's
novelty feedback. The algorithm is useful for users interested in detecting
changes in number, location, person, or time and can be generalized for other
named entities as well. The linguistic-level information, i.e., named entities,
was used to accommodate the user's context. In the algorithm, the user can feed
back his/her preferred significant information to the algorithm directly.
The experiments on different data showed that the method can solve the important
problem of mining novel information in highly similar sentences at the linguistic level.
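A minimal sketch of how such entity-level feedback might be operationalized (the rule and the toy number tagger are our illustrative assumptions, not the published algorithm): a sentence that is highly similar to a history sentence can still be flagged as novel when entities of a user-preferred type, numbers in this example, have changed:

```python
import re

def numbers_in(text):
    # Toy tagger for one user-preferred entity type: numbers.
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def novel_despite_similarity(current, history_sent, sim, sim_threshold=0.9):
    """Flag a near-duplicate sentence as novel when user-preferred
    entities (here, numbers) differ from the history sentence."""
    if sim < sim_threshold:
        return None  # not a near-duplicate; defer to the usual novelty scoring
    return numbers_in(current) != numbers_in(history_sent)

# Same wording, different figure: novel despite high similarity.
h = "The company reported revenue of 120 million dollars."
c = "The company reported revenue of 150 million dollars."
print(novel_despite_similarity(c, h, sim=0.95))  # True
```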
Summary of intelligent novelty mining: In relation to previous studies, intelligent novelty mining techniques are more suitable for real-world novelty mining. Other techniques cannot adapt to different users who have different definitions of novel information, whereas the adaptive techniques can. Intelligent techniques can also address differences in users' performance requirements, whether they require a high-recall system that filters out only the most redundant sentences, a high-precision system that retrieves only highly novel sentences, or a balance of both. In addition, these techniques make use of the user's contextual information in novelty mining, which was not addressed in previous work. In summary, the intelligent techniques described in this study are able to leverage novelty metrics, novelty decisions and novelty feedback to improve the results of mining new information from text data and so overcome the challenge of information overload.
This study reviewed intelligent novelty mining techniques for alleviating the information overload problem. Intelligent novelty mining addresses the domain-specific problem of mining novel information from text data with specific regard to the user context and aims to balance technical significance and business concerns to create techniques that are useful in real-world scenarios. These techniques leverage novelty metrics, novelty decisions and novelty feedback to improve the results of mining new information from text data.