Searching On-Line Literature Digital Libraries (OLDLs) efficiently and effectively is becoming increasingly important as the size and use of OLDLs grow rapidly. Consider the following three OLDLs as examples from the computer science, life sciences and electrical engineering fields:
||In computer science, the ACM Digital Library (ACM) has around
one million full-text publications collected over fifty years, all available
to search and download
||In electrical engineering and computer science, IEEE Xplore (IEEE) is
another OLDL that provides its users with web access to more than
1,700 selected conference proceedings
||ScienceDirect (ScienceDirect) is the world's leading scientific,
technical and medical information resource; launched in 1999, it celebrated
its billionth article download in November 2006
From these numbers, one may conclude that providing accurate publication
importance scores and ranking search results accurately can significantly reduce
the time OLDL users spend searching. Furthermore, accurate and effective publication
rankings are also useful for comparative assessments of publications, publication
venues and research institutes such as universities. Moreover, properly ranking
authors' publications may help in comparatively evaluating scientists as well.
At present, OLDLs lack effective and accurate publication rankings
(Ratprasartporn et al., 2007). For instance,
the ACM Digital Library returns search results in an unexplained ranking,
which makes the ranking of little use to users (ACM). Moreover, search outputs
of OLDLs tend to suffer from a high level of topic diffusion, which
is defined as having a large number of search results from multiple topics that
are not of interest to the current user (Ratprasartporn
and Ozsoyoglu, 2007; Voorhees and Buckley, 2002).
The topic diffusion problem occurs because keyword-based searches produce a
large number of publications over a relatively large number of topics, thereby
producing publication importance scores that are not specific to topics (Ratprasartporn
and Ozsoyoglu, 2007; Voorhees and Buckley, 2002).
Using social networks or bibliometrics, a number of publication score functions
have been defined in the literature (Brin and Page, 1998;
Kleinberg, 1998; Bani-Ahmad et al., 2005a).
Bani-Ahmad et al. (2005b) comparatively evaluated several citation-based
publication score functions, including (1) PageRank, proposed by Brin and Page (1998),
(2) authority scores, proposed by Kleinberg (1998),
both adopted from the web research domain and (3) citation-count scores from
the bibliometrics research domain (Chakrabarti, 2003).
Bani-Ahmad et al. (2005a, b)
observed that all three score functions suffer from the separability problem;
that is, none of these scoring functions assigns scores that distribute well
over a given scale, e.g., [0, 1]. Instead, the score distributions of the three
publication score functions are found to be highly skewed (Bani-Ahmad
et al., 2005a, b) and decay very fast (Redner,
2004; Bani-Ahmad et al., 2005a, b),
resulting in a much less useful comparative publication assessment capability.
This lack of separability is caused by the rich-get-richer phenomenon identified
in (Redner, 2004; Li and Chen, 2003).
The rich-get-richer phenomenon refers to the observation that a very small number
of publications have relatively high numbers of in-citations and that those
highly cited publications have even higher chances of receiving new citations.
Further studies show that these citation-based scoring functions are also not
very accurate, probably because of topic diffusion in search outputs (Haveliwala, 2002).
The research-evolution model proposed by Aya et al.
(2005) suggests that citation relationships between research publications
produce multiple small pyramid-like structures, where each pyramid represents
a set of publications related to a highly specific research topic.
A research pyramid is defined (Aya et al., 2005)
as a set of publications that represent a highly specific research topic and
that usually has a pyramid-like structure in terms of its internal citation graph.
Publications within an individual research pyramid (1) are motivated by earlier
publications in the topic area (e.g., this paper is motivated in part by
Ratprasartporn et al. (2007) and Aya et al. (2005)) or (2) use techniques
proposed in publications from other research pyramids (e.g., this study uses in
part some of the techniques presented by Brin and Page (1998) and Kleinberg
(1998)). Other reasons for citations may also be observed (Aya
et al., 2005).
In this study, our goals are to (1) provide a solution to the OLDL search output ranking problem due to the topic diffusion problem, by grouping search outputs at the most-specific (detailed) topic level and without identifying the topics themselves, (2) eliminate the low separability problem of score functions and (3) improve the accuracy of three score functions, namely, PageRank, authorities and citation count score functions. Our approach uses the research pyramid (RP-) model to improve the separability and accuracy of publication scores and is based on normalizing publication scores within a limited scope, namely, within individual research pyramids. These improvements come from the fact that publications are now compared to their peers within their peer groups, namely, their own research pyramid publications that are on the same topic.
This study proposes and empirically evaluates two approaches to identifying research
pyramids. The first, called LB-IdentifyRP, uses link-based research pyramid
identification, which captures research pyramids by identifying pyramid-like
structures in the citation graph of the publication set. The second approach,
called PB-IdentifyRP, uses proximity-based research pyramid identification,
which utilizes a graph-based proximity measure, namely SimRank (Jeh
and Widom, 2002), to compute similarities between publications and then
restructures the k most similar publications into a research pyramid.
This study's contributions are:
||Validate the research pyramid model of research evolution
||Propose and evaluate two algorithms to identify research pyramids
||Improve publication scores in terms of accuracy and separability via publications
As a testbed, we have utilized AnthP, a publication set of 14,891 publications
from the ACM SIGMOD Anthology. Our experimental results show that:
||The complete publication citation graph (of AnthP) is highly clustered
||Each cluster of the complete publication set has a pyramid-like structure
in terms of the citation graph of the cluster
||Each cluster represents a highly specific research topic. Note that the
above three findings validate the research pyramid model proposed by Aya
et al. (2005)
||Topic similarities decay over both the citation age and citation paths
||We used the two topic similarity decay curves to guide the RP construction:
||Within RP citation graphs, the average number of in citations per paper
varies, pointing to the importance of comparative publication scores within
||Publication scores within RPs are accurate, due to our approach where
each publication is compared only to its peer (research pyramid paper) group
CITATION-BASED PUBLICATION SCORES
Existing citation-based publication score functions are all based on the notion
of prestige in social networks (Wasserman and Faust, 1994)
and bibliometrics (Chakrabarti, 2003). In this study, we use the following
publication score functions:
PageRank (Brin and Page, 1998) algorithm:
The PageRank score PPageRank of a publication P is computed recursively as the normalized sum of the PageRank scores of the documents citing P.
Authority score of the HITS (Hyperlink-Induced Topic Search) algorithm (Kleinberg, 1998):
Each document P gets two scores, namely a hub score and an authority score. The hub score of P is computed by summing the authority scores of the publications that P cites, and the authority score of P, denoted by PAuth, is computed by summing the hub scores of the publications citing P.
Normalized citation count score:
For a paper P that receives CP citations, the normalized citation count PCitCount is the ratio of CP to CPmax, the number of in-citations of the most cited paper in the publication set.
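As a minimal sketch of the three score functions described above, the following Python code computes them on a toy citation graph. The `cites` mapping, the function names, the damping factor of 0.85, the fixed iteration count and the max-normalization of hub and authority scores are illustrative assumptions, not details of the experiments reported here.

```python
def citers_of(cites):
    """Invert {paper: [papers it cites]} into {paper: [papers citing it]}."""
    papers = set(cites) | {q for qs in cites.values() for q in qs}
    cited_by = {p: [] for p in papers}
    for p, qs in cites.items():
        for q in qs:
            cited_by[q].append(p)
    return cited_by


def normalized_citation_count(cites):
    """Ratio of each paper's in-citations to those of the most cited paper."""
    cited_by = citers_of(cites)
    c_max = max(len(v) for v in cited_by.values()) or 1
    return {p: len(v) / c_max for p, v in cited_by.items()}


def pagerank(cites, damping=0.85, iterations=50):
    """Power iteration; papers that cite nothing spread their score evenly."""
    cited_by = citers_of(cites)
    papers = list(cited_by)
    n = len(papers)
    out_deg = {p: len(cites.get(p, [])) for p in papers}
    score = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        dangling = sum(score[p] for p in papers if out_deg[p] == 0)
        score = {
            p: (1 - damping) / n
            + damping * (dangling / n
                         + sum(score[q] / out_deg[q] for q in cited_by[p]))
            for p in papers
        }
    return score


def authority_scores(cites, iterations=50):
    """HITS authority scores, rescaled so the top authority scores 1.0."""
    cited_by = citers_of(cites)
    papers = list(cited_by)
    hub = {p: 1.0 for p in papers}
    auth = {p: 1.0 for p in papers}
    for _ in range(iterations):
        # Authority: sum of hub scores of the papers citing p.
        auth = {p: sum(hub[q] for q in cited_by[p]) for p in papers}
        norm = max(auth.values()) or 1.0
        auth = {p: a / norm for p, a in auth.items()}
        # Hub: sum of authority scores of the papers that p cites.
        hub = {p: sum(auth[q] for q in cites.get(p, [])) for p in papers}
        norm = max(hub.values()) or 1.0
        hub = {p: h / norm for p, h in hub.items()}
    return auth


# Toy graph: B, C and D cite A; D also cites B.
toy = {"B": ["A"], "C": ["A"], "D": ["A", "B"]}
counts = normalized_citation_count(toy)
ranks = pagerank(toy)
auths = authority_scores(toy)
```

Even on this toy graph, one paper takes the top score under all three functions while the others receive much lower values, which is a small-scale picture of the skew discussed next.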
Figure 1a-c show that the three score functions,
namely PPageRank, PAuth and PCitCount, are
highly skewed and do not separate scores well. Notice that the most-cited
papers have a score of 1.0. Those papers are very few (less than one
percent). The majority of scores cluster around the 0.1 value because,
in the publication set used, 73.2% of the papers have received two citations
or fewer.
Pan (2006) independently observed the skewness and inseparability
of these functions in computer science and life sciences publication sets
(70,000 documents each). It has also been shown (Redner,
2004; Li and Chen, 2003) that distributions of citation-based
score functions are highly skewed and decay very fast. We think that the
cause is topic diffusion, since scores are computed with respect to the full
publication set. By using the research-pyramid model proposed by Aya
et al. (2005), we normalize scores of publications within their own
research pyramids, which allows for a fair comparative assessment because
publications are compared to their peers in their own research pyramids.
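The idea of normalizing within a research pyramid can be sketched as follows. This is an illustration only: the function name, the `rp_of` mapping from papers to RP identifiers and the choice of dividing by the highest raw score in each RP (mirroring the max-based normalization of the citation count score) are assumptions.

```python
def normalize_within_rps(raw_scores, rp_of):
    """Rescale each paper's raw score by the highest raw score in its own RP."""
    rp_max = {}
    for paper, score in raw_scores.items():
        rp = rp_of[paper]
        rp_max[rp] = max(rp_max.get(rp, 0.0), score)
    normalized = {}
    for paper, score in raw_scores.items():
        top = rp_max[rp_of[paper]]
        normalized[paper] = score / top if top > 0 else 0.0
    return normalized


# Two hypothetical RPs: a densely cited topic (rp1) and a sparsely cited one (rp2).
raw = {"A": 100, "B": 50, "C": 4, "D": 2}
rp_of = {"A": "rp1", "B": "rp1", "C": "rp2", "D": "rp2"}
scores = normalize_within_rps(raw, rp_of)
```

Under global normalization, paper C would score 4/100 = 0.04; within its own pyramid it scores 1.0, because it is the most cited paper among its peers.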
||Fig. 1: Histograms of (a) CitCnt, (b) Auth and (c) PageRank. Score
distribution of the three publication score functions. Publication set used
consists of 15,000 publications from ACM Anthology all from the domain of
PROPERTIES OF RESEARCH PYRAMID MODEL
We have observed three properties of research publications in three separate
data sets, namely, ACM Anthology (AnthP; 15,000 publications) (Al-Hamdani,
2003) and computer sciences and life sciences publication sets (each with
70,000 publications) (Pan, 2006).
Property 1 (maximum citation age):
In online digital libraries (OLDLs), most publications receive most of their in-citations within a fixed number of years after their publication dates. We refer to this value as the Maximum Citation Age and denote it by CAgeMax.
We have observed (Bani-Ahmad et al., 2005a,
b; Pan, 2006) that, in AnthP and the
Computer Sciences and Life Sciences OLDLs, most publications receive 90% of
their in-citations within 10 years of being published, i.e., CAgeMax
= 10. In Property 4 below, we give a tighter bound on the citation age within which
topical similarity within an RP is maintained between citing and cited publications.
Figure 2 presents the citation age distribution in AnthP.
Within ten years of their publication date, publications receive
90% of their citations; that is, publications are highly unlikely to receive
new citations more than 10 years after their publication date. Figure
2 also shows that most papers reach their peak popularity and awareness levels
5 years after publication.
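A citation age threshold of this kind can be estimated from publication years and citation pairs. The sketch below is a simplification under stated assumptions: it pools all citations in the set rather than working per publication, and the function name, the `(citing, cited)` pair format and the `coverage` parameter are hypothetical.

```python
from collections import Counter


def max_citation_age(pub_year, citations, coverage=0.9):
    """Smallest age (in years) by which `coverage` of all citations have arrived.

    `citations` is an iterable of (citing, cited) paper-id pairs.
    Returns None when there are no usable citations.
    """
    ages = Counter(
        pub_year[citing] - pub_year[cited]
        for citing, cited in citations
        if pub_year[citing] >= pub_year[cited]
    )
    total = sum(ages.values())
    if total == 0:
        return None
    seen = 0
    for age in sorted(ages):
        seen += ages[age]
        if seen >= coverage * total:
            return age
    return max(ages)


# Toy data: paper A is cited at ages 1, 2, 5 and 15 years.
years = {"A": 1990, "B": 1991, "C": 1992, "D": 1995, "E": 2005}
pairs = [("B", "A"), ("C", "A"), ("D", "A"), ("E", "A")]
age_for_75_percent = max_citation_age(years, pairs, coverage=0.75)
```

On the toy data, three of the four citations arrive within 5 years, so a coverage of 0.75 yields an age of 5, while a coverage of 0.9 is only reached by the 15-year-old citation.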
In rare cases, publications may cite works older than CAgeMax. It
is found (Ahmed et al., 2004; Case
and Higgins, 2000) that a great proportion of these citations are for historical
reasons, which we interpret as: old cited works (1) have coarse similarity to
citing papers and (2) do not belong in the RP of the citing publication.
Property 2 (topic specificity over time):
Scientific research publications quickly become very topic-specific over time, usually referable via a highly specific topic.
As shown in Fig. 3, an old research pyramid that covers a
certain research topic leads to the instantiation of new research topics and thus
to the creation of new RPs that use techniques proposed in the publications of
the parent RP(s). Again, such old citations carry topical similarity between the
citing and cited publications at a coarse granularity level. Citation
exchanges between different RPs may also occur and are of type "uses", i.e., the citing
paper uses techniques proposed by the cited paper.
||Fig. 2: Citation age distribution curve of AnthP
||Fig. 3: The RP-based model
Codd's paper (E. F. Codd, A Relational Model of Data for Large Shared
Data Banks, Commun. ACM 13(6): 377-387 (1970)) is about the topic "relational model"
and has been cited around 580 times. A new and more specific topic of the 2000s,
say, rank-aware join algorithms (for which a citation to Codd's work is 30+ years old),
is coarsely related to the more general topic "relational model" in that a publication
P in the RP of rank-aware join algorithms that cites Codd's paper uses
the techniques proposed in the RP of the relational model.
Property 3 (topic similarity decay over citation path):
After very small citation path distances, topical similarity between papers decays significantly.
From Fig. 4, in AnthP, after a citation path of length 3,
the topical similarity, as measured by SimRank, significantly decays. We refer
to this value by LMax-TopicDecay. This observation led us to build
RPs of height at most 3 in the experimental results section.
Property 4 (topic similarity decay over citation age):
After a certain citation age, topical similarity between the citing and the cited papers significantly decays.
||Fig. 4: SimRank score change with citation distance
||Fig. 5: SimRank score change with citation age
From Fig. 5, in the AnthP set, after a citation age of about
5 years, the topic similarity between the citing and cited papers decays significantly.
We refer to this value by CAgeMax-TopicDecay. This observation led
us to build RPs in the experimental results section such that the maximum citation
age within an RP is 5 years.
Next we present the two characteristics that identify a research pyramid RP.
RP-property 1 (high topic specificity):
An RP, usually organizable into a pyramid, is a set of publications that represent
a highly specific research topic. We maintain high topic specificity of RPs by
applying properties 3 and 4, in particular by keeping the height of research
pyramids low (property 3). Note that we make no attempt to identify the topic
associated with an RP, as our approach does not need the topics explicitly. In
interactive environments, however, providing topics to users is useful (Ratprasartporn
and Ozsoyoglu, 2007).
RP-property 2 (research pyramid construction):
RPs are arranged into pyramid structures either directly, by using citation
graphs (i.e., the link-based approach) (Aya et al., 2005),
or indirectly, by using the publication times and close proximity of papers (i.e.,
the proximity-based approach).
RESEARCH PYRAMID IDENTIFICATION PROCEDURES
Based on the properties of publications and characteristics of RPs, next we propose two offline research pyramid identification procedures, namely, the Link-Based (LB) and the Proximity-Based (PB) RP identification procedures.
Both procedures start by choosing a candidate root node for an RP, called the cornerstone paper. The paper that is located at the root of a research pyramid receives more citations than others as other publications within the research pyramid are motivated by it and directly or indirectly cite it. Thus, our approach is to identify papers with high in-citations as cornerstone papers (i.e., the roots) of RPs to be constructed.
The link-based procedure locates research pyramids by identifying pyramid-like
structures in the citation graph of the publication set. Within
an individual RP, publications are topically related and motivated by each
other (Fig. 3) (Aya et al., 2005), and we use the four properties
described above to identify citations within RPs, as summarized next.
In AnthP, the average number of citations to a paper (in-citations), denoted by $C_I$, is 2.066. Note that, in our experiments, we consider only the citations that are completely within AnthP; any citation from a paper within AnthP to a paper that is not in AnthP is removed. Using Property 3 and RP-Property 1, we limit RP heights to 3. Thus, the expected number of papers within a research pyramid $RP_P$ with paper P as the root and with height 3 is $|RP_P| = 1 + C_I + C_I^2 + C_I^3 \approx 15$. Of course, the actual identified RP sizes (the number of papers in $RP_P$) vary. Some RPs may deal with active research topics and, in such cases, the numbers of in-citations of publications are noticeably higher than $C_I$, leading to noticeably larger RP sizes as well.
Figure 6a presents the link-based LB-IdentifyRP() procedure,
which utilizes citation relationships between publications to identify the research
pyramid structures of the publication set at hand. The procedure LB-IdentifyRP() (1)
selects a cornerstone paper P from the current publication set (originally,
say, AnthP) as an RP root by simply picking the current most-cited publication
(counting only citations that are at most CAgeMax-TopicDecay old, according to property
4 above), (2) calls LB-FormRP() to locate the RP set RPP of P and
(3) eliminates RPP from the current publication set CurrAnthP,
repeating steps (1)-(3) until no publications are left in CurrAnthP.
||Fig. 6: Functions of the LB- and PB-IdentifyRP algorithms. (a) Procedure
LB-IdentifyRP, (b) function ChooseRoot, (c) function LB-FormRP() and (d)
function PB-FormRP()
Note that our approach in this paper is to create distinct and non-overlapping
research pyramids. An alternative approach is to allow overlapping research
pyramids as follows: do not eliminate any papers from the original publication
set (i.e., remove step (3) above); instead, simply color each selected publication
and continue until all publications are colored, so that, when the algorithm
ends, each paper belongs to at least one RP set and possibly more.
The two main functions of the link-based LB-IdentifyRP() procedure are ChooseRoot()
and LB-FormRP(). ChooseRoot() (Fig. 6b) chooses publications
that are cornerstone papers, or roots of research pyramids. The function LB-FormRP()
(Fig. 6c) forms the RPP of a root publication P
by adding direct citers of P (i.e., level-1 citers) into RPP, as well as
indirect citers of P at levels up to LMax; in experiments, we
choose LMax as 3, following property 3. The function
citers(P, l, CAgeMax-TopicDecay) returns the set of publications that
cite P at level l (which is at most LMax) and whose citation age
with respect to P is less than the maximum citation age
CAgeMax-TopicDecay (Properties 1 and 4). In more detail:
||The paper-id pidP of root P, along with its level 0,
is inserted into RPP and into the queue Q, which holds paper-ids for
future expansion together with their distances to the root paper P
||A two-tuple <Pi, l> in Q is dequeued and expanded by locating
direct or indirect citers of Pi, so long as their level with
respect to P is at most LMax-TopicDecay (i.e., 3) and their citation
age with respect to P (the root) is less than the maximum citation age CAgeMax-TopicDecay
(i.e., 5). All expanded publications and their level information with respect to
P are inserted into the queue Q
||The above two steps are repeated until Q is empty; then RPP is returned
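The steps above can be sketched in Python as a breadth-first expansion over citers. This is an illustrative reading of the procedure, not the implementation of Fig. 6: the data layout (`cited_by`, `pub_year`), the function names, the alphabetical tie-break between equally cited roots and the way ChooseRoot counts only in-citations inside the age window are assumptions.

```python
from collections import deque


def lb_form_rp(root, cited_by, pub_year, available, l_max=3, age_max=5):
    """Collect the root plus its citers up to l_max levels away, keeping only
    citers whose citation age with respect to the root is below age_max."""
    rp = {root}
    queue = deque([(root, 0)])
    while queue:
        paper, level = queue.popleft()
        if level == l_max:
            continue  # do not expand beyond the maximum pyramid height
        for citer in cited_by.get(paper, ()):
            if citer in rp or citer not in available:
                continue
            if pub_year[citer] - pub_year[root] >= age_max:
                continue
            rp.add(citer)
            queue.append((citer, level + 1))
    return rp


def recent_in_citations(paper, cited_by, pub_year, available, age_max):
    """Number of still-available citers of `paper` inside the age window."""
    return sum(
        1
        for citer in cited_by.get(paper, ())
        if citer in available and pub_year[citer] - pub_year[paper] < age_max
    )


def lb_identify_rps(cited_by, pub_year, l_max=3, age_max=5):
    """Repeatedly pick the most-cited remaining paper as a root, form its RP
    and remove that RP from the remaining publication set."""
    remaining = set(pub_year)
    rps = []
    while remaining:
        root = max(
            sorted(remaining),
            key=lambda p: recent_in_citations(
                p, cited_by, pub_year, remaining, age_max),
        )
        rp = lb_form_rp(root, cited_by, pub_year, remaining, l_max, age_max)
        rps.append((root, rp))
        remaining -= rp
    return rps


# Toy set: a citation chain A <- B <- C <- D <- H, a late citer E of A,
# and a separate pair F <- G.
cited_by = {"A": ["B", "E"], "B": ["C"], "C": ["D"], "D": ["H"], "F": ["G"]}
pub_year = {"A": 2000, "B": 2001, "C": 2002, "D": 2003,
            "H": 2004, "E": 2010, "F": 2001, "G": 2002}
pyramids = lb_identify_rps(cited_by, pub_year)
```

In the toy set, E is excluded from the pyramid rooted at A by the age limit and H by the height limit, so both end up as roots of their own single-paper RPs. Because each RP is removed from the remaining set, the resulting pyramids do not overlap.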
The proximity-based PB-IdentifyRP() procedure is similar to the link-based one, except that
the function call to LB-FormRP() is replaced by a call to PB-FormRP().
The function PB-FormRP() (Fig. 6d) of the proximity-based
approach utilizes a graph-based proximity measure, namely SimRank (Jeh
and Widom, 2002), to compute similarities between publications. It captures
the RPP of the root publication P by locating publications that are most
similar to P and yet (a) are linked to P by a citation path of length at most
LMax-TopicDecay and (b) have a citation time distance of less than CAgeMax-TopicDecay.
SimRank iteratively computes similarity scores between nodes in a graph G following
the rule that two nodes are similar if they are linked from similar nodes. In
other words, the SimRank similarity between two nodes a and b, S(a, b), is computed
iteratively (until the similarity scores converge) using the formula of Jeh and Widom (2002):
$$S(a, b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} S(I_i(a), I_j(b))$$
where I(a) and I(b) are the sets of sources of in-links of a and b, respectively,
$I_i(a)$ is the i-th element of I(a) and C
is the decay factor between 0 and 1. We choose C = 0.8 (Jeh
and Widom, 2002). If |I(a)| = 0 or |I(b)| = 0, then S(a,
b) = 0 by definition; in the case where a = b, S(a, b) = 1. The space complexity
of the naive SimRank algorithm is $O(N^2)$, where N is the graph size
(the citation graph in the publication domain). We prune as in Jeh
and Widom (2002) by considering only node pairs that are near each other, within
a radius r. We choose r = 6, which is twice the value of the expected
research pyramid height explained earlier.
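A naive version of the SimRank iteration can be sketched as follows. It is a minimal illustration that stores all node pairs, runs a fixed number of iterations instead of testing for convergence and omits the radius-based pruning; the function name and the `in_links` layout (each paper mapped to the list of papers citing it, with every paper present as a key) are assumptions.

```python
from itertools import product


def simrank(in_links, c=0.8, iterations=10):
    """Naive SimRank over all node pairs.

    in_links maps every node to the list of nodes linking to it
    (for a citation graph: the papers citing it).
    """
    nodes = list(in_links)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
            elif not in_links[a] or not in_links[b]:
                new_sim[(a, b)] = 0.0  # no in-links: similarity is 0 by definition
            else:
                total = sum(
                    sim[(i, j)] for i in in_links[a] for j in in_links[b]
                )
                new_sim[(a, b)] = (
                    c * total / (len(in_links[a]) * len(in_links[b]))
                )
        sim = new_sim
    return sim


# Papers A and B are both cited by C; A is additionally cited by D.
scores = simrank({"A": ["C", "D"], "B": ["C"], "C": [], "D": []})
```

In this example, S(A, B) = 0.8 × (S(C, C) + S(D, C)) / (2 × 1) = 0.8 × (1 + 0) / 2 = 0.4: the two papers are similar because they share a citer, even though neither cites the other.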
PB-FormRP() receives as input the root P, the maximum level LMax from root and utilizes the maximum citation age CAgeMax-TopicDecay (as 5) and returns the RP set RPP of publication P following the same main steps of LB-FormRP() with one main difference: the way the two-tuple <Pi, l> dequeued from Q is expanded, as follows:
||The top |Citers(Pi, l, CAgeMax-TopicDecay)| papers
most similar to Pi, based on SimRank, are identified. The number
of citers of Pi is used to capture the density of the RP being
identified and thus to expand the RP at Pi accordingly
||The identified similar papers are added to RPP and also enqueued
in Q for further expansion, this time with the level increased by 1. As in
LB-FormRP(), a maximum level of LMax-TopicDecay (which is
3) is employed
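The proximity-based expansion can be sketched in the same style. This is a hedged illustration rather than the function of Fig. 6d: it takes precomputed similarity scores as a dictionary `sim` keyed by paper pairs, uses the expansion level as a stand-in for citation path length, and all names and toy values are hypothetical.

```python
from collections import deque


def pb_form_rp(root, cited_by, pub_year, sim, l_max=3, age_max=5):
    """Expand an RP from `root` by repeatedly adding the k papers most similar
    to the paper being expanded, where k is its number of in-window citers."""
    rp = {root}
    queue = deque([(root, 0)])
    while queue:
        paper, level = queue.popleft()
        if level == l_max:
            continue
        # k: citers of `paper` whose age with respect to the root is in range.
        k = sum(
            1
            for citer in cited_by.get(paper, ())
            if pub_year[citer] - pub_year[root] < age_max
        )
        candidates = [
            q
            for q in pub_year
            if q not in rp
            and abs(pub_year[q] - pub_year[root]) < age_max
            and sim.get((paper, q), 0.0) > 0.0
        ]
        # Most similar first; ties broken by paper id for determinism.
        candidates.sort(key=lambda q: (-sim[(paper, q)], q))
        for q in candidates[:k]:
            rp.add(q)
            queue.append((q, level + 1))
    return rp


# Toy data: B co-exists with root A (same year, no citation path between them).
pub_year = {"A": 2000, "B": 2000, "D": 2001, "E": 2002, "F": 2001, "Z": 2010}
cited_by = {"A": ["D", "E"], "B": ["D"]}
sim = {("A", "B"): 0.6, ("A", "D"): 0.3, ("A", "F"): 0.05,
       ("A", "Z"): 0.9, ("B", "D"): 0.2}
rp = pb_form_rp("A", cited_by, pub_year, sim)
```

With these made-up similarity values, the co-existing paper B is placed in the pyramid of A although no citation path connects them, the weakly similar paper F is left out because only the top two candidates are taken, and Z is rejected by the time-distance limit despite its high similarity.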
The advantage of PB-FormRP() over LB-FormRP() is that it successfully captures
co-existing members of an RP as well as those that are not reachable through any
citation path from the RP's root (as shown in Fig. 3 above).
We give an example.
Figure 7 shows two RPs, RP1 and RP2.
RP1 contains two co-existing roots A and B. Such a case occurs when
two researchers work on the same problem simultaneously. At some point in our
RP identification process, A will probably be recognized as the root of a new
RP, say RP3, as it has more in-citations than B. Since B is
not reachable through any path from A, LB-FormRP() will fail to identify B as
a member of RP3. PB-FormRP() will succeed in placing both A and B into
RP3 in this case, as B is very similar to A. A similar problem will
be observed with paper C, which is not reachable through any path from the root.
Furthermore, LB-FormRP() may incorrectly identify F, which probably uses a technique
proposed in A, as a member of RP3 when F is really a member of RP2,
which co-exists with RP3. PB-FormRP() successfully excludes F from
RP3, as F is not similar to A or to any of RP3's members,
based on SimRank.
||Fig. 7: Examples where PB-FormRP() is more successful than LB-FormRP()
We observe here that PB-FormRP() may capture pyramid-like structures, but not
exactly pyramid structures. SimRank computes similarity between two papers P1
and P2 by averaging the similarity of the citers of both. However,
note that similar papers to a member of an RP will be the other members of the
same RP since members of an RP are usually cited by each other (as they are
motivated by each other).
EMPIRICAL EVALUATIONS OF SCORE FUNCTIONS
AnthP, utilized as the OLDL testbed here, is a publication set of 14,891 publications from the ACM SIGMOD Anthology. After eliminating citations to papers outside AnthP, the average in-citations per AnthP paper is 2.066.
The three citation-based publication score functions (PageRank, Authorities
and Citation count) have separability (high skew) and accuracy problems. We
have observed that 99% of AnthP publications have scores below 0.1. This is
because in-citations conform to a power-law distribution, which describes
the scale invariance found in many natural phenomena, including publication citation
graphs. As for low accuracy (probably due to the topic diffusion problem (Haveliwala,
2002)), different research topics differ in their citation graph densities.
Thus, a paper P's chances of receiving new citations depend on how dense
the citation graph of the research topic of P is.
AnthP RPs (which represent specific research topics) have an almost normal distribution
of the average in-citations received by members of an RP (Fig. 8).
||Fig. 8: Variance of citation-graph densities in different topics.
(a) LB and (b) PB
||Fig. 9: Observed RP sizes by LB-IdentifyRP and PB-IdentifyRP. (a)
LB sizes and (b) PB sizes
||Fig. 10: Score distributions of PageRank normalized within RPs. (a)
LB PageRank and (b) PB PageRank
For separability, first we verify the RP model on the AnthP set. We have experimentally
observed that only 3.32% of SimRank scores are higher than 0.1, indicating that
AnthP is highly clustered.
The average size of an AnthP RP is 15.
Figure 9a and b show the distribution of
the observed RP sizes within AnthP. Note that the PB approach identified larger
RPs, as it can identify co-existing RP roots and members that are not reachable
through any citation path from the roots.
Figure 10a and b show that the PPgRank-LB
and PPgRank-PB publication scores distribute much better over the
interval [0, 1]. As for the citation-count-based scores, PCitCnt-LB
and PCitCnt-PB, Fig. 11a and b
show that they also distribute much better over the interval [0, 1].
For RP-based scores, the observed skew values (Table 1) range
between -0.05 and 1.88 (zero skew indicates that
the distribution is symmetric).
In comparison, the original scores showed high skew values that range between 8.12 and 13.04; such large positive values mean that the distributions are sharply right-skewed, with most of the mass at low scores and a long tail toward high scores.
For RP-based scores, kurtosis values (which measure how sharply peaked a distribution is) range between -0.26 and 2.65 (near-zero kurtosis values indicate normally peaked data).
In comparison, in the case of globally normalized scores, kurtosis values range between 113.28 and 291.10. The improvement in score distribution comes from the fact that publications are being compared to their peer groups, i.e., publications that belong to the same scope and thus have the same chances of receiving new citations.
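For readers who want to reproduce such statistics, the sketch below computes skewness and excess kurtosis from a list of scores. It uses the simple population (biased) moment estimators, which is an assumption: the estimator behind the reported values is not specified here, and the function name is hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) using population central moments.

    Excess kurtosis subtracts 3, so a normal distribution scores about 0.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Nine low scores and one top score, mimicking a globally normalized score set.
skew, kurt = skewness_and_excess_kurtosis([0.1] * 9 + [1.0])
```

A score set in which almost every paper sits near 0.1 and a single paper scores 1.0 yields a clearly positive skewness and excess kurtosis, whereas a symmetric set such as [1, 2, 3, 4, 5] has zero skewness.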
The above observations on PageRank (PPgRank, PPgRank-LB,
PPgRank-PB) also apply to Authorities scores (PAuth, PAuth-LB,
PAuth-PB). Here, we report only PageRank-related results, as we have
observed that PAuth and PPgRank scores are highly correlated,
with a correlation coefficient of 0.98, while the correlation between PPgRank
and PCitCnt is 0.74 (Bani-Ahmad et al.,
Each author in AnthP is associated with (i.e., authors papers in) an average of 2.19 LB
and 2.16 PB research pyramids, respectively (Fig. 12a, b).
This indicates that publications within an RP are highly related and, thus, the identified RPs are accurate.
We used expert knowledge in the data management field to manually evaluate the accuracy of searching via RPs. For this purpose, we built a prototype keyword-based search system that:
||Sends search keywords to Microsoft's Fulltext Search
engine (MsFTS), which indexes the titles of AnthP publications. In turn,
MsFTS generates a list of relevant publications (the result set) along with
rank values (which measure text-based relevancy between the publications
and the search keywords)
||For each publication p in the result set, aggregates p's
rank value returned by MsFTS with its score, measured in two ways, namely
globally normalized PageRank and LBPageRank. We refer to this final score
as the quality of paper p, or Q(p). The quality scores are then used to sort
the search output list so that high-quality results appear at the top. The
idea behind this aggregation is to push down publications that have high
PageRank/LBPageRank scores and yet also have low rank values Rank(p), i.e.,
low relevancy to the search keywords. Q(p) is computed according to the
||Fig. 11: Score distributions of citation-count-based scores normalized within
RPs. (a) LB CitCnt and (b) PB CitCnt
||Fig. 12: Distribution of the number of RPs associated with each author. (a)
LB and (b) PB
||Table 1: The means, interquartile ranges (IQR), skewness and kurtosis
values of the publication score functions
||Fig. 13: Distribution of the quality values of the search results
We performed multiple searches and manually evaluated the accuracy of our system's
outputs. We observed that LBPageRank-based quality scores resulted in 16-25%
more accurate search outputs than the PageRank-based quality scores.
Observation: Quality scores Q(p) that are computed using
RP-based PageRank distribute much better than those computed using the globally normalized
PageRank (Fig. 13).
The accuracy of search outputs was measured for the top-k publications in the
result sets, where k is 10. In Tables 2 and 3,
we report our observations on one search experiment for the keywords "complexity
of join". Each publication in Tables 2 and 3
was evaluated by several domain experts, who assigned a score between 0 and 10,
where a score of 0 indicates no relevancy to the search terms and a score of 10
indicates complete relevancy. Integer scores between those two extremes indicate
different levels of relevance.
||Table 2: Sample results of the "complexity of join" query. Quality is
computed using RP-based PageRank and is shown along with the average relevancy
scores assigned by experts
||Table 3: Sample results of the "complexity of join" query. Quality is
computed using the globally normalized PageRank and is shown along with the
average relevancy scores assigned by experts
Quality scores of search results distribute better when computed based on RP-based
publication score functions (Tables 2 and 3).
The average expert relevancy scores assigned to the publications of
Table 2 and those of Table 3 are 7.07 and
5.77, respectively. This observation indicates that searching
via RP-based publication scores is more accurate than searching via globally
normalized publication scores.
THE CASE EXPLORER PROJECT
The research conducted in this study is part of the CASE EXPLORER project (2003-2008). The project is resumed by Sulieman Bani-Ahmad at Al-Balqa Applied University in Jordan. The CASE EXPLORER is a score-guided searching and querying prototype portal for ACM SIGMOD Anthology, a digital library for the database systems research community, containing about 15,000 papers. CASE EXPLORER has a powerful user interface that allows users to pose score-guided ad hoc queries to search the Anthology, automatically computes the scores of query results from the scores of database objects (papers, authors, publication venues) and returns either the top-k results or results with high scores. CASE EXPLORER database is built by extracting metadata from the Anthology, storing it in a database, deriving multiple scores for papers, authors and publication venues. Propagating database scores to query outputs is achieved by a unique score propagation methodology. A rich set of queries are offered to users using a powerful and innovative user interface that allows users to add arbitrarily many conditions to their queries.
As an extension of the CASE EXPLORER project, Bani-Ahmad resumed the project in Jordan and is currently working on enhancing example-based search in literature digital libraries.
We would like to thank the reviewers for their very valuable comments, which have substantially improved the paper in both substance and presentation. We would also like to thank the CASE EXPLORER project team at Case Western Reserve University, namely Nattakarn Ratprasartporn and Ali Cakmak, who helped enrich the paper through their discussions and comments. The project is financially supported by the National Science Foundation (NSF) grant number 8/2009-8/2012 and partially supported by the School of Graduate Studies at Al-Balqa Applied University in Jordan.
In this study, we validated the research pyramid model proposed by Aya
et al. (2005). We proposed two algorithms to identify the research
pyramids of a given collection. We also used the research pyramid model and
the identified research pyramids to address the separability and accuracy problems
of publication score functions. We showed that normalizing publication scores
within their research pyramids provides more accurate and more separable (less
skewed) scores. Moreover, we showed that ranking search results by these scores
promises higher accuracy than ranking by globally normalized publication
scores, owing to the reduction of the topic diffusion effect.