INTRODUCTION
Searching OnLine Literature Digital Libraries (OLDLs) efficiently and effectively is becoming increasingly important as the size and use of OLDLs grow rapidly. Consider the following three OLDLs as examples from the computer science, electrical engineering and life sciences fields:
• 
In computer science, the ACM Digital Library (ACM) has around
one million full-text publications collected over fifty years, all available
to search and download 
• 
In electrical engineering and computer science, IEEE Xplore (IEEE) is
another OLDL that provides its users with web access to more than
1,700 selected conference proceedings 
• 
ScienceDirect (ScienceDirect) is the world’s leading scientific,
technical and medical information resource; launched in 1999, it
celebrated its billionth article download in November 2006 
These numbers suggest that assigning accurate importance scores to publications
and ranking search results accordingly can significantly reduce the time OLDL
users spend searching. Furthermore, accurate and effective publication
rankings are useful for comparative assessments of publications as well
as publication venues and research institutes such as universities. Moreover,
properly ranking authors’ publications may help in comparatively evaluating
scientists as well.
At present, OLDLs lack effective and accurate publication rankings
(Ratprasartporn et al., 2007). For instance,
the ACM Digital Library returns search results in an unexplained order,
which makes the ranking of little use to users (ACM). Moreover, search outputs
of OLDLs tend to suffer heavily from the topic diffusion problem, which
is defined as having a large number of search results from multiple topics that
are not of interest to the current user (Ratprasartporn
and Ozsoyoglu, 2007; Voorhees and Buckley, 2002;
Lin, 2005).
The topic diffusion problem occurs because keyword-based searches produce a
large number of publications over a relatively large number of topics, thereby
producing publication importance scores that are not specific to any topic (Ratprasartporn
and Ozsoyoglu, 2007; Voorhees and Buckley, 2002;
Lin, 2005).
Using social networks or bibliometrics, a number of publication score functions
have been defined in the literature (Brin and Page, 1998;
Kleinberg, 1998; Bani-Ahmad et al., 2005a).
Bani-Ahmad et al. (2005b)
comparatively evaluated several citation-based publication score functions,
including (1) PageRank, proposed by Brin and Page (1998),
(2) Authority scores, proposed by Kleinberg (1998),
both adopted from the WWW research domain and (3) citation-count scores from
the bibliometrics research domain (Chakrabarti, 2003).
Bani-Ahmad et al. (2005a, b)
observed that all three score functions suffer from the separability problem;
that is, none of these scoring functions assigns scores that distribute well
over a given scale, e.g., [0, 1]. Instead, the score distributions of the three
publication score functions are highly skewed (Bani-Ahmad
et al., 2005a, b) and decay very fast (Redner,
2004; Bani-Ahmad et al., 2005a, b),
which makes them much less useful to users for comparative publication
assessment.
This lack of separability is caused by the rich-get-richer phenomenon identified
in (Redner, 2004; Li and Chen, 2003):
a very small number of publications have relatively high numbers of
in-citations and those highly cited publications
have even higher chances of receiving new citations. Further studies show that
these citation-based scoring functions are also not very accurate, probably
because of topic diffusion in search outputs (Haveliwala,
2002).
The research-evolution model proposed by Aya et al.
(2005) suggests that citation relationships between research publications
produce multiple, small pyramid-like structures, where each pyramid represents
a set of publications that are related to a highly specific research topic.
A research pyramid is defined (Aya et al., 2005)
as a set of publications that represent a highly specific research topic and
usually has a pyramid-like structure in terms of its internal citation graph.
Publications within an individual research pyramid (1) are motivated by earlier
publications in the topic area (e.g., this paper is motivated in part by
Ratprasartporn et al. (2007) and Aya
et al. (2005)), or (2) use techniques proposed in publications from
other research pyramids (e.g., this study in part uses some of the techniques
presented in Brin and Page (1998) and Kleinberg
(1998)). Other reasons for citations may also be observed (Aya
et al., 2005).
PROBLEM STATEMENT
In this study, our goals are to (1) provide a solution to the OLDL search output ranking problem caused by topic diffusion, by grouping search outputs at the most-specific (detailed) topic level without identifying the topics themselves, (2) eliminate the low separability problem of score functions and (3) improve the accuracy of three score functions, namely, the PageRank, authorities and citation count score functions. Our approach uses the research pyramid (RP) model to improve the separability and accuracy of publication scores and is based on normalizing publication scores within a limited scope, namely, within individual research pyramids. These improvements come from the fact that publications are compared to their peers within their peer groups, namely, the publications of their own research pyramid, which are on the same topic.
This study proposes and empirically evaluates two approaches to identify research
pyramids. The first, called LBIdentifyRP, uses link-based research pyramid
identification, which captures research pyramids by identifying pyramid-like
structures in the citation graph of the publication set. The second approach,
called PBIdentifyRP, uses proximity-based research pyramid identification,
which utilizes a graph-based proximity measure, namely SimRank (Jeh
and Widom, 2002), to compute similarities between publications and then
restructures the k most similar publications into a research pyramid.
This study’s contributions are:
• 
Validate the research pyramid model of research evolution 
• 
Propose and evaluate two algorithms to identify research pyramids 
• 
Improve publication scores in terms of accuracy and separability via publications’
research pyramids 
As a testbed, we have utilized AnthP, a publication set of 14,891 publications
from the ACM SIGMOD Anthology. Our experimental results show that: 
• 
The complete publication citation graph (of AnthP) is highly clustered 
• 
Each cluster of the complete publication set has a pyramidlike structure
in terms of the citation graph of the cluster 
• 
Each cluster represents a highly specific research topic. Note that the
above three findings validate the research pyramid model proposed by Aya
et al. (2005) 
• 
Topic similarities decay over both the citation age and citation paths 
• 
We used the two topic similarity decay curves to guide the RP construction 
• 
Within RP citation graphs, the average number of in-citations per paper
varies, pointing to the importance of comparative publication scores within
RPs 
• 
Publication scores within RPs are accurate, due to our approach where
each publication is compared only to its peer (research pyramid paper) group 
CITATIONBASED PUBLICATION SCORES
Existing citation-based publication score functions are all based on the notion
of prestige in social networks (Wasserman and Faust, 1994)
and bibliometrics (Chakrabarti, 2003). In this study, we use the following
publication score functions:
PageRank (Brin and Page, 1998) algorithm:
The PageRank score P_{PageRank} of a publication P is recursively computed as the normalized sum of the PageRank scores of the documents citing P.
Authority score of the HITS (Hyperlink-Induced Topic Search) algorithm (Kleinberg, 1998):
Each document P gets two scores, namely a hub score and an authority score. The hub score of P is computed by summing the authority scores of the publications that P cites and the authority score of P, denoted by P_{Auth}, is computed by summing the hub scores of the publications citing P.
Normalized citation count score:
For a particular paper P that receives C_{P} citations, the normalized citation count P_{CitCount} is the ratio of C_{P} to the number C_{Pmax} of in-citations of the most cited paper in the publication set.
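The three definitions above can be illustrated with a minimal sketch on a toy citation graph. The damping factor of 0.85, the fixed iteration counts, the uniform handling of papers without references and the rescaling by the top score are assumptions made for illustration; they are not taken from the cited papers' experimental setups, and the toy graph is hypothetical.

```python
def citation_count_scores(cites):
    """Normalized citation count: in-citations of P over those of the most cited paper."""
    in_count = {paper: 0 for paper in cites}
    for refs in cites.values():
        for cited in refs:
            in_count[cited] += 1
    c_max = max(in_count.values()) or 1  # avoid dividing by zero in a citation-free set
    return {paper: count / c_max for paper, count in in_count.items()}


def pagerank_scores(cites, damping=0.85, iterations=50):
    """Power-iteration PageRank, rescaled so that the top paper scores 1.0."""
    papers = list(cites)
    n = len(papers)
    rank = {paper: 1.0 / n for paper in papers}
    for _ in range(iterations):
        new_rank = {paper: (1.0 - damping) / n for paper in papers}
        for paper, refs in cites.items():
            # Assumption: a paper without references spreads its score over all papers.
            targets = refs if refs else papers
            share = damping * rank[paper] / len(targets)
            for cited in targets:
                new_rank[cited] += share
        rank = new_rank
    top = max(rank.values())
    return {paper: score / top for paper, score in rank.items()}


def authority_scores(cites, iterations=50):
    """HITS authority scores, rescaled so that the top paper scores 1.0."""
    papers = list(cites)
    hub = {paper: 1.0 for paper in papers}
    auth = {paper: 1.0 for paper in papers}
    for _ in range(iterations):
        # Authority of P: sum of the hub scores of the papers citing P.
        auth = {paper: 0.0 for paper in papers}
        for paper, refs in cites.items():
            for cited in refs:
                auth[cited] += hub[paper]
        # Hub of P: sum of the authority scores of the papers that P cites.
        hub = {paper: sum(auth[cited] for cited in cites[paper]) for paper in papers}
        a_max = max(auth.values()) or 1.0
        h_max = max(hub.values()) or 1.0
        auth = {paper: a / a_max for paper, a in auth.items()}
        hub = {paper: h / h_max for paper, h in hub.items()}
    return auth


# Toy citation graph (hypothetical): each paper maps to the papers it cites.
toy = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A"]}
cit = citation_count_scores(toy)
pr = pagerank_scores(toy)
auth = authority_scores(toy)
```

In this toy graph, paper A is cited by every other paper and therefore receives the top score under all three functions, while the uncited papers C and D sit at the bottom of the scale.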
Figure 1a-c shows that the three score functions,
namely, P_{PageRank}, P_{Auth} and P_{CitCount}, are
highly skewed and do not separate scores well. Notice that the papers that are
cited the most have a score of 1.0. Those papers are very few (less than one
percent). The majority of scores cluster around the 0.1 value because,
in the publication set used, 73.2% of the papers have received two citations
or fewer.
Pan (2006) independently observed the skewness and inseparability
of these functions in computer science and life sciences publication sets
(70,000 documents in each) as well. It has also been shown (Redner,
2004; Li and Chen, 2003) that distributions of citation-based
score functions are highly skewed and decay very fast. We think that the
cause is topic diffusion, since scores are computed with respect to the full
publication set. By using the research-pyramid model proposed by Aya
et al. (2005), we normalize scores of publications within their own
research pyramids, which allows for a fair comparative assessment because
publications are compared to their peers in their own research pyramids.
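A minimal sketch of normalizing scores within research pyramids follows. The text does not spell out the normalization formula at this point, so the sketch assumes that each score is divided by the top score of the paper's own pyramid; the function name, the scores and the two pyramids are hypothetical.

```python
def normalize_within_pyramids(scores, pyramids):
    """Rescale each paper's score by the top score in its own research pyramid.

    Assumption: "normalizing within an RP" is taken to mean dividing by the
    maximum score among the RP's members; pyramids are non-overlapping.
    """
    normalized = {}
    for members in pyramids:
        top = max(scores[paper] for paper in members)
        for paper in members:
            normalized[paper] = scores[paper] / top if top > 0 else 0.0
    return normalized


# Hypothetical globally normalized scores: the second pyramid belongs to a
# sparsely cited topic, so all of its papers look unimportant globally.
global_scores = {"A": 1.0, "B": 0.4, "C": 0.05, "D": 0.02, "E": 0.01}
pyramids = [["A", "B"], ["C", "D", "E"]]
rp_scores = normalize_within_pyramids(global_scores, pyramids)
```

After normalization, C becomes the top paper of its own pyramid and D and E spread over the [0, 1] interval instead of being packed near zero, which is the separability effect the approach aims for.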

Fig. 1: 
Histograms of (a) CitCnt, (b) Auth and (c) PageRank. Score
distribution of the three publication score functions. Publication set used
consists of 15,000 publications from ACM Anthology all from the domain of
data mining 
PROPERTIES OF RESEARCH PYRAMID MODEL
We have observed the following properties of research publications in three separate
data sets, namely, ACM Anthology (AnthP; 15,000 publications) (AlHamdani,
2003) and computer sciences and life sciences publication sets (each with
70,000 publications) (Pan, 2006).
Property 1 (maximum citation age):
In online literature digital libraries (OLDLs), most publications receive most of their in-citations within a fixed number of years after their publication dates. We refer to this value as the Maximum Citation Age and denote it by C_{AgeMax}.
We have observed (Bani-Ahmad et al., 2005a,
b; Pan, 2006) that, in AnthP and in the
computer sciences and life sciences OLDLs, most publications receive 90% of
their in-citations within 10 years of publication, i.e., C_{AgeMax}
= 10. Below, in Property 4, we give a tighter bound on the citation age within which
topical similarity within an RP is maintained between citing and cited publications.
Figure 2 presents the citation age distribution in AnthP.
Within ten years of their publication dates, publications receive
90% of their citations; that is, publications are unlikely to receive
new citations more than 10 years after publication. Figure
2 also shows that most papers reach their top popularity and awareness levels
5 years after publication.
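The maximum citation age can be read off a list of citation ages as the smallest age by which a given share of all citations has been received. The sketch below assumes that the citation age is the publication year of the citing paper minus that of the cited paper; the function names and the sample ages are hypothetical.

```python
def citation_ages(citations, year):
    """Age of each citation: year of the citing paper minus year of the cited paper."""
    return [year[citing] - year[cited] for citing, cited in citations]


def max_citation_age(ages, coverage=0.90):
    """Smallest age (in years) by which `coverage` of all citations were received."""
    if not ages:
        raise ValueError("at least one citation age is required")
    ordered = sorted(ages)
    total = len(ordered)
    for received, age in enumerate(ordered, start=1):
        if received / total >= coverage:
            return age
    return ordered[-1]


# Hypothetical ages of ten citations; nine of them fall within 7 years.
sample_ages = [1, 1, 2, 2, 3, 3, 4, 5, 7, 12]
c_age_max = max_citation_age(sample_ages)  # 7
```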
In rare cases, publications may cite works older than C_{AgeMax}. It
is found (Ahmed et al., 2004; Case
and Higgins, 2000) that a great proportion of these citations are for historical
reasons, which we interpret as: old cited works (1) have coarse similarity to
citing papers and (2) do not belong in the RP of the citing publication.
Property 2 (topic specificity over time):
Scientific research publications quickly become very topic-specific over time and can usually be referred to via a highly specific topic.
As shown in Fig. 3, an old research pyramid that covers a
certain research topic leads to the emergence of new research topics and thus
to the creation of new RPs that use techniques proposed in the publications of
the parent RP(s). Again, such old citations carry topical similarity between the
citing and cited publications only at a coarse granularity level. Citation
exchanges between different RPs also occur and are of type "uses", i.e., the citing
paper uses techniques proposed by the cited paper.

Fig. 2: 
Citation age distribution curve of AnthP 

Fig. 3: 
The RPbased model 
Example:
Codd’s paper (E. F. Codd, A Relational Model of Data for Large Shared
Data Banks, Commun. ACM 13(6): 377-387 (1970)) is about the topic relational model
and has been cited around 580 times. A new and more specific topic of the 2000s (i.e.,
a citation to Codd’s work is 30+ years old), say, rank-aware join algorithms,
is coarsely related to the more general topic relational model in that a publication
P in the RP of rank-aware join algorithms that cites Codd’s paper uses
the techniques proposed in the RP of the relational model.
Property 3 (topic similarity decay over citation path):
After very small citation path distances, topical similarity between papers decays significantly.
Figure 4 shows that, in AnthP, the topical similarity, as measured by SimRank,
decays significantly after a citation path of length 3. We refer
to this value as L_{MaxTopicDecay}. This observation led us to build
RPs of height at most 3 in the experimental results section.
Property 4 (topic similarity decay over citation age):
After a certain citation age, topical similarity between the citing and the cited papers significantly decays.

Fig. 4: 
SimRank score change with citation distance 

Fig. 5: 
SimRank score change with citation age 
Figure 5 shows that, in the AnthP set, the topic similarity between the citing
and cited papers decays significantly after a citation age of about 5 years.
We refer to this value as C_{AgeMaxTopicDecay}. This observation led
us to build RPs in the experimental results section such that the maximum citation
age within an RP is 5 years.
Next we present the two characteristics that identify a research pyramid RP.
RP-property 1 (high topic specificity):
An RP, usually organizable into a pyramid, is a set of publications that represent
a highly specific research topic. We maintain the high topic specificity of RPs by
applying Properties 3 and 4, in particular by keeping the height of research
pyramids low (Property 3). Note that we make no attempt to identify the topic associated
with an RP, as our approach does not need the topics explicitly. In interactive
environments, however, providing topics to users is useful (Ratprasartporn
and Ozsoyoglu, 2007).
RP-property 2 (research pyramid construction):
RPs are arranged into pyramid structures either directly by using citation
graphs (i.e., the link-based approach) (Aya et al., 2005)
or indirectly by using the publication times and close proximity of papers (i.e.,
the proximity-based approach).
RESEARCH PYRAMID IDENTIFICATION PROCEDURES
Based on the properties of publications and characteristics of RPs, we next propose two offline research pyramid identification procedures, namely, the Link-Based (LB) and the Proximity-Based (PB) RP identification procedures.
Both procedures start by choosing a candidate root node for an RP, called the cornerstone paper. The paper that is located at the root of a research pyramid receives more citations than the others, as the other publications within the research pyramid are motivated by it and directly or indirectly cite it. Thus, our approach is to identify papers with high numbers of in-citations as cornerstone papers (i.e., the roots) of the RPs to be constructed.
The link-based procedure locates research pyramids by identifying pyramid-like
structures in the citation graph of the publication set. In summary, within
an individual RP, publications are topically related and motivated by each
other (Fig. 3) (Aya et al., 2005) and we use the four properties
presented above to identify citations within RPs, as summarized next.
In AnthP, the average number of citations to a paper (in-citations), denoted by C_{I}, is 2.066. Note that, in our experiments, we consider only the citations that are completely within AnthP; any citation from a paper within AnthP to a paper that is not in AnthP is removed. Using Property 3 and RP-property 1, we limit RP heights to 3. Thus, the expected number of papers within a research pyramid RP_{P} with paper P as the root and with height 3 is |RP_{P}| = 1 + C_{I} + C_{I}^{2} + C_{I}^{3} ≈ 15 when C_{I} is rounded to 2 (about 16 for C_{I} = 2.066). Of course, the actual identified RP sizes (the number of papers in RP_{P}) vary. Some RPs may deal with active research topics and, in such cases, the numbers of in-citations of publications are noticeably higher than C_{I}, leading to noticeably larger RP sizes as well.
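The expected RP size is a geometric sum over the pyramid levels and can be checked with a few lines. The function name is hypothetical; the inputs are the average in-citation count and the RP height stated in the text.

```python
def expected_rp_size(avg_in_citations, height):
    """Expected number of papers in an RP: one root plus C_I**level papers per level."""
    return sum(avg_in_citations ** level for level in range(height + 1))


size_rounded = expected_rp_size(2, 3)    # 1 + 2 + 4 + 8 = 15
size_exact = expected_rp_size(2.066, 3)  # roughly 16.15
```

The estimate is sensitive to the average in-citation count because of the cubic term, which is consistent with the remark that RPs on active research topics can be noticeably larger.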
Figure 6a presents the link-based LBIdentifyRP() procedure,
which utilizes citation relationships between publications to identify the research-pyramid
structures of the publication set at hand. The procedure LBIdentifyRP() (1)
selects a cornerstone paper P from the current publication set CurrAnthP (originally,
say, AnthP) as an RP root, by simply picking the current most-cited publication
(counting only citations whose age is within C_{AgeMaxTopicDecay}, according to Property
4 above), (2) calls LBFormRP() to locate the RP set RP_{P} of P and
(3) eliminates RP_{P} from CurrAnthP; steps (1)-(3) are then
repeated until no more publications are left in CurrAnthP.

Fig. 6: 
Functions of LB and PBIdentifyRP algorithms. (a) procedure
LBIdentifyRP, (b) function ChooseRoot, (c) function LBFormRP() and (d)
function PBFormRP() 
Note that our approach in this paper is to create distinct and non-overlapping
research pyramids. An alternative approach is to allow overlapping research
pyramids as follows: do not eliminate any papers from the original publication
set (i.e., remove step (3) above); instead, simply color each selected publication
and continue until all publications are colored, meaning that, when the algorithm
ends, each paper belongs to at least one RP set and possibly more.
The two main functions of the link-based LBIdentifyRP() procedure are ChooseRoot()
and LBFormRP(). ChooseRoot() (Fig. 6b) chooses publications
that are cornerstone papers, i.e., roots of research pyramids. The function LBFormRP()
(Fig. 6c) forms the RP_{P} of a root publication P
by adding the direct citers of P (i.e., level-1 citers) and the
indirect citers of P up to level L_{Max} into RP_{P}; in our experiments, we
choose L_{Max} as 3, following Property 3. The function citers
(P, l, C_{AgeMaxTopicDecay}) returns the set of publications that
cite P at level l (which is at most L_{Max}) such that the citation age
of the citing paper with respect to P is less than the maximum citation age
C_{AgeMaxTopicDecay} (Properties 1 and 4). In more detail:
• 
Paper-id pid_{P} of the root P, along with its level 0,
is inserted into RP_{P} and into the queue Q, which holds the paper-ids awaiting
expansion together with their distances to the root paper P 
• 
A two-tuple <P_{i}, l> is dequeued from Q and expanded by locating
the citers of P_{i}, so long as their level with
respect to P is at most L_{MaxTopicDecay} (i.e., 3) and their citation
age with respect to P (the root) is less than the maximum citation age C_{AgeMaxTopicDecay}
(i.e., 5). All publications found by the expansion are inserted, with their
levels with respect to P, into the queue Q 
• 
The expansion step is repeated until Q is empty; then RP_{P}
is returned 
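The steps above can be sketched as a breadth-first expansion over the "cited by" relation, wrapped in the root-selection loop of the link-based procedure. This is an illustrative sketch rather than the code of Fig. 6: the function names, the data layout (`cited_by`, `year`), the deterministic tie-breaking and the toy data are assumptions.

```python
from collections import deque


def choose_root(available, cited_by, year, max_age):
    """Pick the remaining paper with the most in-citations younger than max_age."""
    def recent_in_citations(paper):
        return sum(1 for citer in cited_by.get(paper, ())
                   if citer in available and year[citer] - year[paper] < max_age)
    # Assumption: ties are broken by paper id so that the result is deterministic.
    return max(sorted(available), key=recent_in_citations)


def lb_form_rp(root, cited_by, year, available, max_level=3, max_age=5):
    """Collect direct and indirect citers of root, bounded by level and citation age."""
    pyramid = {root}
    queue = deque([(root, 0)])
    while queue:
        paper, level = queue.popleft()
        if level == max_level:
            continue  # citers of this paper would exceed the maximum level
        for citer in cited_by.get(paper, ()):
            if citer in pyramid or citer not in available:
                continue
            if year[citer] - year[root] >= max_age:
                continue  # citation age is measured with respect to the root
            pyramid.add(citer)
            queue.append((citer, level + 1))
    return pyramid


def lb_identify_rp(papers, cited_by, year, max_level=3, max_age=5):
    """Repeatedly pick a root, form its RP and remove the RP from the current set."""
    available = set(papers)
    pyramids = []
    while available:
        root = choose_root(available, cited_by, year, max_age)
        pyramid = lb_form_rp(root, cited_by, year, available, max_level, max_age)
        pyramids.append((root, pyramid))
        available -= pyramid
    return pyramids


# Hypothetical data: E cites A eight years later and is left out of A's pyramid.
year = {"A": 2000, "B": 2001, "C": 2002, "D": 2003, "E": 2008, "X": 2001, "Y": 2002}
cited_by = {"A": ["B", "C", "E"], "B": ["D"], "X": ["Y"]}
pyramids = lb_identify_rp(year.keys(), cited_by, year)
```

With these inputs the sketch returns three non-overlapping pyramids rooted at A, X and E; the last one contains only E, since its single citation is older than the maximum citation age.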
The proximity-based PBIdentifyRP() is similar to the link-based procedure, except that
the function call to LBFormRP() is replaced by the function call PBFormRP().
The function PBFormRP() (Fig. 6d) of the proximity-based
approach utilizes a graph-based proximity measure, namely SimRank (Jeh
and Widom, 2002), to compute similarities between publications. It captures
the RP_{P} of the root publication P by locating publications that are most
similar to P and yet (a) are linked to P by a citation path of length at most
L_{MaxTopicDecay} and (b) have a citation time distance less than C_{AgeMaxTopicDecay}.
SimRank iteratively computes similarity scores between nodes in a graph G following
the rule that two nodes are similar if they are linked from similar nodes. In
other words, the SimRank similarity between two nodes a and b, S(a, b), is iteratively
computed (until the similarity scores converge) using the formula of Jeh and Widom (2002):
$$S(a, b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} S(I_{i}(a), I_{j}(b))$$
where I(a) and I(b) are the sets of sources of the in-links of a and b, respectively,
I_{i}(a) is the i-th element of I(a) and C
is the decay factor between 0 and 1. We choose C = 0.8 (Jeh
and Widom, 2002). If I(a) or I(b) is empty, then S(a,
b) = 0 by definition; in the case where a = b, S(a, b) = 1. The space complexity
of the naive SimRank algorithm is O(N^{2}), where N is the graph size
(the size of the citation graph in the publication domain). We prune as in Jeh
and Widom (2002) by considering only node pairs that are within
a radius r of each other. We choose r = 6, which is twice the value of the expected
research pyramid height, as explained earlier.
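A minimal sketch of the naive SimRank iteration of Jeh and Widom (2002) follows, in which the similarity of two distinct nodes is the decay factor times the average similarity of their in-link sources. It uses a fixed number of iterations instead of a convergence test and omits the radius-based pruning; the function name and the toy graph are hypothetical.

```python
def simrank(in_links, decay=0.8, iterations=10):
    """Naive SimRank over all node pairs; in_links[a] lists the sources of a's in-links.

    In the publication domain, the in-link sources of a paper are its citers.
    No pruning is applied, so time and space grow quadratically with graph size.
    """
    nodes = list(in_links)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new_sim = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[a][b] = 1.0
                elif not in_links[a] or not in_links[b]:
                    new_sim[a][b] = 0.0  # no in-links: similarity is 0 by definition
                else:
                    total = sum(sim[i][j] for i in in_links[a] for j in in_links[b])
                    new_sim[a][b] = decay * total / (len(in_links[a]) * len(in_links[b]))
        sim = new_sim
    return sim


# Hypothetical graph: papers A and B are both cited by C only.
toy_in_links = {"A": ["C"], "B": ["C"], "C": []}
similarity = simrank(toy_in_links)
```

Here A and B share their only citer, so their similarity equals the decay factor (0.8), while any pair involving the uncited paper C scores 0.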
PBFormRP() receives as input the root P and the maximum level L_{Max} from the root, utilizes the maximum citation age C_{AgeMaxTopicDecay} (set to 5) and returns the RP set RP_{P} of publication P following the same main steps as LBFormRP(), with one main difference: the way the two-tuple <P_{i}, l> dequeued from Q is expanded, as follows:
• 
The top |Citers(P_{i}, l, C_{AgeMaxTopicDecay})| papers that are most
similar to P_{i}, based on SimRank, are identified. The number
of citers of P_{i} is used to capture the density of the RP being
identified and thus to expand the RP at P_{i} accordingly 
• 
The identified similar papers are added to RP_{P} and also enqueued
in Q for further expansion, this time with the level increased by 1. As
in LBFormRP(), a maximum level of L_{MaxTopicDecay} (which is
3) is employed 
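The proximity-based expansion can be sketched in the same breadth-first style, taking at each step as many of the most similar papers as the expanded paper has qualifying citers. This is a simplified illustration, not the code of Fig. 6d: the similarity values are assumed to be precomputed (for example with SimRank), the citation path constraint is approximated by the level bound on the expansion, the time distance is taken as an absolute difference of publication years and all names and data are hypothetical.

```python
from collections import deque


def pb_form_rp(root, similarity, cited_by, year, max_level=3, max_age=5):
    """Grow an RP from root by adding the papers most similar to each expanded paper."""
    pyramid = {root}
    queue = deque([(root, 0)])
    while queue:
        paper, level = queue.popleft()
        if level == max_level:
            continue
        # The number of qualifying citers sets how many similar papers are taken.
        k = sum(1 for citer in cited_by.get(paper, ())
                if 0 <= year[citer] - year[root] < max_age)
        candidates = [
            (score, other)
            for other, score in similarity.get(paper, {}).items()
            if other not in pyramid and score > 0
            and abs(year[other] - year[root]) < max_age
        ]
        candidates.sort(key=lambda pair: (-pair[0], pair[1]))  # most similar first
        for _, other in candidates[:k]:
            pyramid.add(other)
            queue.append((other, level + 1))
    return pyramid


# Hypothetical data: A and B are co-existing roots cited together by D,
# while F cites A but is not similar to it.
year = {"A": 2000, "B": 2000, "D": 2001, "F": 2002}
cited_by = {"A": ["D", "F"], "B": ["D"]}
similarity = {
    "A": {"B": 0.6, "D": 0.3, "F": 0.0},
    "B": {"A": 0.6, "D": 0.3},
    "D": {"A": 0.3, "B": 0.3},
}
rp_a = pb_form_rp("A", similarity, cited_by, year)  # {"A", "B", "D"}
```

In this toy setting B joins the pyramid of A although no citation path connects them, and F is left out despite citing A.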
An advantage of PBFormRP() over LBFormRP() is that it successfully captures
co-existing members of an RP as well as those that are not reachable through any
citation path from the RP’s root (as shown in Fig. 3 above).
We give an example.
Example:
Figure 7 shows two RPs, RP_{1} and RP_{2}.
RP_{1} contains two co-existing roots A and B. Such a case occurs when
two researchers work on the same problem simultaneously. At some point in our
RP identification process, A will probably be recognized as the root of a new
RP, say RP_{3}, as it has more in-citations than B. Since B is
not reachable through any path from A, LBFormRP() will fail to identify B as
a member of RP_{3}. PBFormRP() will succeed in placing both A and B into
RP_{3} in this case, as B is very similar to A. A similar problem will
be observed with paper C, which is not reachable through any path from the root.
Furthermore, LBFormRP() may incorrectly identify F, which probably uses a technique
proposed in A, as a member of RP_{3} when F is really a member of RP_{2},
which co-exists with RP_{3}. PBFormRP() successfully keeps F out of
RP_{3}, as F is not similar to A or to any of RP_{3}’s members,
based on SimRank.

Fig. 7: 
Examples where PBFormRP() is more successful than LBFormRP() 
We observe here that PBFormRP() may capture pyramid-like structures that are not
exactly pyramids. SimRank computes the similarity between two papers P_{1}
and P_{2} by averaging the similarities of their citers. However,
note that the papers similar to a member of an RP will be the other members of the
same RP, since members of an RP usually cite each other (as they are
motivated by each other).
EMPIRICAL EVALUATIONS OF SCORE FUNCTIONS
AnthP, utilized as the OLDL testbed here, is a publication set of 14,891 publications from the ACM SIGMOD Anthology. After eliminating citations to papers outside AnthP, the average number of in-citations per AnthP paper is 2.066.
The three citation-based publication score functions (PageRank, Authorities
and Citation count) have separability (high skew) and accuracy problems. We
have observed that 99% of AnthP publications have scores below 0.1. This is
because in-citations conform to a power law distribution, which describes
the scale invariance found in many natural phenomena, including publication citation
graphs. As for low accuracy (probably due to the topic diffusion problem (Haveliwala,
2002)), different research topics differ in their citation graph densities.
Thus, a paper P’s chances of receiving new citations depend on how dense
the citation graph of the research topic of P is.
Observation:
AnthP RPs (which represent specific research topics) have an almost normal distribution
of the average number of in-citations received by the members of an RP (Fig.
8a, b).

Fig. 8: 
Variance of citation-graph densities in different topics.
(a) LB and (b) PB 

Fig. 9: 
Observed RP sizes by LBIdentifyRP and PBIdentifyRP. (a)
LB sizes and (b) PB sizes 

Fig. 10: 
Score distributions of PageRank normalized within RPs. (a)
LB PageRank and (b) PB PageRank 
For separability, we first verify the RP model on the AnthP set. We have experimentally
observed that only 3.32% of SimRank scores are higher than 0.1, indicating that
AnthP is highly clustered.
Observation:
The average size of an AnthP RP is 15.
Figure 9a and b show the distribution of
the observed RP sizes within AnthP. Note that the PB approach identified larger
RPs, as it can identify co-existing RP roots and members that are not reachable
through any citation path from the roots.
Figure 10a and b show that the P_{pgRankLB}
and P_{pgRankPB} publication scores distribute much better over the
interval [0, 1]. As for the citation-count-based scores, P_{CitCntLB}
and P_{CitCntPB}, Fig. 11a and b
show that they also distribute much better over the interval [0, 1].
Observation:
For RP-based scores, the observed skew values (Table 1) range
between 0.05 and 1.88 (zero skew indicates that
the distribution is symmetric).
In comparison, the original scores showed high skew values that range between 8.12 and 13.04, which means that they are sharply skewed, with most scores concentrated at the low end of the scale (positive skew).
Observation:
For RP-based scores, kurtosis values (which measure how sharply peaked a distribution is) range between 0.26 and 2.65 (near-zero kurtosis values indicate normally peaked data).
In comparison, in the case of globally normalized scores, kurtosis values range between 113.28 and 291.10. The improvement in score distribution comes from the fact that publications are being compared to their peer groups, i.e., publications that belong to the same scope and thus have the same chances of receiving new citations.
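The two shape statistics can be computed from central moments as sketched below. The text does not state which estimators were used, so this sketch assumes the population (biased) moment estimators and reports excess kurtosis, which is zero for normally peaked data; the function name and sample data are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) using population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; the statistics are undefined")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# A symmetric sample has zero skew; a sample where almost all scores are near
# zero and one paper scores 1.0 has a large positive skew.
symmetric_skew, symmetric_kurt = skewness_and_kurtosis([1, 2, 3, 4, 5])
skewed_skew, skewed_kurt = skewness_and_kurtosis([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
```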
The above observations on PageRank (P_{pgRank}, P_{pgRankLB},
P_{pgRankPB}) also apply to Authorities scores (P_{Auth}, P_{AuthLB},
P_{AuthPB}). Here, we report only PageRank-related results because
P_{Auth} and P_{pgRank} scores are highly correlated,
with a correlation coefficient of 0.98, while the correlation between P_{pgRank}
and P_{CitCnt} is 0.74 (Bani-Ahmad et al.,
2005a, b).
Observation:
On average, each author in AnthP has papers in 2.19 LB research pyramids and
2.16 PB research pyramids (Fig. 12a, b).
This indicates that publications within an RP are highly related and, thus, the identified RPs are accurate.
We used expert knowledge in the data management field to manually evaluate the accuracy of searching via RPs. For this purpose, we built a prototype keywordbased search system that:
• 
Sends search keywords to Microsoft’s Full-Text Search
engine (MsFTS), which indexes the titles of AnthP publications. In turn,
MsFTS generates a list of relevant publications (the result set) along with
rank values (which measure the text-based relevancy between the publications
and the search keywords) 
• 
For each publication p in the result set, aggregates p’s
rank value returned by MsFTS with its publication score, measured in two ways, namely
globally normalized PageRank and LB-PageRank. We refer to this final score
as the quality of paper p, or Q(p). The quality scores are then used to sort
the search output list so that high-quality results appear at the top. The
idea behind this aggregation is to push down publications that have high
PageRank/LB-PageRank scores and yet have low rank values Rank(p), i.e.,
low relevancy to the search keywords. Q(p) is computed according to the
following formula: 

Fig. 11: 
Score distributions of citation-count-based scores normalized within
RPs. (a) LB CitCnt and (b) PB CitCnt 

Fig. 12: 
Distribution of No. of RPs annotated with each author. (a)
LB and (b) PB 
Table 1: 
The means, Inter Quartile Ranges (IQR), skewness and kurtosis
values of the publication score functions 


Fig. 13: 
Quality values distribution of the search results 
We performed multiple searches and manually evaluated the accuracy of our system’s
outputs. We observed that LB-PageRank-based quality scores resulted in 16-25%
more accurate search outputs than the PageRank-based quality scores.
Observation: Quality scores Q(p) that are computed using
RP-based PageRank distribute much better than those computed using the globally normalized
PageRank (Fig. 13).
The accuracy of search outputs was measured for the top-k publications in the
result sets, where k is 10. In Tables 2 and 3,
we report our observations on one search experiment for the keywords "complexity
of join". Each publication in Tables 2 and 3
was evaluated by several domain experts, who assigned a score between 0 and 10,
where a score of 0 indicates no relevancy to the search terms and a score of 10
indicates complete relevancy. Integer scores between those two extremes indicate
intermediate levels of relevance.
Table 2: 
Sample results of the "complexity of join" query. Quality is
computed using RP-based PageRank; the average relevancy scores
assigned by experts are also shown 

Table 3: 
Sample results of the "complexity of join" query. Quality is
computed using the globally normalized PageRank; the average relevancy
scores assigned by experts are also shown 

Observation:
Quality scores of search results distribute better when computed based on RP-based
publication score functions (Tables 2, 3).
The average expert relevancy scores assigned to the sample publications of
Table 2 and those of Table 3 are 7.07 and
5.77, respectively. This indicates that searching
via RP-based publication scores is more accurate than searching via globally
normalized publication scores.
THE CASE EXPLORER PROJECT
The research conducted in this study is part of the CASE EXPLORER project (2003-2008). CASE EXPLORER is a score-guided searching and querying prototype portal for the ACM SIGMOD Anthology, a digital library for the database systems research community, containing about 15,000 papers. CASE EXPLORER has a powerful user interface that allows users to pose score-guided ad hoc queries to search the Anthology, automatically computes the scores of query results from the scores of database objects (papers, authors, publication venues) and returns either the top-k results or results with high scores. The CASE EXPLORER database is built by extracting metadata from the Anthology, storing it in a database and deriving multiple scores for papers, authors and publication venues. Database scores are propagated to query outputs by a unique score propagation methodology. A rich set of queries is offered to users through a powerful and innovative user interface that allows users to add arbitrarily many conditions to their queries.
As an extension of the CASE EXPLORER project, Sulieman Bani-Ahmad resumed the project at Al-Balqa Applied University in Jordan and is currently working on enhancing example-based search in literature digital libraries.
ACKNOWLEDGMENTS
We would like to thank the reviewers for their very valuable comments, which have substantially improved the paper in both substance and presentation. We would also like to thank the whole CASE EXPLORER project team at Case Western Reserve University, namely Nattakarn Ratprasartporn and Ali Cakmak, who helped enrich the paper through their discussions and comments. The project is financially supported by the National Science Foundation (NSF) grant number 8/20098/2012 and partially supported by the School of Graduate Studies at Al-Balqa Applied University in Jordan.
CONCLUSIONS
In this study, we validated the research-pyramid model proposed by Aya
et al. (2005). We proposed two algorithms to identify the research
pyramids of a given collection. We also used the research pyramid model and
the identified research pyramids to address the separability and accuracy problems
of publication score functions. We showed that normalizing publication scores
within their research pyramids provides more accurate and more separable (less skewed)
scores. Moreover, we showed that ranking search results by these scores promises
to give higher accuracy than ranking by globally normalized publication
scores, owing to the reduced effect of topic diffusion.