1. Introduction
During
Project 2 we incorporated text-based multi-term retriever and ranking system
with a Web (CGI) interface. Our third project consists of extending the Project
2 implementation to incorporate link-based authority ranking. The remainder of
this document is organized as follows: Section 2 presents the indexed data that
was used for retrieval. Section 3 lists the resources chosen for the
implementation. Section 4 shows the PaW interface. Section 5 describes PaW’s
architecture. Section 6 lists the algorithms we implemented. Section 7 lists
the changes to the indexer in order to support ranking. Section 8 lists the
changes to the Client/Server module. Section 9 lists the main files used by our
Search Engine. Section 10 points some concerns about the Scalability of our
solution. Section 11 shows a comparison between TFIDF and PaW score
performance. Section 12 lists the necessary steps to run the system. Section 13
illustrates how the team is organized. Section 14 lists the documentation and
Section 15 lists the references.
2. Data
We
used a set of documents from a recent crawl of the UK. The compressed files can
be found in /proj/searchengines/data/summary*.warc.gz which accounts for eight
archives containing approximately 440k documents each.
3. Resources
There is a large variety of programming languages that could be taken in consideration for this project. The following sections list the languages we chose, explain why we chose them.
Java and C were the core languages for our implementation. Our choice was mainly based on the fact that they were the languages used for RankIt, the implementation we are going to build upon.
3.1. Java
All
the modules but the Parser are written in Java. We considered our past
developing experience in Java as a factor in our decision.
3.2. C
C is
the language used for the Parser. It’s fast at dealing with a large amount of
data according to the experience of two previous projects. We continued using C
for parsing as needed by the ranking algorithm and linkage analysis algorithm.
4. PaW Interface
The
CGI is a Perl interface. It allows a user to send a query to the search engine and
check the results.

PaW CGI: http://wwwtest.cse.lehigh.edu/~ffp206/PaW.cgi
5. PaW Architecture
5.1. Overall
Architecture
The Server
loads the offsets to the Documents List and Postings List to memory during
startup. The Retriever is responsible to call the Ranker’s appropriate method
depending on the chosen algorithm (TF-IDF or PaW Score) through the CGI
interface.
The Ranker accesses the postings list, sorts the results
and returns them to the Retriever. The Retriever is responsible to add
additional information such as title and URL to each document and send the
results back to the Server.

5.2. Offline
Processes
In addition
to the Parser, we used some offline processes for additional tasks, such as:
·
In/Outlinks generation;
·
PageRank calculation;
·
TF-IDF calculation;
·
Anchor text (including extra text within the sliding
window) weighting;
·
Final postings list generation.
·
Extraction of document information such as title and
URL for the retriever.
5.3. URL
Normalization [1][2]
Table 1 lists the URL
Normalization that will be implemented for PaW.
|
Normalization |
Equivalent
URLs |
|
|
Replace %7E by tilde |
http://www.this%7Eab.com/ |
http://www.this~ab.com/ |
|
Case Normalization (just
for the host) |
HTTP://www.EXAMPLE.com/AB |
http://www.example.com/AB |
|
Replace space by %20 |
http://www.this ab.com/ |
http://www.this%20ab.com/ |
|
Remove Dot Segments |
/a/b/c/./../../g |
/a/b/c/g |
|
Scheme-Based Normalization |
http://example.com |
http://example.com/ |
|
http://example.com:/ |
http://example.com:80/ |
|
Table 1 - PaW URL Normalization
6. Ranking Algorithms
The
PaW Score is a combined score considering TF-IDF, PageRank and the Modified
CLEVER scores.
6.1. TF-IDF
The
TF-IDF weight was calculated offline and stored in the postings list.
The
term frequency in the given document is simply the number of times a given term
appears in that document. This count was normalized to prevent a bias towards
longer documents. The inverse document frequency is a measure of the general
importance of the term.
Term
frequency: tfi = ni / ∑k nk
Inverse
document frequency: idfi = log ( |D| / |{d : d ∋ ti}|
with
* |D| : total number of
documents in the corpus
* |{d : d ∋ ti }|: number of
documents where the term ti appears (that is n ≠ 0).
tfidf = tf * idf
6.2. PageRank
We implemented PageRank with random surfer d=0.85 by the following equation:
¶ d is the damping factor, we use 0.85 in our implementation
¶ li,j represent the page i links to page j
¶
Wi and Wj represent the authority value of page i and j.
¶
ni is the total outlinks of page i.
At every iteration, page j’s authority is the sum of its inlinks’ source page i’s authority divided by i’s outlink size times d plus 1-d.
We initially assign each page’s authority to be 1, dampening factor 0.15. The final rank has a value always greater than 0.15.
There
are pages with incoming links but no outgoing links. The nodes without outgoing
links are called dangling nodes.
The existence of the dangling nodes will cause rank sink in the PageRank
calculation. We consider the proper dealing with dangling node will improve our
calculated PageRank quality. We chose to implement similar to jump-weighted
approach for dealing with dangling nodes. We collect the dangling nodes’ total
authority score and do biased
redistribution so that dangling pages doesn’t receive less of this rank. The
dangling nodes will only get the random surfer dampening factor and the
authority from its incoming links. The non-dangling node will get extra
authority of the average dangling authority.
One iteration of PageRank calculation takes
about 11267 ms for 3370169 links. We used the
Euclidean Distance [3] as a measure of convergence. After the distance
of two PageRank is smaller than 0.1, we finish the calculation and consider it
to be converged. It took 46 iterations for PageRank to converge within distance
of 0.1.

Figure 6.1
Convergence of PageRank with avgDanglingAuthority.

Figure 6.2
Convergence of PageRank without avgDanglingAuthority.
Figure 6.1 is the convergence curve of PageRank value with dangling nodes handling. Figure 6.2 is the convergence curve without handling the dangling nodes. You can see that due to the introduction of average dangling authority, the convergence curve of Figure 6.1 is not smooth, but the PageRank calculation converges faster than the calculation without handling dangling nodes.
6.3. Modified
CLEVER
We choose to index the anchor texts and texts within the sliding window range with the target URL. This provides a global view of target page. The words in the anchor text are given higher weight than the texts in the sliding window. This means anchor text plays a more important role in describing the target page. Every page will be associated with a CLEVER weight according to the current query term. Besides the linkage analysis with PageRank, the modified CLEVER provides more sensitivity for our ranking algorithm.
¶ We index the anchor text with weight 2 with the target URL. We index extra 10 words with weight 1 around the anchor text both backward and forward.
We consider using 0.2*(totalWeight) as the CLEVER score in our implementation to combine with the PageRank and tfidf.
7. RankIt Parser
RankIt
is the parser we extended. The main features that were added to the RankIt
indexing are:
¶ Extract outlinks of each page, and create intermediate inlink file combining the information extracted from each of the eight zip files. The normalization module uses both files in order to create the final links files.
¶ Modify the anchor text indexer to index texts within the sliding window range.
¶ While indexing the anchor text to target URL, give different weight according to the text location.
¶ Extend parser to index terms and create dictionary for the whole collection (eight zip files).
8. EzSearch Client/Server Architecture
We used EzSearch
Client/Server implementation because it is simple and yet has a great
performance compared to the others. The minimal socket communication and
optimized ranker made EzSearch the best option.
The main features that were added
to EzSearch Client/Server implementation are:
·
Change in the interface to
accommodate the choice of ranking algorithm (TF-IDF/ Modified PageRank);
·
Adjust the server structures
in order to load the necessary information to the memory;
·
Adjust the Ranker structures.
9. Disk Usage
The
following table lists the files created, a brief description, their format and
size.
|
File Name |
Description |
Format |
Size (bytes) |
|
new_doc_url |
Maps URLs to docIDs |
URL docID |
969751026 |
|
docinfo |
Lists the URL and Title for
each docID. Title might contain multiple lines. |
> docID URL Title |
1052347439 |
|
docinfooffset |
docinfo offset file. |
docID offset |
217826129 |
|
StopList |
Contains the stop list. A term
per line. |
|
1965 |
|
final_rank |
Stores the PageRank. |
docID PageRank |
53731328 |
|
doclen |
Stores the length of each
document. |
docID length |
38914283 |
|
linksfilelist |
Contains a list of documents
to be processed by CreateInOutLinks (outlinks_*) |
file name |
88 |
|
outlinks_final |
Maps a docID to its outlinks. |
docID outID outID |
568673959 |
|
inlinks_final |
Maps a docID to its inlinks. |
docID inID inID |
554276070 |
|
contentdict |
Postings List |
>term docID freq TFIDF PaW |
17686057471 |
|
contentoffset |
Postings List Offset |
term offset |
75361428 |
10.
Scalability
The anchor indexer doesn’t scale well with the hashtable growing. Now the indexing of 100k pages takes about one hour. As we are indexing the anchor text with its target URL, the inverted file list keeps the target URLs. Before inserting a new term paired with a URL, we will go through the linked inverted file list and check if we are inserting a new URL. This process slows down with the hashtable grows.
One possible
enhancement is to index the anchor text with a hashcode of the target URL and
use another hashtable for the mapping between URL hashcode and the real
URL. This double hashing may
possibly enhance the scalability of the anchor indexer.
PaW uses two
postings list, one for the anchor information (for clever score) and another
for all the terms present in the collection. These files must be sorted in
order to generate their final versions that combine document term pairs while
calculating the ranking scores. We used UNIX sort for such task. The main
problem that we faced was that sort creates temporary files and needs almost
double the size of the file being sorted available on disk.
For future
indexing we recommend splitting the postings list in several different files,
based on the term first character. A simple approach is to create a file for
each letter of the alphabet (plus a miscellaneous file for the other
characters). So, for example, the term “university” is stored in the “u”
postings list while the term “cambridge" is stored in the “c” postings
list. Smaller files will be definitely easier to manipulate.
11.
Ranking Performance
Comparison
We tested with different queries and checked the ranking among the first 500 results. We compared the ranking results of TFIDF, TFIDF+PageRank, TFIDF+CLEVER and PaW.
For general term queries, PageRank and CLEVER both perform well. PaW performs best for general term queries.
For content specific queries, PageRank, CLEVER and PaW have comparatively good performance.
TFIDF
performed poorly for both types of queries.
|
TFIDF |
TFIDF +PR |
CLEVER |
TFIDF +CLEVER |
PAW |
|
* |
1 |
1 |
1 |
1. University of Cambridge |
|
NA |
2 |
2 |
2 |
2. University of Oxford Home |
|
NA |
3 |
3 |
3 |
3. The University of Manchester |
|
NA |
4 |
4 |
4 |
4. Bristol University homepage |
|
* |
6 |
5 |
5 |
5. JISC - The Joint Information Systems Committee
|
|
NA |
5 |
* |
* |
6. Client Registration http://admin.applegate.co.uk/applegate/Registers.do?operation=InitRegister |
|
* |
7 |
6 |
6 |
7. UCL - London's Global University |
|
NA |
8 |
7 |
7 |
8. University of Southampton |
|
NA |
9 |
8 |
8 |
9. http://www.nhsdirect.nhs.uk/ |
|
NA |
10 |
9 |
9 |
10. The Open University |
Table 11.1 Rank comparison for
query “cambridge”
|
TFIDF |
TFIDF +PR |
CLEVER |
TFIDF +CLEVER |
PAW |
|
38* |
1 |
1 |
4 |
1. University of Cambridge |
|
NA |
2 |
2 |
5 |
2. The University of Edinburgh |
|
NA |
3 |
3 |
9 |
3. University of Oxford Home page |
|
NA |
4 |
4 |
17 |
4. The University of Manchester |
|
126* |
5 |
5 |
19 |
5. University of Leeds |
|
81* |
8 |
6 |
24 |
6. Bristol University homepage |
|
NA |
9 |
7 |
28 |
7. University of Edinburgh |
|
1 |
6 |
* |
1 |
8. Google University Search |
|
2 |
7 |
* |
2 |
9. Google
University Search http://images.google.co.uk/intl/en/options/universities.html |
|
NA |
10 |
8 |
29 |
10. JISC - The
Joint Information Systems Committee |
Table 11.2 Rank
comparison for query “university”
Tables
11.1 and 11.2 list the results for queries “cambridge” and “university” top 500
results:
·
“*” means that the equivalent URL was not found among
the top 500 results;
·
“*” followed by a number means that there was a page in
the same domain but not the same one;