PaW: Implementation Document

CSE 445 – WWW Search Engines - Spring 2007 – Project 3

Authors: Fabiana Prabhakar and YaoShuang Wang

 

 

1.   Introduction

 

During Project 2 we incorporated text-based multi-term retriever and ranking system with a Web (CGI) interface. Our third project consists of extending the Project 2 implementation to incorporate link-based authority ranking. The remainder of this document is organized as follows: Section 2 presents the indexed data that was used for retrieval. Section 3 lists the resources chosen for the implementation. Section 4 shows the PaW interface. Section 5 describes PaW’s architecture. Section 6 lists the algorithms we implemented. Section 7 lists the changes to the indexer in order to support ranking. Section 8 lists the changes to the Client/Server module. Section 9 lists the main files used by our Search Engine. Section 10 points some concerns about the Scalability of our solution. Section 11 shows a comparison between TFIDF and PaW score performance. Section 12 lists the necessary steps to run the system. Section 13 illustrates how the team is organized. Section 14 lists the documentation and Section 15 lists the references.

 

 

2.   Data

 

We used a set of documents from a recent crawl of the UK. The compressed files can be found in /proj/searchengines/data/summary*.warc.gz which accounts for eight archives containing approximately 440k documents each.

 

 

3.   Resources

 

There is a large variety of programming languages that could be taken in consideration for this project. The following sections list the languages we chose, explain why we chose them.

Java and C were the core languages for our implementation. Our choice was mainly based on the fact that they were the languages used for RankIt, the implementation we are going to build upon.

 

 

3.1. Java

 

All the modules but the Parser are written in Java. We considered our past developing experience in Java as a factor in our decision.

 

3.2. C

 

C is the language used for the Parser. It’s fast at dealing with a large amount of data according to the experience of two previous projects. We continued using C for parsing as needed by the ranking algorithm and linkage analysis algorithm.

 

 

4.   PaW Interface

 

The CGI is a Perl interface. It allows a user to send a query to the search engine and check the results.

 

PaW CGI: http://wwwtest.cse.lehigh.edu/~ffp206/PaW.cgi

 

 

5.   PaW Architecture

 

5.1. Overall Architecture

 

The Server loads the offsets to the Documents List and Postings List to memory during startup. The Retriever is responsible to call the Ranker’s appropriate method depending on the chosen algorithm (TF-IDF or PaW Score) through the CGI interface.

The Ranker accesses the postings list, sorts the results and returns them to the Retriever. The Retriever is responsible to add additional information such as title and URL to each document and send the results back to the Server.

 

 

 

5.2. Offline Processes

 

In addition to the Parser, we used some offline processes for additional tasks, such as:

 

·      In/Outlinks generation;

·      PageRank calculation;

·      TF-IDF calculation;

·      Anchor text (including extra text within the sliding window) weighting;

·      Final postings list generation.

·      Extraction of document information such as title and URL for the retriever.

 


5.3. URL Normalization [1][2]

 

Table 1 lists the URL Normalization that will be implemented for PaW.

 

Normalization

Equivalent URLs

Replace %7E by tilde

 

http://www.this%7Eab.com/

http://www.this~ab.com/

Case Normalization (just for the host)

 

HTTP://www.EXAMPLE.com/AB

http://www.example.com/AB

Replace space by %20

 

http://www.this ab.com/

http://www.this%20ab.com/

Remove Dot Segments

 

/a/b/c/./../../g

/a/b/c/g

 

Scheme-Based Normalization

 

http://example.com

 

http://example.com/

http://example.com:/

http://example.com:80/

Table 1 - PaW URL Normalization

 

 

6.   Ranking Algorithms

 

The PaW Score is a combined score considering TF-IDF, PageRank and the Modified CLEVER scores.

 

6.1. TF-IDF

 

The TF-IDF weight was calculated offline and stored in the postings list.

The term frequency in the given document is simply the number of times a given term appears in that document. This count was normalized to prevent a bias towards longer documents. The inverse document frequency is a measure of the general importance of the term.

 

Term frequency: tfi = ni / ∑k nk

Inverse document frequency: idfi = log ( |D| / |{d : d ti}|

with

 

    * |D| : total number of documents in the corpus

    * |{d : d ti }|: number of documents where the term ti appears (that is n ≠ 0).

 

tfidf  = tf * idf

 

 

 

6.2. PageRank

 

We implemented PageRank with random surfer d=0.85  by the following equation:

 

 

     d is the damping factor, we use 0.85 in our implementation

     li,j represent the page i links to page j

     Wi and Wj represent the authority value of page i and j.

     ni is the total outlinks of page i.

At every iteration, page j’s authority is the sum of its inlinks’ source page i’s authority divided by i’s outlink size times d plus 1-d.

We initially assign each page’s authority to be 1, dampening factor 0.15. The final rank has a value always greater than 0.15.

Dangling nodes:

There are pages with incoming links but no outgoing links. The nodes without outgoing links are called dangling nodes.  The existence of the dangling nodes will cause rank sink in the PageRank calculation. We consider the proper dealing with dangling node will improve our calculated PageRank quality. We chose to implement similar to jump-weighted approach for dealing with dangling nodes. We collect the dangling nodes’ total authority score and do biased redistribution so that dangling pages doesn’t receive less of this rank. The dangling nodes will only get the random surfer dampening factor and the authority from its incoming links. The non-dangling node will get extra authority of the average dangling authority.

Performance:

One iteration of PageRank calculation takes about 11267 ms for 3370169 links. We used the Euclidean Distance [3] as a measure of convergence. After the distance of two PageRank is smaller than 0.1, we finish the calculation and consider it to be converged. It took 46 iterations for PageRank to converge within distance of 0.1.

Figure 6.1 Convergence of PageRank with avgDanglingAuthority.

Figure 6.2 Convergence of PageRank without avgDanglingAuthority.

Figure 6.1 is the convergence curve of PageRank value with dangling nodes handling. Figure 6.2 is the convergence curve without handling the dangling nodes. You can see that due to the introduction of average dangling authority, the convergence curve of Figure 6.1 is not smooth, but the PageRank calculation converges faster than the calculation without handling dangling nodes.

 

6.3. Modified CLEVER

 

We choose to index the anchor texts and texts within the sliding window range with the target URL. This provides a global view of target page. The words in the anchor text are given higher weight than the texts in the sliding window. This means anchor text plays a more important role in describing the target page. Every page will be associated with a CLEVER weight according to the current query term. Besides the linkage analysis with PageRank, the modified CLEVER provides more sensitivity for our ranking algorithm.

 

     We index the anchor text with weight 2 with the target URL. We index extra 10 words with weight 1 around the anchor text both backward and forward.

We consider using 0.2*(totalWeight) as the CLEVER score in our implementation to combine with the PageRank and tfidf.

 

7.   RankIt Parser

 

RankIt is the parser we extended. The main features that were added to the RankIt indexing are:

 

     Extract outlinks of each page, and create intermediate inlink file combining the information extracted from each of the eight zip files. The normalization module uses both files in order to create the final links files.

     Modify the anchor text indexer to index texts within the sliding window range.

     While indexing the anchor text to target URL, give different weight according to the text location.

     Extend parser to index terms and create dictionary for the whole collection (eight zip files).

 

 

8.   EzSearch Client/Server Architecture

 

We used EzSearch Client/Server implementation because it is simple and yet has a great performance compared to the others. The minimal socket communication and optimized ranker made EzSearch the best option.

 

The main features that were added to EzSearch Client/Server implementation are:

 

·      Change in the interface to accommodate the choice of ranking algorithm (TF-IDF/ Modified PageRank);

·      Adjust the server structures in order to load the necessary information to the memory;

·      Adjust the Ranker structures.

 

 


9.   Disk Usage

 

The following table lists the files created, a brief description, their format and size.

 

File Name

Description

Format

Size (bytes)

new_doc_url

Maps URLs to docIDs

URL docID

969751026

docinfo

Lists the URL and Title for each docID. Title might contain multiple lines.

>

docID

URL

Title

1052347439

docinfooffset

docinfo offset file.

docID offset

217826129

StopList

Contains the stop list. A term per line.

 

1965

final_rank

Stores the PageRank.

docID PageRank

53731328

doclen

Stores the length of each document.

docID length

38914283

linksfilelist

Contains a list of documents to be processed by CreateInOutLinks (outlinks_*)

file name

88

outlinks_final

Maps a docID to its outlinks.

docID outID outID

568673959

inlinks_final

Maps a docID to its inlinks.

docID inID inID

554276070

contentdict

Postings List

>term docID freq TFIDF PaW

17686057471

contentoffset

Postings List Offset

term offset

75361428

 

 

10.        Scalability

 

The anchor indexer doesn’t scale well with the hashtable growing. Now the indexing of 100k pages takes about one hour. As we are indexing the anchor text with its target URL, the inverted file list keeps the target URLs. Before inserting a new term paired with a URL, we will go through the linked inverted file list and check if we are inserting a new URL. This process slows down with the hashtable grows.

One possible enhancement is to index the anchor text with a hashcode of the target URL and use another hashtable for the mapping between URL hashcode and the real URL.  This double hashing may possibly enhance the scalability of the anchor indexer.

PaW uses two postings list, one for the anchor information (for clever score) and another for all the terms present in the collection. These files must be sorted in order to generate their final versions that combine document term pairs while calculating the ranking scores. We used UNIX sort for such task. The main problem that we faced was that sort creates temporary files and needs almost double the size of the file being sorted available on disk.

For future indexing we recommend splitting the postings list in several different files, based on the term first character. A simple approach is to create a file for each letter of the alphabet (plus a miscellaneous file for the other characters). So, for example, the term “university” is stored in the “u” postings list while the term “cambridge" is stored in the “c” postings list. Smaller files will be definitely easier to manipulate.

 

11.        Ranking Performance Comparison

 

We tested with different queries and checked the ranking among the first 500 results. We compared the ranking results of TFIDF, TFIDF+PageRank, TFIDF+CLEVER and PaW.

For general term queries, PageRank and CLEVER both perform well. PaW performs best for general term queries.

For content specific queries, PageRank, CLEVER and PaW have comparatively good performance.

TFIDF performed poorly for both types of queries.

 

TFIDF

TFIDF

+PR

CLEVER

TFIDF

+CLEVER

PAW

*

1

1

1

1. University of Cambridge

http://www.cam.ac.uk/

NA

2

2

2

2. University of Oxford Home

http://www.ox.ac.uk/

NA

3

3

3

3. The University of Manchester

http://www.manchester.ac.uk/

NA

4

4

4

4. Bristol University homepage

http://www.bris.ac.uk/

*

6

5

5

5. JISC - The Joint Information Systems Committee

http://www.jisc.ac.uk/

NA

5

*

*

 

6. Client Registration  http://admin.applegate.co.uk/applegate/Registers.do?operation=InitRegister

*

7

6

6

7. UCL - London's Global University

http://www.ucl.ac.uk/

NA

8

7

7

8. University of Southampton

http://www.soton.ac.uk/

NA

9

8

8

9. http://www.nhsdirect.nhs.uk/

http://www.nhsdirect.nhs.uk/

NA

10

9

9

10. The Open University

 http://www.open.ac.uk/

Table 11.1 Rank comparison for query “cambridge”

 

 

 


 

TFIDF

TFIDF

+PR

CLEVER

TFIDF

+CLEVER

PAW

38*

1

1

4

1. University of Cambridge

http://www.cam.ac.uk/

NA

2

2

5

2. The University of Edinburgh

http://www.ed.ac.uk/

NA

3

3

9

3. University of Oxford Home page

http://www.ox.ac.uk/

NA

4

4

17

4. The University of Manchester

http://www.manchester.ac.uk/

126*

5

5

19

5. University of Leeds

http://www.leeds.ac.uk/

81*

8

6

24

6. Bristol University homepage

http://www.bris.ac.uk/

NA

9

7

28

7. University of Edinburgh

http://www.ed.ac.uk/contact.html

1

6

*

1

8. Google University Search

http://maps.google.co.uk/intl/en/options/universities.html

2

7

*

2

9. Google University Search

http://images.google.co.uk/intl/en/options/universities.html

NA

10

8

29

10. JISC - The Joint Information Systems Committee

http://www.jisc.ac.uk/

Table 11.2 Rank comparison for query “university”

 

Tables 11.1 and 11.2 list the results for queries “cambridge” and “university” top 500 results:

·      “*” means that the equivalent URL was not found among the top 500 results;

·      “*” followed by a number means that there was a page in the same domain but not the same one;