WeKanSerch: Implementation Document  

CSE 345/445 – WWW Search Engines - Spring 2007 – Project 1

Authors: Fabiana Prabhakar; Jay Shipper; Joe Souto; Megan Smith

 

1. Introduction

Modern web search engines are a complex set of modules that interact in order to give the user relevant results in a reasonable amount of time. Our first project consisted of building a parser and an indexer for our WeKanSerch search engine. The remainder of this document is organized as follows: Section 2 presents the data that was processed and indexed. Section 3 lists the resources we chose for the implementation and why. Section 4 describes WeKanSerch architecture. Section 5 details our project’s results, such as its memory and disk usage and its performance.  Section 6 lists the documentation, and Section 7 lists the references.

2. Data

We used a subset of documents from a recent crawl of the UK. The compressed file can be found in /proj/searchengines/data/summary0-400.warc.gz which accounts for 440k documents.

 

3. Resources

The following sections list the languages we chose, explain why we chose them, and list which libraries we used.

 

3.1. Java

 

Java was the core language for our implementation. Our choice was mainly based on the fact that Java offers great HTML and API implementation support. Its well-designed, intuitive set of APIs promised to considerably reduce development time, also leading to a better code with fewer bugs than other platforms. We also considered our group’s past experience developing in Java as a factor in our decision.

 

The Java classes do all the major processing of the HTML documents.  They extract words for our indices and save them into files.  After all the files are created, the indices are compressed using Elias’s Gamma Code (see Section 4.7).  Also, the API, which interfaces with all the files, is implemented in Java (see Section 4.8).  Finally, Java provides the test program for the API.

 

3.1.1. Libraries

 

·       The Swing HTML Parser

 

The HTMLEditorKit class offers an efficient set of methods for HTML parsing support.

 

·      HashMap Class

 

The HashMap class will be used as the main structure to manipulate the index and posting list information.

 

·      BitSet Class

 

The BitSet class implements a vector of bits that grows as needed. This class will be used for bit manipulation during compression/de-compression of the posting list.

 

3.2. Perl

 

For the data extraction from the web archived files, we decided that Perl would be the best language for that task due to its superiority in text processing.  It has built-in support for regular expression matching, allowing for a simple method of obtaining the essential content of the HTML files.

 

Our Perl script acts as a tool for the main Java indexer; the script analyzes the web archived file and returns the relevant URLs and hypertext to the Java program.  We run the Perl script in parallel with the Java program.  The two communicate with each other through a socket implementation through a specified port (we used 12345).  The output from the gzcat utility (used on the summary0-400.warc.gz file) is piped into the Perl script, which treats it as standard input.  This allowed for simple processing of the file.  The script waits for messages from the Java program.  Messages start the script, tell it to send the next processed HTML section, and save its progress.  The script stops when it is out of input and sends a message to the Java program to indicate as much.

 

3.2.1 Modules

 

·      IO::Socket

 

This module was key for socket communication with the Java indexer.

 

4. WeKanSerch Architecture

 

4.1. Dictionary Structure

 

TermID

Term

1

Term1

2

Term2

 

The dictionary will hold the list of terms generated by the parser.

This information will be periodically flushed into the file Dictionary.

 

4.2. Global Posting List Structure

 

docID

termID

Positions

1

1

pos1,pos2,pos3

2

1

pos1

3

2

pos1,pos2

4

2

pos5

 

This structure will be implemented using HashMaps, with the term as the key and the remaining information stored in an object that will be mapped using this key.

This information will be periodically flushed into the file PostingList.

 

4.3. Title Posting List and Anchor Posting List

 

docID

termID

Frequency

1

1

10

2

2

20

 

These structures will be implemented using HashMaps, with the term as the key and the remaining information stored in an object that will be mapped using this key.

This information will be periodically flushed into the files TitlePostingList and AnchorPostingList respectively.

 

4.4. Documents List Structure

 

docID

URL

1

http://java.sun.com/j2se/1.3/docs/api/

2

http://www.lehigh.edu

 

The documents list will hold the list of URLs processed by the parser.

This information will be periodically flushed into the file DocumentsList.

 

 

4.5 Links File

 

Source Document

Target Document

1

2

2

3

 

Whenever the HTML Parser finds a link, this file will be updated, storing which documents are linked.

 

4.6. Parsing the Archives, Building the Dictionary of Terms and the Posting List

 

The following pseudo-code illustrates the flow the system will follow in order to parse the files and create the index and posting list.

 

INPUT_FILE = /proj/searchengines/data/summary0-400.warc.gz;

STOP_LIST; //terms that we do NOT want to index;

WHILE (!done) {

Call Perl script to get file from INPUT_FILE

Insert Document to Documents List Structure;

FOR each term {

IF (term is in an anchor text) {

            LINK=the anchor text link;

            IF (LINK in the Document List) {

                        DOC_ID = Document List id;

            } else {

DOC_ID = new id;

Insert Document to Documents List Structure;

            }

            Insert relation into anchor posting list;

Insert LINK into links file;

}

IF (term is in a title){

            Insert relation to the Title Posting List;

}

IF (term in STOP_LIST) then {

break;

} ELSE {

IF (term already indexed) then {

Update index entry for the given term;

Update posting list;

} ELSE {

Add new term to index;

Add new term to posting list;

                                    } //end IF-then-ELSE

                        } //end ELSE

} //end FOR each term

} // end WHILE

 

4.7. Creating the Compressed Postings List

 

The following list represents the steps we implemented in order to create the compressed posting list.

 

1.     Read a term from the posting list

 

2.     Update the offset fields in the dictionary file to point into the compressed index

 

3.     Use Elias’s Gamma Code to compress the information. We are adding a marker sequence of ‘01’ (zero one) between each document as a delimiter.

01<doc1pos1…posN>01<doc2pos1…posN>01…01<docMpos1…posN>01

 

4.8. The API

 

The API offers the interface for the search engine. Given a term, the API interacts with the application in order to find that term in the compressed posting list and return the results.

 

4.8.1 API Functions

 

·      getDocID(String URL);

 

This method is used to retrieve the document ID for a given URL.

 

·      getDocURL(int docID);

 

This method is used to retrieve the URL for a given document ID.

 

·      getTermID(String term);

 

This method is used to retrieve the termID for a given term.

 

·      getTerm (int termID);

 

This method is used to retrieve the term for a given termID.

 

·      searchByTerm (int IndexToSearch, String term);

 

This method is used to retrieve documents that contain a certain term.

 

·      searchByTermID (int IndexToSearch, int termID);

 

This method is used to retrieve documents that contain a certain term ID.

 

·      LinksIntoDocument (String URL);

 

This method is used to retrieve information about the documents that link to the given document (specified by the URL input parameter).

 

·      LinksIntoDocument (int DocumentID);

 

This method is used to retrieve information about the documents that link to the given document (specified by the Document ID input parameter).

 

·      LinksOutOfDocument (String URL);

 

This method is used to retrieve information about the documents that go out of the given document (specified by the URL input parameter).

 

·      LinksOutOfDocument (int DocumentID);

 

This method is used to retrieve information about the documents that go out of the given document (specified by the Document ID input parameter).

 

For more information about the API and how to use it, see Section 6.2 (the API document has been updated since when we turned it in).

 

4.9. Overall Architecture

 

Our parser and indexer will consist of four main components.  The WarcParser component reads a web archive file and parses out each set of HTML code for each document.  It then passes the URL of each document and the corresponding HTML text to the HTMLParser component.  The HTMLParser component parses all of the text passed to it by the WarcParser.  New terms are inserted in a term dictionary and entries are added to the uncompressed indices.  The indices are then handled by the IndexCompressor.  This component applies a compression algorithm, such as Elias’s Gamma Code, to the indices and produces the compressed index files.  The final component receives request via a published API and utilizes the compressed indices for term lookups.  See the diagram below for an illustration of this architecture.

 

4.10. Running the System

 

To run the system, the following steps must be executed:

 

  1. Start the server: gzcat <archive_file_name> | perl server.pl
  2. Start the client: java -classpath  /proj/searchengines/project1/wekanserch wekanserch.Parser <max_num_of_docs> <index_path>
  3. Run cleanupFiles.sh in directory with data files
  4. Make sure the main in IndexCreator has "idxCreator.createPostingsIndex();" as the uncommented line.  Run IndexCreator <inputIndex> <outputIndex> for the general posting list.
  5. Make sure the main in IndexCreator has "idxCreator.createFrequenciesIndex();" as the uncommented line.  Run IndexCreator <inputIndex> <outputIndex> for the anchor index and title index.
  6. Make sure the main in IndexCreator has "idxCreator.createLinksIndex();" as the uncommented line.  Run IndexCreator <inputIndex> <outputIndex> for the input and output links indexes.
  7. Make sure the main in IndexCreator has "idxCreator.createDocsIndex();" as the uncommented line.  Run IndexCreator <inputList> <outputList> for the document list.
  8. For each index, run the compressor in the directory with the compressed index files:
    java /proj/searchengines/project1/wekanserch/wekanserch/Compressor <inputIndex> <outputIndex> <mainDictionary> <outputDictionary>
  9. Run the API in the directory with the compressed index files:
    java -classpath /proj/searchengines/project1/wekanserch wekanserchAPI.APITester

 

The parser is incremental. Whenever the parser is restarted, it starts from the last HTML document parsed and saved to the index. In order to start a new index, the file uid.txt must be deleted along with all the index files previously generated.

 

A cleanup and sorting script is then run on each of the resulting files called cleanupFiles.sh and the IndexCreator program is run on each index to create the final index format and the dictionary files to hold the offsets into the eventual compressed files.

Lastly, the Compressor program is run on each index to convert it into its encoded form.

 

The APITester program will allow a user to test the API functionality.

 

5.  Results

5.1. Memory and Disk Usage

Parser keeps information for a limited number of HTML documents in the memory (parameter max_num_of_docs). The disk contains the complete information necessary for retrieval. The index is stored using the following files (all values are in bytes):

·      Main Dictionary (Dictionary.dict) – 13,988,787

·      Main Document List (Documents.doc) – 31,087,525

·      Uncompressed Title Index (TitlePostingList) – 8,101,683

·      Compressed Title Index (TitlePostingList.wks) – 2,280,965

·      Title Dictionary (Dictionary.title) – 768,197 

·      Uncompressed Posting List (PostingList) – 459,849,381

·      Compressed Posting List (PostingList.wks) – 135,954,222

·      Posting Dictionary (Dictionary.posting) – 22,928,755

·      Uncompressed Anchor Index (AnchorPostingList) – 306,189,811

·      Compressed Anchor Index (AnchorPostingList.wks) – 44,943,977

·      Anchor Dictionary (Dictionary.anchor) – 16,852,640

·      In Links Documents List (Documents.inlinks) – 32,430,057

·      Compressed In Links Index (InLinks.wks) – 8,356,282

·      Out Links Documents List (Documents.outlinks) – 32,416,394

·      Compressed Out Links Index (OutLinks.wks) – 7,857,199

The total amount of space used for the above files is 1,124,005,875 bytes.  Our compression ratio for the posting list is about 3.4:1.  For the title index, it’s about 3.5:1; for the anchor index, it’s about 6.8:1.

5.2. Capabilities

There is a known bug in the Java API HTMLEditorKit.Parser. Our parser skips HTML documents that are affected by this bug [1]. The complete list of such URLs is stored in the file URLStopList. Our parser uses this list to control which URLs are not supposed to be parsed.

We also faced some problems with very large documents. There is a specific domain that caused us countless problems due to the size of its URLs. We decided to suppress this domain from our collection. Our parser uses the file GroupStopList to control domains that should be ignored.

Despite these limitations, the functionality of each component of our architecture works.

5.3. Performance

Parser’s performance depends on the size of the HTML files being parsed. For larger documents, the parser will take longer to execute.

On average, on xena, it took us 20 minutes to parse 10,000 files using the max_num_of_docs=100. We kept this parameter low to reduce overhead in case of a crash (especially keeping in mind the bug [1]).

Also on xena, the retrieval process (including loading the dictionaries, navigating the interface, running a query, printing the results, and exiting) takes about 30 seconds (tested with the query “britain”).

 

6.  Documentation

6.1 Design document

http://wwwtest.cse.lehigh.edu/~ffp206/WeKanSerch.htm

 

6.2 API documentation

http://wwwtest.cse.lehigh.edu/~ffp206/WeKanSerchAPI.htm

 

6.3 Implementation document

http://wwwtest.cse.lehigh.edu/~ffp206/Implementation.htm

 

6.4 Javadoc

http://wwwtest.cse.lehigh.edu/~ffp206/doc/

 

 

7. References

[1]  Sun Developer Network, http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6506086, updated 2006-12-19