Modern web search engines are a complex set of modules that interact in order to give the user relevant results in a reasonable amount of time. Our first project consisted of building a parser and an indexer for our WeKanSerch search engine. The remainder of this document is organized as follows: Section 2 presents the data that was processed and indexed. Section 3 lists the resources we chose for the implementation and why. Section 4 describes WeKanSerch architecture. Section 5 details our project’s results, such as its memory and disk usage and its performance. Section 6 lists the documentation, and Section 7 lists the references.
We used a subset of documents
from a recent crawl of the UK. The compressed file can be found in
/proj/searchengines/data/summary0-400.warc.gz which accounts for 440k
documents.
The following sections list the
languages we chose, explain why we chose them, and list which libraries we used.
3.1.
Java
Java was the core language for our implementation. Our choice was mainly based on the fact that Java offers great HTML and API implementation support. Its well-designed, intuitive set of APIs promised to considerably reduce development time, also leading to a better code with fewer bugs than other platforms. We also considered our group’s past experience developing in Java as a factor in our decision.
The Java classes do all the major
processing of the HTML documents.
They extract words for our indices and save them into files. After all the files are created, the
indices are compressed using Elias’s Gamma Code (see
Section 4.7). Also, the API, which
interfaces with all the files, is implemented in Java (see Section 4.8). Finally, Java provides the test program
for the API.
3.1.1. Libraries
·
The Swing HTML Parser
The
HTMLEditorKit class offers an efficient set of methods for HTML parsing
support.
·
HashMap Class
The
HashMap class will be used as the main structure to manipulate the index and
posting list information.
· BitSet
Class
The
BitSet class implements a vector of bits that grows as needed. This class will
be used for bit manipulation during compression/de-compression of the posting
list.
3.2.
Perl
For the data extraction from the web archived files, we decided that Perl would be the best language for that task due to its superiority in text processing. It has built-in support for regular expression matching, allowing for a simple method of obtaining the essential content of the HTML files.
Our Perl script acts as a tool
for the main Java indexer; the script analyzes the web archived file and
returns the relevant URLs and hypertext to the Java program. We run the Perl script in parallel with
the Java program. The two
communicate with each other through a socket implementation through a specified
port (we used 12345). The output
from the gzcat utility (used on the summary0-400.warc.gz file) is piped into
the Perl script, which treats it as standard input. This allowed for simple processing of the file. The script waits for messages from the
Java program. Messages start the
script, tell it to send the next processed HTML section, and save its
progress. The script stops when it
is out of input and sends a message to the Java program to indicate as much.
3.2.1 Modules
·
IO::Socket
This module was key for socket
communication with the Java indexer.
4.1.
Dictionary Structure
|
TermID |
Term |
|
1 |
Term1 |
|
2 |
Term2 |
The
dictionary will hold the list of terms generated by the parser.
This
information will be periodically flushed into the file Dictionary.
4.2. Global
Posting List Structure
|
docID |
termID |
Positions |
|
1 |
1 |
pos1,pos2,pos3 |
|
2 |
1 |
pos1 |
|
3 |
2 |
pos1,pos2 |
|
4 |
2 |
pos5 |
This
structure will be implemented using HashMaps, with the term as the key and the
remaining information stored in an object that will be mapped using this key.
This
information will be periodically flushed into the file PostingList.
4.3. Title
Posting List and Anchor Posting List
|
docID |
termID |
Frequency |
|
1 |
1 |
10 |
|
2 |
2 |
20 |
These
structures will be implemented using HashMaps, with the term as the key and the
remaining information stored in an object that will be mapped using this key.
This
information will be periodically flushed into the files TitlePostingList and
AnchorPostingList respectively.
4.4. Documents
List Structure
|
docID |
URL |
|
1 |
http://java.sun.com/j2se/1.3/docs/api/ |
|
2 |
http://www.lehigh.edu |
The
documents list will hold the list of URLs processed by the parser.
This
information will be periodically flushed into the file DocumentsList.
4.5 Links File
|
Source Document |
Target Document |
|
1 |
2 |
|
2 |
3 |
Whenever the HTML Parser finds a
link, this file will be updated, storing which documents are linked.
4.6. Parsing
the Archives, Building the Dictionary of Terms and the Posting List
The following pseudo-code
illustrates the flow the system will follow in order to parse the files and create
the index and posting list.
INPUT_FILE = /proj/searchengines/data/summary0-400.warc.gz;
STOP_LIST; //terms that we do NOT want to index;
WHILE (!done) {
Call Perl script to get file from
INPUT_FILE
Insert Document to Documents List
Structure;
FOR each term {
IF (term is in
an anchor text) {
LINK=the
anchor text link;
IF
(LINK in the Document List) {
DOC_ID
= Document List id;
}
else {
DOC_ID = new
id;
Insert
Document to Documents List Structure;
}
Insert relation into
anchor posting list;
Insert LINK
into links file;
}
IF (term is in
a title){
Insert
relation to the Title Posting List;
}
IF (term in
STOP_LIST) then {
break;
} ELSE {
IF (term
already indexed) then {
Update index
entry for the given term;
Update posting list;
} ELSE {
Add new term
to index;
Add new term
to posting list;
}
//end IF-then-ELSE
}
//end ELSE
} //end FOR each term
} // end WHILE
4.7. Creating
the Compressed Postings List
The
following list represents the steps we implemented in order to create the
compressed posting list.
1.
Read a term from the posting
list
2.
Update the offset fields in
the dictionary file to point into the compressed index
3.
Use Elias’s Gamma Code to
compress the information. We are adding a marker sequence of ‘01’ (zero one)
between each document as a delimiter.
01<doc1pos1…posN>01<doc2pos1…posN>01…01<docMpos1…posN>01
4.8. The API
The API
offers the interface for the search engine. Given a term, the API interacts
with the application in order to find that term in the compressed posting list
and return the results.
4.8.1 API Functions
·
getDocID(String URL);
This method is used to retrieve the document ID for a given URL.
·
getDocURL(int docID);
This method is used to retrieve the URL for a given document ID.
·
getTermID(String term);
This method is used to retrieve the termID for a given term.
· getTerm (int termID);
This method is used to retrieve the term for a given termID.
· searchByTerm (int IndexToSearch, String term);
This method is used to retrieve documents that contain a certain term.
· searchByTermID (int IndexToSearch, int termID);
This method is used to retrieve documents that contain a certain term ID.
· LinksIntoDocument (String URL);
This method is used to retrieve
information about the documents that link to the given document (specified by
the URL input parameter).
· LinksIntoDocument (int DocumentID);
This method is used to retrieve information about the documents that link to the given document (specified by the Document ID input parameter).
· LinksOutOfDocument (String URL);
This method is used to retrieve information about the documents that go out of the given document (specified by the URL input parameter).
· LinksOutOfDocument (int DocumentID);
This method is used to retrieve information
about the documents that go out of the given document (specified by the
Document ID input parameter).
For more information about the
API and how to use it, see Section 6.2 (the
API document has been updated since when we turned it in).
4.9. Overall Architecture
Our parser and indexer will
consist of four main components.
The WarcParser component reads a web archive file and parses out each
set of HTML code for each document.
It then passes the URL of each document and the corresponding HTML text
to the HTMLParser component. The
HTMLParser component parses all of the text passed to it by the
WarcParser. New terms are inserted
in a term dictionary and entries are added to the uncompressed indices. The indices are then handled by the
IndexCompressor. This component
applies a compression algorithm, such as Elias’s Gamma Code, to the indices and
produces the compressed index files.
The final component receives request via a published API and utilizes
the compressed indices for term lookups.
See the diagram below for an illustration of this architecture.

4.10. Running the System
To run the system, the following steps must be executed:
The parser is incremental.
Whenever the parser is restarted, it starts from the last HTML document parsed
and saved to the index. In order to start a new index, the file uid.txt must be
deleted along with all the index files previously generated.
A cleanup and sorting script is
then run on each of the resulting files called cleanupFiles.sh and the
IndexCreator program is run on each index to create the final index format and
the dictionary files to hold the offsets into the eventual compressed files.
Lastly, the Compressor program is
run on each index to convert it into its encoded form.
The APITester program will allow
a user to test the API functionality.
5.1. Memory and Disk Usage
Parser keeps information for a limited number of HTML documents in the memory (parameter max_num_of_docs). The disk contains the complete information necessary for retrieval. The index is stored using the following files (all values are in bytes):
· Main Dictionary (Dictionary.dict) – 13,988,787
· Main Document List (Documents.doc) –
31,087,525
· Uncompressed Title Index (TitlePostingList) – 8,101,683
· Compressed Title Index (TitlePostingList.wks) – 2,280,965
· Title Dictionary (Dictionary.title) – 768,197
· Uncompressed Posting List (PostingList) – 459,849,381
· Compressed Posting List (PostingList.wks) – 135,954,222
· Posting Dictionary (Dictionary.posting) – 22,928,755
· Uncompressed Anchor Index (AnchorPostingList) – 306,189,811
· Compressed Anchor Index (AnchorPostingList.wks) – 44,943,977
· Anchor Dictionary (Dictionary.anchor) – 16,852,640
· In Links Documents List (Documents.inlinks) – 32,430,057
· Compressed In Links Index (InLinks.wks) – 8,356,282
· Out Links Documents List (Documents.outlinks) – 32,416,394
·
Compressed Out Links Index (OutLinks.wks) –
7,857,199
The
total amount of space used for the above files is 1,124,005,875 bytes. Our compression ratio for the posting
list is about 3.4:1. For the title
index, it’s about 3.5:1; for the anchor index, it’s about 6.8:1.
5.2. Capabilities
There is a known bug in the Java API HTMLEditorKit.Parser. Our parser skips HTML documents that are affected by this bug [1]. The complete list of such URLs is stored in the file URLStopList. Our parser uses this list to control which URLs are not supposed to be parsed.
We also faced some problems with very large documents. There is a specific domain that caused us countless problems due to the size of its URLs. We decided to suppress this domain from our collection. Our parser uses the file GroupStopList to control domains that should be ignored.
Despite these limitations, the functionality of each
component of our architecture works.
5.3. Performance
Parser’s performance depends on the size of the HTML files being parsed. For larger documents, the parser will take longer to execute.
On average, on xena, it took us 20 minutes to parse 10,000 files using the max_num_of_docs=100. We kept this parameter low to reduce overhead in case of a crash (especially keeping in mind the bug [1]).
Also
on xena, the retrieval process (including loading the dictionaries, navigating
the interface, running a query, printing the results, and exiting) takes about
30 seconds (tested with the query “britain”).
6.1 Design document
http://wwwtest.cse.lehigh.edu/~ffp206/WeKanSerch.htm
6.2 API
documentation
http://wwwtest.cse.lehigh.edu/~ffp206/WeKanSerchAPI.htm
6.3 Implementation document
http://wwwtest.cse.lehigh.edu/~ffp206/Implementation.htm
6.4 Javadoc
http://wwwtest.cse.lehigh.edu/~ffp206/doc/
[1] Sun Developer Network, http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6506086, updated 2006-12-19