1. Introduction
Search for anything using your favorite crawler-based search engine. Nearly instantly, the search engine will sort through the millions of pages it knows about and will present you with ones that match your query. The matches will even be ranked, so that the most relevant ones come first. Our second project consists of extending the functionality of the indexer (project1) and incorporating text-based multi-term retriever and ranking system with a Web (CGI) interface. The remainder of this document is organized as follows: Section 2 presents the indexed data that will be used for retrieval. Section 3 lists the resources chosen for the implementation. Section 4 describes EzSearch architecture. Section 5 lists the changes to the indexer in order to support ranking. Section 6 provides the commands required to run the system. Section 7 illustrates how the team is organized. Section 8 lists the documentation and Section 9 lists the references.
2. Data
We will use the data parsed and indexed by the modified version of the project lswww, which in turn uses a subset of documents from a recent crawl of the UK. The indexed data can be found at /proj/searchengines2/project2/EzSearch/words. We were able to parse and index the terms from 85000 documents.
3. Resources
There is a large variety of programming languages that could be taken in consideration for this project. The following sections list the languages we chose, explain why we chose them.
3.1.
Java
Java will be the core language for our implementation. Our choice was mainly based on the fact that Java was the language used for lswww, the implementation we are going to build upon. We also considered our group’s past developing experience in Java as a factor in our decision.
3.1.1 Libraries
·
Tidy.jar: transform an HTML file into a well formed
HTML file.
3.2.
Perl
For the web interface implementation, we decided that Perl would be the best language for that task as a lot of online support is available for the PERL-CGI script and also, we have a basic interface implemented already as a part of Homework1. The basic interface was properly enhanced in order to implement the server protocol.
The CGI basically displays an
editable field where the user will type the query. It also displays the results
received from the server. See Figure
1for
more details.
4. EzSearch Architecture
4.1. Overall
Architecture

Figure 1 - EzSearch Overall Architecture
4.2.
Persistent Storage
The
index is stored on the disk as follows.
4.2.1. Posting List Structure
|
docID |
d' |
Positions |
|
1 |
10 |
pos1,pos2,pos3 |
|
2 |
20 |
pos1 |
|
3 |
11 |
pos1,pos2 |
|
4 |
12 |
pos5 |
This
structure will be implemented using a 3-level HashMap, docID as the key for the
outermost HashMap, term as the key for the 2nd level HashMap, and
the innermost HashMap containing the posting list for the term that will be
mapped using the field ID as the key.
This
information will be periodically flushed to disk. For each term, a file is
created following lswww[2] implementation.
4.2.2. Documents List Structure
|
docID |
URL |
Title |
dl |
|
1 |
http://java.sun.com/j2se/1.3/docs/api/ |
Title 1 |
100 |
|
2 |
http://www.lehigh.edu |
Title
2 |
51 |
The
documents list will hold the list of URLs processed by the parser.
This
information will be periodically flushed into the file doc_index.txt.
4.2.3. Inlinks File
|
Source Document |
Target Document |
|
1 |
2 |
|
2 |
3 |
The file inlinks.txt is generated
in order to store the page’s inlinks. It is used during inlinks retrieval for a
document.
4.2.4. Outlinks File
|
Target Document |
Source Document |
|
2 |
1 |
|
3 |
2 |
The file outlinks.txt is generated
in order to store the pages outlinks. It is used during outlinks retrieval for
a document.
4.2.5. Average Document Length
File
The avdl file is generated by the
indexer. After parsing every document, the document length is accumulated and
at the end of the entire parsing, this sum is divided by the total number of
documents parsed to obtain the average document length.
4.2. The CGI
Interface
The CGI is a Perl interface that
allows a user to send a query to the search engine and check the results.

Figure 2 - CGI Interface
There are four kinds of possible
queries:
·
Multi-term: query composed of one or more terms where
the order that the terms appear is not relevant. In order to rank these
queries, we are using the BM25F algorithm.
·
Phrase: query composed of multi terms surrounded by
quotes. For such queries we will check term proximity using the positions
within documents in order to calculate relevance.
·
Inlinks: returns the pages that contain links to the
given URL/docID. This kind of query is answered using the inlinks index.
·
Outlinks: returns the pages that the given URL/docID
points to. This kind of query is answered using the outlinks index.
The message sent to the server
has the following format:
<query_type> <query>
4.3. Server
The server keeps an offset to the
documents list in memory using a HashMap where the key is the docID. It also
contains a second HashMap in order to be able to translate a URL to a docID
during the inlink/outlink search.
The server receives a request
from the client (CGI) and calls the retriever module. If the search is for
in/outlinks, the retriever accesses the in/outlinks files and send the results
to the client. If the search is a multi-term or a phrase search, the retriever
will retrieve the documents and use the ranker module to rank the results. The
results are sent back to the client.
The response sent to the client
has the following format:
<URL1>;<Title1>\n<URL2>;<Title2>\n…<URLN>;<TitleN>
4.4. Retriever
The
retriever module will call a specific method for each type of query offered by
the interface.
4.4.1.
Phrase Search
The
user may specify that he/she wishes to have an exact phrase match simply by
putting quotes around the query and selecting query type terms. The Ranker is
responsible to calculate the proximity ranking for phrase search. Please refer
to the ranker implementation for further information.
A
HashMap, with docID as the key, is used to store the document list for each
term in the query. After the document lists for all the terms are retrieved,
the docIDs which do not contain all the terms are removed. The resultant
HashMap has only those docIDs which have all the terms. This set is then
checked to see if all the terms are in the same sequence and at the same
distance from each other as in the query. The documents with a mismatch are
removed from the list. The final list of docIDs is then sorted using the
Ranker.
4.4.2.
Multi-Terms Search
If
quotes are not used for query type terms, the ranking will be done solely based
on the BM25F method. The BM25F algorithm was implemented within the Ranker
Module. Please refer to the ranker implementation for further information.
4.4.3.
Inlinks Search
The
inlinks information is stored in the file inlinks.txt. The retriever reads from
this file and returns the list of inlinks’s URL and title to the CGI.
4.4.4.
Outlinks Search
The
outlinks information is stored in the file outlinks.txt. The retriever reads
from this file and returns the list of outlinks’s URL and title to the CGI.
4.5. Ranker
The ranker uses the BM25F
algorithm to calculate the weights of each of the documents in the resultant
set of documents obtained from the retriever.
4.5.1.
BM25F [1]
BMF25F
is an algorithm to improve web search ranking by incorporating field types of
the text in the documents. Document structure (or fields), such as the title,
headings and the anchor text have been shown to be effective in Web Information
Retrieval. The linear combination of scores, which had been the approach,
mostly used for the combination of fields, is difficult to interpret due to the
non-linear relation between the scores and the term frequencies in each of the
fields. In addition, the length normalization that should be applied to each
field depends on the nature of the field. Zaragoza et al.[1] introduced a
field-based version of BM25, called BM25F, which applies length normalization and
weighting of the fields independently. This is an extension of BM25.
The
table below illustrates when and how we calculate the values necessary to
compose the BM25F ranking score. The phase column lists the phase when the
formula is calculated. The storage points how we store such values during the
process.
|
Phase |
Formula |
Storage |
|
Parser/Indexer |
d’ = ∑ (f=1 to k) vf . d[f] |
Postings List |
|
Parser/Indexer |
dl = document length |
Documents List |
|
Parser/Indexer |
avdl = average doc length |
avdl file |
|
Retriever |
Wj(d,C) and W(d,q,C)[1] |
n/a |
Table 1 - BM25F Calculations and
Storage Requirements
The
indexer was extended so that most of the BM25F values for each term were pre-calculated during the indexing process.
In
addition to using the BM25F algorithm to generate weights for each individual
term when multiple terms are involved, the ranker has an additional method for
handling it. The individual weights for each term are summed together, while
keeping track of the missing terms. At the end of summing the weights, the total
weight is divided by (1+n), where n is the number of terms missing. Instead of
outright eliminating a document that is missing one or more terms in a
multi-term query, it is instead penalized based on how many terms it is
missing. This will result in better recall without too much of a negative
impact on precision. In addition, if one of the terms in the query is not in
any documents, then it will effectively be ignored instead of having the query
return nothing. After the final total score for every document has been
computed, the documents are sorted according to weight with an efficient merge
sort algorithm.
5. lswww
lswww is the implementation that we built upon. We chose this indexer because it stores the field type information of the text being parsed which is a major requirement for implementing the BM25F ranking function, as well as the fact that it is modularized so it will be easy to make modifications to it. It also already supports many of the capabilities that are needed, such as listing of positions of terms.
The main features that will be
added to the lswww indexing are:
·
Creation of inlinks and outlinks files: used to provide
inlink/outlink search based on the anchor text information.
· Addition of the weighted sum of the term frequencies in a document, d’, to the postings list.
· Addition of the term positions to the postings list in order to offer phrase search.
· Modifications to the doc_index file and the term files to accommodate the storage of pre-calculated values of some of the terms required in the BM25F formula.
· Creation of an offset file for the documents list in order to speed up the search by reducing the disk I/O.
6. Running the System
To run the system, the following
steps must be executed:
·
Parser
o
Parse the archive: java EzSearch/ezsearch -build summary0-400.warc.gz
o
Sort links file: sortFiles.sh
·
Start the
server (make sure you use xena to run the server)
o
java EzSearch/Server start
·
Send a
query though the CGI interface
o
http://wwwtest.cse.lehigh.edu/~ffp206/EzSearch.cgi
7. Assignment of Tasks
Every module has a focal point:
the member who was mainly responsible for making sure the tasks were performed
and the best results achieved.
Every member closely participated
in the ranking implementation, what we considered the main module for this
project.
Fabiana Prabhakar
·
Team Leader;
· Focal point for: client, server and javadoc.
Shruti Bhandari
·
Focal point for:
documentation and ranking
Suyun Gu
·
Special focus on the
challenges of multi-term
searching.
8. Documentation
8.1. Design
document
http://www.cse.lehigh.edu/~sug2/ezsearch
8.2. Javadoc
http://wwwtest.cse.lehigh.edu/~ffp206/ezsearch/doc/
8.3. Implementation Document
http://www.cse.lehigh.edu/~ffp206/ezsearch/implementation.htm
9. References
[2] lswww
Implementation Document: http://wwwtest.cse.lehigh.edu/~ffp206/lswww.htm