A SWAT team draws power rapidly and accurately from a subset of the internet.
Imagine getting good Internet search results from an imprecise query like “What are the trends in Pennsylvania’s economy?” Or empowering a software agent to scour multiple airline databases and decide the best flight based on your preferences—and suggest things you might like to do at your destination.
That is the potential of the semantic web, a subset of the web that has been augmented to allow computers to make implicit connections between data.
“The semantic web synthesizes information from across the web to bring more power to us,” says Jeff Heflin, associate professor of computer science and engineering.
Heflin has been “working in this field as long as it has been a field.” He investigated knowledge representation on the web as a graduate student at the University of Maryland in the late 1990s, before the seminal Scientific American article by Sir Tim Berners-Lee and others introduced the field to the wider world in 2001.
The nearly one billion sites on the World Wide Web are created with hypertext markup language (HTML), which specifies the layout and style of web pages. Search engines like Google and Yahoo look for keywords in the text and generate good matches if you have queried for the right terms. But, Heflin asks, “What do we do if we don’t know how to phrase the question precisely?”
The semantic web will help bridge that gap. Code written to understand data formats can search and find relationships between facts in publicly accessible databases that don’t show up in web searches. Governments, research agencies and corporations have made some 60 billion “facts” available on the web, Heflin says, and the number is growing. There are also commercial and personal data from websites, merchants and social media sites that are not free but may be sold for others to analyze.
Languages such as Resource Description Framework (RDF) and Web Ontology Language (OWL) represent data in graphs—bits of information linked by specific relationships. For example, John’s personal site URL might be linked by a property “last name” to a field of data “Doe,” or by “gender” to “male.” Software can traverse the links between many databases to reason that “Jane” is his “sister” or “Lehigh” is his “employer.”
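The graph idea can be sketched in a few lines of Python. This is only an illustration, not RDF or OWL syntax: the example.org addresses, the property names and the "sister" rule are all assumptions made up for the example. Each fact is a (subject, property, value) triple, and a simple rule follows the links to infer a relationship that no single fact states.

```python
# Hypothetical facts, stored as (subject, property, value) triples.
# The URLs and property names are invented for illustration only.
TRIPLES = {
    ("http://example.org/john", "lastName", "Doe"),
    ("http://example.org/john", "gender", "male"),
    ("http://example.org/john", "hasParent", "http://example.org/mary"),
    ("http://example.org/jane", "firstName", "Jane"),
    ("http://example.org/jane", "gender", "female"),
    ("http://example.org/jane", "hasParent", "http://example.org/mary"),
}


def objects(subject, prop):
    """Return every value linked to `subject` by the property `prop`."""
    return {o for s, p, o in TRIPLES if s == subject and p == prop}


def sisters(person):
    """Assumed rule: a sister is another female subject sharing a parent."""
    parents = objects(person, "hasParent")
    found = set()
    for s, p, o in TRIPLES:
        if p == "hasParent" and o in parents and s != person:
            if "female" in objects(s, "gender"):
                found.add(s)
    return found


if __name__ == "__main__":
    print(objects("http://example.org/john", "lastName"))
    print(sisters("http://example.org/john"))
```

In practice the triples would come from many separate databases, which is why matching identifiers across sources matters so much.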
With tens of thousands of bits of data on the web for any individual or entity, finding the needle in this haystack is a lengthy, computing-intensive process. Heflin and his team at the Semantic Web and Agent Technologies (SWAT) Lab are developing algorithms and code that can make connections rapidly and accurately.
The researchers scale algorithms and code to work quickly across large data sets. Automatically determining matches between identifiers in different databases is another area of interest. And they design interfaces that allow ordinary users to effectively search the semantic web. The semantic web’s parent, artificial intelligence, uses very small data sets of a few thousand items, Heflin says.
“I want to do accurate searches at the scale of billions of items. I focus on techniques that take a very large subset of the web and rapidly find matches among identifiers.” Success in this field is measured by precision (how many of the data links that you find are on topic?) and recall (how many of the links on a given topic do you find?).
“There are two ways you can be wrong,” Heflin says. Finding one accurate match is precise but shows low recall. Returning a broad swath of links may have high recall but is imprecise. The trick is finding the sweet spot.
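The two ways of being wrong can be made concrete with a short sketch. The link names below are placeholders invented for the example; the formulas are the standard definitions of precision and recall.

```python
def precision_recall(found, relevant):
    """Return (precision, recall) for a set of found links.

    precision: share of the found links that are on topic.
    recall: share of the on-topic links that were found.
    """
    found, relevant = set(found), set(relevant)
    correct = len(found & relevant)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four links are truly on topic.
RELEVANT = {"a", "b", "c", "d"}

# One accurate match: perfectly precise, but low recall.
NARROW = precision_recall({"a"}, RELEVANT)

# A broad swath of links: full recall, but half are off topic.
BROAD = precision_recall({"a", "b", "c", "d", "e", "f", "g", "h"}, RELEVANT)

if __name__ == "__main__":
    print(NARROW)
    print(BROAD)
```

The narrow search scores a precision of 1.0 with a recall of 0.25, while the broad search scores a precision of 0.5 with a recall of 1.0.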
Data on the semantic web can be represented in graphs that display the links between information. In the case of John Doe, his web address and “Doe” can be seen as two ovals, linked by a line representing the property “last name.” A single record in a database can have many properties.
To help make this web of data more useful, Heflin’s team focuses on scalability—quickly filtering out data properties that do not match.
To find “high co-occurrence” between the values that populate these properties, Heflin compares graphs about the individuals in question. (A key contributor to this effort, Dezhao Song ’13 Ph.D., is now with Thomson Reuters.) First name fields have many recurring values. ZIP codes and Social Security numbers have common properties as number strings of certain sizes.
With billions of pieces of data across the semantic web, finding links can require making a billion billion comparisons, eating up time and computing resources. Heflin leverages a search engine concept known as an “inverted index”—a list of all the values for a given property—to quickly filter out data pairs that are not likely to match. No other research teams, he says, have tried this technique of instance matching “at such a large scale.”
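To see why filtering helps, consider that comparing every pair among n items takes n(n-1)/2 comparisons, which for a billion items is roughly 5 x 10^17. The sketch below shows the general idea of using an inverted index to avoid most of that work. It is not the SWAT Lab's algorithm: the records, the word-level tokenizing and the `min_shared` threshold are assumptions chosen for illustration. Each word is mapped to the records that contain it, and only records sharing enough words are kept as candidate pairs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records from three different databases.
RECORDS = {
    "db1:p1": {"name": "John Doe", "employer": "Lehigh University"},
    "db2:x9": {"label": "John A Doe", "works_at": "Lehigh"},
    "db3:q4": {"name": "Maria Silva", "employer": "Acme Corp"},
}


def build_inverted_index(records):
    """Map each lowercase word to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, properties in records.items():
        for value in properties.values():
            for token in str(value).lower().split():
                index[token].add(record_id)
    return index


def candidate_pairs(index, min_shared=1):
    """Return record pairs that share at least `min_shared` distinct words."""
    shared = defaultdict(int)
    for ids in index.values():
        # Only records listed under the same word are ever paired.
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


INDEX = build_inverted_index(RECORDS)
CANDIDATES = candidate_pairs(INDEX, min_shared=2)

if __name__ == "__main__":
    print(CANDIDATES)
```

Here only one of the three possible pairs survives the filter, so a slower, more careful comparison would need to run on that pair alone.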
By looking for strings that are comparable in size or have a lot of shared information, the technique also has the benefit of being domain independent. “It does not require any specialized knowledge about the data topic to work,” Heflin says.
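A domain-independent comparison of this kind might look like the following sketch. The specific measures are assumptions, since the article does not name them: here "comparable in size" is a ratio of string lengths, "shared information" is the Jaccard overlap of two-character sequences, and both thresholds are arbitrary.

```python
def char_bigrams(text):
    """Return the set of two-character sequences in a lowercase string."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def similar(a, b, min_length_ratio=0.5, min_jaccard=0.5):
    """Rough test of whether two values could describe the same thing.

    Uses only string length and shared characters, so it needs no
    knowledge of what the values mean.
    """
    if not a or not b:
        return False
    # Cheap filter first: strings of very different sizes are rejected.
    if min(len(a), len(b)) / max(len(a), len(b)) < min_length_ratio:
        return False
    grams_a, grams_b = char_bigrams(a), char_bigrams(b)
    if not grams_a or not grams_b:
        # Single-character strings have no bigrams; compare directly.
        return a.lower() == b.lower()
    jaccard = len(grams_a & grams_b) / len(grams_a | grams_b)
    return jaccard >= min_jaccard


if __name__ == "__main__":
    print(similar("Jon Doe", "John Doe"))
    print(similar("Doe", "Lehigh University"))
```

The same function works unchanged on names, product titles or chemical labels, which is the sense in which such a filter is domain independent.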
The SWAT system’s scalability and improved candidate matching get world-class results.
Each year researchers compare their algorithms through the Ontology Alignment Evaluation Initiative (OAEI). Performance of semantic web software is measured in F-scores, which combine ratings of the code’s precision and recall. Heflin’s team has not entered, but it has compared its systems to the top contenders and other published results.
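The article does not say which F-score variant is used. A common choice, assumed here, is the F1 score, the harmonic mean of precision and recall, which is high only when both are high. The `beta` parameter is part of the standard general formula and weights recall more heavily when it is greater than 1.

```python
def f_score(precision, recall, beta=1.0):
    """Combine precision and recall into a single F-score.

    With beta=1 this is the F1 score, the harmonic mean of the two.
    """
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # A precise but narrow result, then a broad but imprecise one.
    print(round(f_score(1.0, 0.25), 3))
    print(round(f_score(0.5, 1.0), 3))
    # A balanced result scores higher than either extreme.
    print(round(f_score(0.8, 0.8), 3))
```

Because the harmonic mean is pulled toward the smaller of the two numbers, a system cannot earn a high F-score by maximizing precision or recall alone.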
“Our software was twice to an order of magnitude better than other systems,” he says. “We were getting comparable or better accuracy, with significantly faster times. Using a data set of 50,000 item descriptions drawn from real world Linked Data, we achieved F-scores that were over 20 percent better than the next best system, and we did it 10 times more quickly. This data set includes a diverse array of items—people, places, books, products, chemicals, proteins and more.”