HARNESSING THE POWER OF AN OCEAN OF DATA
The save for the final game of the 2014 World Series went to the San Francisco Giants’ pitching ace, Madison Bumgarner, but it took a software program to haul his bacon out of the fire.
The Giants were leading the Kansas City Royals 3-2 in the bottom of the fifth inning when they sent Bumgarner, their stellar starter, to the mound in relief. But Bumgarner gave up a leadoff hit, and the runner advanced to second base. With one out, Nori Aoki stepped up to the plate.
On a 2-1 pitch, Aoki sliced a line drive toward the corner in left field. It looked like a potentially game-tying base hit, but as the camera tracked the ball in flight, left fielder Juan Perez appeared out of nowhere, loped toward the foul line, and made a routine catch.
The TV analyst praised Perez but noticed something else. “Great jump, but look where they had him sitting. This ball off the bat looked like a double in the corner.”
Ben Jedlovec, watching the game, smiled to himself. Jedlovec is president of Baseball Info Solutions, a leading baseball data and analytics company. Among his clients are 22 of the 30 Major League Baseball teams, including the Giants. Jedlovec, an adjunct professor in the College of Business and Economics (CBE), also teaches a popular class in sabermetrics at Lehigh.
“We have a product that shows that when Aoki hits the ball to the outfield,” says Jedlovec, “he tends to slap it shallow and to the opposite field. That’s why Perez was positioned way over near the line. If he’s in a regular position on that play, it’s a double and the run scores standing up. Instead, Perez makes it look easy.”
The Giants won the game, 3-2, and the World Series, 4 games to 3.
The story of last year’s Game 7, says Dan Lopresti, illustrates one of the countless ways that data analytics is solving modern problems. Advertising and crop watering, cancer treatment and epidemiology, Internet searching and climate forecasting are a few of the other applications. Indeed, says Lopresti, chair and professor of computer science and engineering, almost every area of modern life has benefited from the ability to interpret and harness the terabytes of data generated by digital technology.
Lopresti, an expert in pattern recognition, bioinformatics and computer security, is also director of Data X, a new initiative that will infuse the teaching of computer and data science into all the university’s colleges.
Launched earlier this year, Data X will initially support the hiring of 18 new faculty members in three colleges—12 in computer science and engineering and six more with expertise in consumer analytics, digital media and bioengineering.
The digitization of modern life
Data X will also help Lehigh meet growing student demand for courses in computer science and engineering, where enrollment has risen by 163 percent in five years.
“During my senior year, I took a data mining course with Prof. Lopresti,” says Glenn DuPaul ’14, an economics major. “There were students from across Lehigh’s majors; every seat was taken. We had to find a larger classroom. Data science and analytics is growing quickly, as is demand among students. People are saying that data is this generation’s oil.”
At the base of the new initiative, says Lopresti, is the explosion of data brought about in the last two decades by the revolution in digital technology and the Internet.
Data, says Lopresti, is generated by myriad devices — smartphones, tablets, PCs, traffic sensors, security cameras — that are connected to the Internet. Data mining, machine learning and search engines make data accessible; imagination and the ability to interpret that data are the key to transforming it into valuable, usable information.
“Many areas of life––from entertainment to TV, from music to our social interactions, things not formerly regarded as data––are now digitized and converted into bytes,” says Lopresti.
“In the early days of the Internet, only a few computers were connected. Then it was millions. Now, we’re approaching the point where every device imaginable will soon be connected to the Internet. This is creating an ocean of data.”
It is also creating an opportunity—and an obligation—for the 21st-century university. According to government estimates, 1.4 million new jobs in computing will open by 2020, but only 400,000 college graduates will be qualified to fill them.
“Students today need data smartness as well as computational thinking to take advantage of this new mass of data,” says Lopresti. “They need to understand data and how it can be useful. They need to understand algorithms, Internet connectivity and machine learning that can aid in analyzing this complicated, messy, incomplete data in real time. And they need the critical thinking skills that a top-notch liberal arts university like Lehigh can provide.”
The data wave is creating demand for a new kind of interdisciplinary curriculum that combines data literacy with proficiency in other disciplines, says Lopresti.
“The sheer pervasiveness of technology has created enormous opportunity, along with a high degree of complexity, in nearly every discipline,” says Lopresti. “You can no longer be educated in any one area without understanding the impact and the importance of data analytics in that field.”
At the same time, he adds, “It is becoming much easier to incorporate computer science into other disciplines. It’s easier to use programming and other computing techniques. And young people are growing up using computers everywhere for everything.”
Natural places for collaboration
Data X is designed to have a particularly positive impact on three academic programs—bioengineering, marketing, and journalism—while bolstering Lehigh’s computational research and educational capabilities.
Anand Jagota, director of the bioengineering program, says data literacy is helping to spur innovation in the field. “Point-of-care and diagnostic devices will lead to a new way of making diagnoses more personal, while hopefully reducing costs,” says Jagota. “To do this, we need to deal with big streams of data.
“In addition, studying the properties of cell structure, and how molecules interact and fold are basic questions in bioengineering. The theoretical study of these phenomena requires very large computations and data visualization, so it’s a natural place for computer science and bioengineering to collaborate.”
Xiaolei Huang, associate professor of computer science and engineering, and Chao Zhou, assistant professor of electrical and computer engineering, analyze breast tissue images using optical coherence microscopy and tomography to produce computer-aided diagnoses. Their analysis is data-intensive, automated and designed to provide real-time information to help surgeons minimize the tissue they remove while operating on cancer patients.
“The process takes a large number of images, and labels the types of tissue in the sample,” says Huang. “For every pixel in the image, we know whether it is fat, carcinoma, etc. In addition, we extract thousands of different features that can be present in the image, such as texture, color or local contrast, and we use a machine learning algorithm to select which features are the most discriminating.”
The results, she says, are markedly superior to the visual diagnostic review that doctors generally use with current medical imaging.
Javier Buceta, associate professor of chemical and biomolecular engineering, and Paolo Bocchini, assistant professor of civil and environmental engineering, are developing a stochastic model to forecast the probable spread of Ebola. Buceta is an expert in modeling biological systems; Bocchini focuses on the response of infrastructure to earthquakes, fires and other disasters.
“Although Paolo and I have very different outcomes in our individual research,” says Buceta, “we speak the same mathematical and computational language.”
Using functional quantization, a tool initially designed to track the stock market, the researchers are creating a “hazard map” that quantifies the probability of Ebola outbreaks at specific locations in a vast geographical area of Africa. Their goal is to help authorities react quickly to Ebola’s spread and concentrate available resources where they are most effective.
Data analytics offers new insights into marketing, says Geoffrey Colon ’94, communications designer and social data expert at Microsoft.
“We don’t make things in a silo anymore and just unleash them on customers,” says Colon. “I can look at 50,000 conversations that people have had about Microsoft products, all packed into a sort of spreadsheet. There’s humanity behind all that, and we have entirely new ways of going to market with products.”
Colon notes that in a recent Fortune list of tech startups valued at over $1 billion, three of the top 10 were founded by people from non-tech fields, such as art and design or social theory.
“Data is crossing into all majors and disciplines –– the humanities, political science and product development. Data X is exactly where education needs to move.”
In journalism and communication, says department chair Jack Lule, data science is giving the art of storytelling a makeover.
“When people think of journalism, they usually think of storytelling with words and photographs,” says Lule. “Computer science now provides insights into every aspect of the storytelling chain. It shapes how we find, gather, analyze, present and distribute information. Data X will allow us to explore how journalists can tell stories better with data.”
One example of this enhanced storytelling is “Snow Fall: The Avalanche at Tunnel Creek,” an article published on the New York Times website in 2012. “Snow Fall” weaves text, video, interactive diagrams and photographs together to tell of an ill-fated, backcountry skiing expedition in the Cascades. Graphic elements arise effortlessly for the reader, adding multiple dimensions to the story.
Lule likes to tell students about “mathletes” who bring data-rich insights to reporting. The writer-statistician Nate Silver, for example, used meta-analysis of public polling to predict the winner of every single state in the 2012 presidential election.
“We refer to journalists trained in coding and data as ‘unicorns,’” says Lule. “They are rare but they do exist, and they transform newsrooms. We want to produce a generation of unicorns to support the world’s newsrooms with Data X.”
Michael Spear, assistant professor of computer science and engineering, studies the hardware and software systems that underpin data analytics and make up part of Data X.
“Anytime you see the term ‘Big Data,’ understand that beneath that you need a very complex hardware and software system to support it,” says Spear, whose work in transactional memory employs innovative data management to speed up processing in computing systems.
“What distinguishes the products that change the world,” says Spear, “is their software. Their software makes them real-time and interactive. It customizes itself to the way you work and it connects you to the world and leverages all the data that’s available.
“The systems that we build are going to need hardware and software that can process data, scale and interact. By learning to build these systems, Lehigh students are going to be able to shape future technology in any field they choose.”
In almost every conceivable industry sector and discipline, the world is waking up to the power of data––and Lehigh alumni are already leading the way.
Glenn DuPaul, who interned his junior year with Jedlovec, recently became the first director of basketball analytics for the NBA’s Brooklyn Nets. Previously, he worked in sabermetrics for the Kansas City Royals.
“The data set we have access to takes a snapshot of the court 25 times per second, and tracks player and ball movement and discrete events like dribbles, passes, shots and screens,” says DuPaul.
“My main job is turning raw data into information that can inform objective decisions to assist our staff, including the general manager, coaches, even the strength and conditioning staff. Our process is driven by what question we want to answer.”
Brian Davison, associate professor of computer science and engineering, trains his students in large-scale data analysis, including recommender and prediction systems, as well as filtering.
“Three of my students recently went to Yahoo! Labs,” says Davison, who spent a sabbatical with the data science group at Facebook two years ago. “Others have gone on to become research scientists at Google and Microsoft.”
Kathleen Egan ’90, vice president with retail-analytics provider Quri, has advanced from the early years of information technology to the onset of the Internet to the latest wave of mobile digital technologies.
“Mobile devices represent another whole wave of data,” says Egan, “with geopositioning, people on their devices much more frequently, so many apps open all at once, and personalized information about customers becoming available. What we’re seeing now is more sensors, in stores, in people’s everyday lives.
“There’s a vast ocean of data, and it’s getting bigger every day. The trick is to figure out what to do with it.”
The interdisciplinary edge
Lehigh’s tradition of interdisciplinary education fosters the creativity necessary to use data analytics effectively, says David Griffith, chair of the marketing department in the CBE.
“There’s a creativity aspect evolving with Big Data,” says Griffith. “You’re dealing with unstructured data, and for that you need creative thinkers with a fundamental understanding of what data is and how it can be used by businesses to improve marketing, branding and customer engagement.
“I think that’s what we do here at Lehigh. We train people in interdisciplinary ways so that they’re able to pull all sorts of ideas together and look at a problem through different lenses.”
The university’s academic leaders agree.
“Data X reflects and elaborates on what I consider the signature strength of Lehigh –– its commitment to interdisciplinary education,” says Donald Hall, the Herbert J. and Ann L. Siegel Dean of the College of Arts and Sciences.
“This initiative will allow our students and faculty to explore the exciting synergies between and among their disciplinary homes and the field of data science — where new knowledge is being generated and where jobs are available for students after graduation. I see our college as central to the Data X initiative.”
“The integration of a business college of our caliber with a top engineering college is going to place Lehigh in a niche position in the business school world and Lehigh business students head and shoulders above their peers,” says Georgette Chapman Phillips, the Kevin and Lisa Clayton Dean of the College of Business and Economics.
Lopresti sees a thirst for computational and data analytics skills across the board.
“The Data X initiative is very timely now because of massive demand from students in computer science, in engineering and in all other fields—business, humanities, the sciences,” he says.
“Employers are crawling over each other to find these kinds of graduates because they see what we see: There’s another wave of technological and educational transformation on the horizon. Data X is Lehigh’s opportunity to catch this wave.”
Story by Chris Quirk