Luke Hellum presenting a poster describing his semester of research, “Data Cleaning, Deduplication, and Entry Matching in a Carbon Commitment Database,” at the Yale Statistics & Data Science poster session.

On Thursday, May 2, 2019, Luke Hellum, a Yale College data science and statistics and energy studies major, shared the results of his senior capstone research project. Over the past 6 months, he built a statistical R software package that helps clean and organize the vital, messy universe of climate action commitments from cities, regions, companies, investors, and civil society.

Over the past few years, the Data-Driven Lab has built a database of climate action commitments put forward by a wide range of corporate and local government authorities, alongside contextual information, such as these actors’ population or GDP. This database draws from 17 different sources – all providing important data, in a wide array of different forms and formats. Organizing this data unlocks important information about the scope and impact of this rising tide of “bottom-up” climate action – but often demands time-consuming cleaning and validation.

As the database grew and became more complex, matching new information to existing database entries became an increasingly pressing issue. Luke’s semester-long research project sought to improve the database management and entry matching through a preliminary analysis on Python and a collection of R scripts that improve data cleaning and matching. His analysis honed in on the most persistent problems when matching entries (i.e. encoding, language, typos, and uncleaned names) and provided guidance on how to streamline data entry and management. The new R scripts streamline efforts to update and identify data entries, and will allow future datasets to be managed in a more systematic way.

“Cleaning and wrangling data is the first — and often the most difficult — hurdle in conducting data-driven analysis,” said Dr. Angel Hsu, Luke’s capstone advisor and the Director of Data-Driven Lab. “We plan to road-test this tool over the summer, and look forward to sharing the finished version later this year.”

As he shared his research with students, professors, and visitors at the Statistics and Data Science Poster Session, Luke noted the additional challenges and rewards of working with a real-world dataset: “It was a pleasure having the opportunity to work with Professor Hsu and her team on the project. Going into it, I expected to learn some data science techniques, but I also came out of it with an appreciation of how challenging it can be to clean data and effectively incorporate it into a large and complex database.”