UCI ML Repository Highlights Four Impactful Projects at 2022 ML Hackathon

July 6, 2022

The UCI Machine Learning (ML) Repository hosted the 2022 Machine Learning Hackathon from May 18 to May 29. Throughout the hackathon, participants worked with members of the UCI ML Repository team and with its datasets to build creative and meaningful projects. On June 3, hackathon organizers held an awards ceremony to review project submissions and recognize four winning hacks.

Junior computer science major Angel Vilchis.

Overall Best: Personalizing Recommendations Without User Activity
The UCI ML Repository is home to over 600 datasets. There are plenty of ways to filter them, but why not let ML do the work for you? Angel Vilchis, a junior computer science major specializing in intelligent systems, built a model that recommends datasets related to a selected dataset. Vilchis says the model’s recommendations are highly accurate and that it benefits all users browsing the UCI ML Repository.

You first select a dataset in the UCI ML Repository that interests you. Then, you specify how many related datasets you would like recommended. The model ranks datasets by how similar they are to the selected dataset on three measures: characteristics, context and popularity. You can also customize the model to prefer one similarity measure over the others.
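The article does not describe how the model combines the three measures, so the following is only a minimal sketch of one plausible approach: a weighted average of per-measure similarity scores, followed by a top-k ranking. The names `Similarity`, `combined_score` and `recommend`, the weighting scheme, and the assumption that each score lies between 0 and 1 are all hypothetical and are not taken from Vilchis's project.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Similarity:
    """Hypothetical similarity scores (assumed 0..1) between a candidate
    dataset and the dataset the user selected."""
    characteristics: float
    context: float
    popularity: float


def combined_score(sim: Similarity, weights: tuple[float, float, float]) -> float:
    """Weighted average of the three measures (an assumed scheme)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    w_char, w_ctx, w_pop = (w / total for w in weights)
    return (w_char * sim.characteristics
            + w_ctx * sim.context
            + w_pop * sim.popularity)


def recommend(
    similarities: dict[str, Similarity],
    k: int,
    weights: tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> list[str]:
    """Return the names of the k candidates with the highest combined score."""
    ranked = sorted(
        similarities,
        key=lambda name: combined_score(similarities[name], weights),
        reverse=True,
    )
    return ranked[:k]
```

Under these assumptions, calling `recommend(scores, k=5, weights=(2.0, 1.0, 1.0))` would return five dataset names while counting characteristics twice as heavily as context or popularity. Equal weights treat all three measures the same.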

Computer Science Professor Sameer Singh with the SEW.NLP team.

Overall Runner Up: SEW.NLP – NLP for Dataset Parsing
Knowing the context in which data was collected is important and can help determine whether it’s suitable for your needs. To better understand datasets, the SEW.NLP team created a question-answering NLP model. They used SciBERT and XLNet models and the QASPER dataset to extract information from scientific papers about datasets in the UCI ML Repository.

SEW.NLP was created by:

  • Edoardo Botta – senior economics and computer science major, Università Bocconi
  • William Han – senior psychological science major, UCI
  • Sanay Talsania – senior business information management major, UCI

Most Creative: UnlimitedMonsterLearning – Automatic Statistician
What does it mean for a dataset to be “good”? UnlimitedMonsterLearning strives to answer that question and to address related ethical concerns about dataset quality. The team evaluated the quality of datasets using statistical parameters and analyzed how dataset popularity varies with a range of statistical qualities.

Yiqin Chen and Hao Li from UnlimitedMonsterLearning.

UnlimitedMonsterLearning was created by: 

  • Yiqin Chen – junior business information management and data science major, UCI
  • Hao Li – junior mathematics major and statistics minor, UCI

The four students who created Team Untitled.

Most Impactful: Team Untitled – Search
Team Untitled’s goal is to improve and refine the process of searching for information in datasets using NLP. The team combined latent Dirichlet allocation and latent semantic analysis modeling techniques to find the most relevant words in datasets. These words help expand search queries and surface the most relevant datasets.

Team Untitled was created by: 

— Karen Phan