Projects
-
CORDS - Data Subset Selection
Reduce end-to-end training time from days to hours, or from hours to minutes, using coresets and data subset selection. CORDS implements a number of state-of-the-art data subset selection and coreset algorithms, including GLISTER, GradMatchOMP, GradMatchFixed, CRAIG, SubmodularSelection, and RandomSelection. These efficient implementations can significantly reduce training costs while maintaining model accuracy.
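To make the idea concrete, here is a minimal sketch of the simplest baseline in this family, random subset selection: pick a fixed fraction of the training indices and train only on those. This does not use the CORDS API; `random_subset` is a hypothetical helper written for illustration.

```python
import random

def random_subset(dataset_size, fraction, seed=0):
    """Pick a random fraction of training indices (hypothetical helper,
    illustrating the simplest data-subset-selection baseline)."""
    rng = random.Random(seed)
    k = max(1, int(dataset_size * fraction))
    # Sample without replacement, then sort so the subset can be used
    # as an index list into the original dataset.
    return sorted(rng.sample(range(dataset_size), k))

# Train on 10% of a 1000-example dataset:
subset = random_subset(dataset_size=1000, fraction=0.1)
```

Smarter strategies such as GLISTER or GradMatch replace the random draw with an optimization step (e.g., matching the full-data gradient), but they plug into a training loop the same way: by producing an index list for the next epoch.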
-
DISTIL - Active Learning
DISTIL is a library that features many state-of-the-art active learning algorithms. Implemented in PyTorch, it provides fast and efficient implementations of these algorithms, and it allows users to modularly insert active learning selection into their pre-existing training loops with minimal changes. Most importantly, it achieves high model performance with less labeled data. If you are looking to cut labeling costs, DISTIL should be your go-to for getting the most out of your data.
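As a rough illustration of what an active learning selection step does, the sketch below implements plain entropy-based uncertainty sampling: given the model's predicted class probabilities on the unlabeled pool, pick the examples the model is least sure about. This is a generic sketch, not DISTIL's API; `entropy_select` is a hypothetical function name.

```python
import math

def entropy_select(probs, budget):
    """Rank unlabeled points by predictive entropy and return the indices
    of the `budget` most uncertain ones (hypothetical illustrative helper)."""
    def entropy(p):
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    scores = sorted(
        ((entropy(p), i) for i, p in enumerate(probs)),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return [i for _, i in scores[:budget]]

# Three unlabeled points; the 50/50 prediction is the most uncertain:
probs = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
picked = entropy_select(probs, budget=1)  # -> [1]
```

In a full loop, the selected indices would be sent for labeling, added to the training set, and the model retrained before the next selection round.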
-
SubmodLib - Submodular Optimization
SubmodLib provides efficient implementations of submodular functions and optimization algorithms, and can be used to summarize massive datasets via submodular optimization. The library includes implementations of various submodular functions such as facility location, graph cut, saturated coverage, and many more. It is designed for high performance, with optimized C++ backends and Python interfaces.
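To show what submodular summarization looks like, here is a small self-contained sketch of greedy maximization of the facility location function f(S) = Σᵢ maxⱼ∈S sim[i][j], which rewards subsets in which every point has a similar representative. This is a plain-Python illustration of the technique, not SubmodLib's (C++-backed) implementation.

```python
def facility_location_greedy(sim, k):
    """Greedily pick k indices maximizing the facility location function
    f(S) = sum_i max_{j in S} sim[i][j]  (illustrative sketch)."""
    n = len(sim)
    selected = []
    best = [0.0] * n  # best[i] = similarity of point i to its closest pick so far
    for _ in range(k):
        # Marginal gain of adding candidate j.
        def gain(j):
            if j in selected:
                return -1.0
            return sum(max(best[i], sim[i][j]) - best[i] for i in range(n))
        j = max(range(n), key=gain)
        selected.append(j)
        best = [max(best[i], sim[i][j]) for i in range(n)]
    return selected

# Two tight clusters {0,1} and {2,3}; a 2-element summary picks one from each:
sim = [[1.0, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.9],
       [0.1, 0.1, 0.9, 1.0]]
summary = facility_location_greedy(sim, k=2)  # -> [0, 2]
```

The greedy algorithm is attractive here because, for monotone submodular functions like facility location, it carries a (1 − 1/e) approximation guarantee.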
-
SPEAR - Data Programming
SPEAR is a Python library that reduces data-labeling effort using data programming. It implements several recent approaches, including Snorkel, ImplyLoss, and Learning to Reweight. In addition to data labeling, it integrates semi-supervised approaches for training and inference. SPEAR enables users to leverage weak supervision sources and programmatically label data at scale.
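The core data programming idea can be sketched with the simplest possible aggregator: each labeling function votes a class or abstains, and examples are labeled by majority vote. This is an illustrative toy, not SPEAR's API; real label models (e.g., Snorkel's generative model) also learn each labeling function's accuracy instead of counting votes equally.

```python
def majority_vote(lf_outputs):
    """Aggregate labeling-function votes per example (illustrative toy).
    Each inner list holds one vote per labeling function; -1 means abstain.
    Ties resolve to the first label seen."""
    labels = []
    for votes in lf_outputs:
        counts = {}
        for v in votes:
            if v != -1:  # ignore abstentions
                counts[v] = counts.get(v, 0) + 1
        labels.append(max(counts, key=counts.get) if counts else -1)
    return labels

# Three examples, three labeling functions each:
votes = [[1, 1, -1],    # two LFs say class 1
         [0, -1, -1],   # one LF says class 0, others abstain
         [-1, -1, -1]]  # all abstain -> stays unlabeled
labels = majority_vote(votes)  # -> [1, 0, -1]
```

The resulting (noisy) labels can then be fed to a downstream classifier, which is where SPEAR's semi-supervised training approaches come in.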
-
TRUST - Targeted Subset Selection
TRUST focuses on targeted subset selection techniques for efficient machine learning. Given a small target set (for example, examples from a rare class or a domain of interest), it provides tools and methods to identify and select the training points most relevant to that target, helping reduce computational costs while maintaining or improving model performance.
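One very simple instance of targeted selection is scoring each training point by its mean similarity to a handful of target exemplars and keeping the top-k. The sketch below uses dot-product similarity on feature vectors; it is a hypothetical illustration of the idea, not TRUST's method or API (TRUST's actual selection is built on submodular information measures).

```python
def targeted_select(train_feats, target_feats, k):
    """Rank training points by mean dot-product similarity to a small
    target set and return the top-k indices (illustrative sketch)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = sorted(
        ((sum(dot(x, t) for t in target_feats) / len(target_feats), i)
         for i, x in enumerate(train_feats)),
        reverse=True,  # most target-similar first
    )
    return [i for _, i in scores[:k]]

# Target points lie along the first feature axis; point 0 matches best:
train = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
target = [[1.0, 0.0]]
chosen = targeted_select(train, target, k=2)  # -> [0, 2]
```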