Abstract
Selecting prototypical examples from a source distribution to represent a target data distribution is a fundamental problem in machine learning. Existing subset selection methods often rely on implicit importance scores, which can be skewed towards majority classes and lead to low-quality prototypes for minority classes.
We present UniPROT, a subset selection framework that minimizes the optimal transport distance between a uniformly weighted prototypical distribution and the target distribution. While this formulation is natural, it yields a cardinality-constrained maximization of a super-additive objective that is generally hard to approximate efficiently.
UniPROT resolves this by reformulating the OT marginal constraints to obtain a submodular objective based on partial optimal transport. This yields a greedy algorithm with a (1 - 1/e) approximation guarantee relative to the original super-additive objective, while remaining scalable in practice.
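To make the inherited guarantee concrete, here is a minimal sketch of cardinality-constrained greedy maximization of a monotone submodular set function, the classical setting in which the (1 - 1/e) bound holds. The facility-location objective below is a generic stand-in, not UniPROT's partial-OT objective; all names here are illustrative.

```python
import numpy as np

def greedy_select(f, n_items, k):
    """Greedily grow S by the element with largest marginal gain, |S| <= k.

    For monotone submodular f with f({}) = 0, this achieves at least
    (1 - 1/e) of the optimal value (Nemhauser, Wolsey & Fisher, 1978).
    """
    S = []
    for _ in range(k):
        gains = [(f(S + [i]) - f(S), i) for i in range(n_items) if i not in S]
        _, best = max(gains)  # pick the item with the largest marginal gain
        S.append(best)
    return S

# Toy data: 40 source points; similarity via a Gaussian kernel (nonnegative).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
sim = np.exp(-np.square(X[:, None] - X[None, :]).sum(-1))

def facility_location(S):
    # f(S) = sum_j max_{i in S} sim[i, j]; monotone and submodular, f({}) = 0.
    return float(sim[S].max(axis=0).sum()) if S else 0.0

prototypes = greedy_select(facility_location, len(X), k=5)
```

Each greedy step costs one pass over the remaining candidates; lazy evaluation of marginal gains is the usual way to scale this further.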
Across imbalanced classification benchmarks, and in large language model fine-tuning and pretraining under domain imbalance, UniPROT consistently improves minority-class representation without sacrificing majority-class accuracy.