The Learning-Curve Method Applied To Clustering

Proceedings of Eighth International Workshop on Artificial Intelligence and Statistics, ® Key West, Florida |

Published by Morgan Kaufmann

We examine the learning-curve method, an approach for scaling machine-learning algorithms to large data sets. The approach is based on the observation that the computational cost of learning a model increases as a function of the sample size of the training data, whereas the accuracy of the model has diminishing improvements as a function of sample size. Thus, the learning-curve method monitors the increasing costs and performance as larger and larger amounts of data are used for training, and terminates learning when future costs outweigh future benefits. In this paper, we formalize the learning-curve method and its associated cost-benefit trade-off in terms of decision theory. In addition, we describe the application of the learning curve method to the task of model-based clustering via the EM algorithm. In experiments on two real domains, we show that the learning-curve method produces models that are nearly as accurate as those trained on complete data sets, but with dramatically reduced run times. Finally, we describe an extension of the basic learning-curve approach for model-based clustering that results in an additional speedup. Our approach is based on the observation that the shape of the learning curve for a given model and data set is roughly independent of the number of EM iterations used during training. Thus, we run EM for only a few iterations to decide how many cases to use for training, and then run EM to full convergence once the number of cases is selected.