Scaling EM (Expectation Maximization) Clustering to Large Databases
- Paul S. Bradley
- Usama M. Fayyad
- Cory A. Reina
MSR-TR-98-35
Practical statistical clustering algorithms typically center on an iterative refinement optimization procedure that computes a locally optimal clustering solution maximizing the fit to the data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that are compressible and regions that must be maintained in memory. The approach operates within the confines of a limited main memory buffer and requires at most a single database scan. Data resolution is preserved to the extent possible given the size of the main memory buffer and the fit of the current clustering model to the data. We extend the method to efficiently update multiple models simultaneously. Computational tests indicate that this scalable scheme outperforms sampling-based approaches, the straightforward alternatives to "scaling" traditional in-memory implementations to large databases.
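The following is a minimal sketch (not the report's implementation) of the core idea the abstract describes: because EM's updates depend only on weighted sufficient statistics, records that have been compressed into per-region summaries of (count, sum, sum of squares) can be folded into the same E-step and M-step as the raw records still held in the memory buffer. The fixed-size block compression, the 1-D Gaussian mixture, and all variable names below are illustrative assumptions, not the paper's actual compression criterion.

```python
# Sketch: EM over a 1-D Gaussian mixture where most of the data has been
# compressed into per-block sufficient statistics, mimicking a limited
# main-memory buffer.  All choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from two Gaussians standing in for a large database table.
data = np.concatenate([rng.normal(-2.0, 1.0, 5000), rng.normal(3.0, 0.7, 5000)])
rng.shuffle(data)

# Pretend the buffer holds only `keep` raw records; the rest are compressed
# into (n, sum, sum of squares) per block (here: naive fixed-size chunks).
keep = 500
retained = data[:keep]
blocks = data[keep:].reshape(-1, 100)
stats = np.stack([
    np.full(blocks.shape[0], float(blocks.shape[1])),  # n
    blocks.sum(axis=1),                                 # sum
    (blocks ** 2).sum(axis=1),                          # sum of squares
], axis=1)

def log_gauss(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Initial parameters for a 2-component mixture.
k = 2
weights = np.full(k, 1.0 / k)
means = np.array([-1.0, 1.0])
variances = np.ones(k)

for _ in range(50):
    # E-step on retained records: exact responsibilities.
    lp = np.log(weights) + log_gauss(retained[:, None], means, variances)
    r_pts = np.exp(lp - lp.max(axis=1, keepdims=True))
    r_pts /= r_pts.sum(axis=1, keepdims=True)

    # E-step on compressed blocks: approximate each block by its centroid and
    # weight its responsibilities by the block count n.
    n, s, ss = stats[:, 0], stats[:, 1], stats[:, 2]
    centroids = s / n
    lpb = np.log(weights) + log_gauss(centroids[:, None], means, variances)
    r_blk = np.exp(lpb - lpb.max(axis=1, keepdims=True))
    r_blk /= r_blk.sum(axis=1, keepdims=True)

    # M-step: accumulate weighted sufficient statistics from both sources.
    N_k = r_pts.sum(axis=0) + (r_blk * n[:, None]).sum(axis=0)
    S_k = (r_pts * retained[:, None]).sum(axis=0) + (r_blk * s[:, None]).sum(axis=0)
    SS_k = (r_pts * retained[:, None] ** 2).sum(axis=0) + (r_blk * ss[:, None]).sum(axis=0)

    weights = N_k / N_k.sum()
    means = S_k / N_k
    variances = np.maximum(SS_k / N_k - means ** 2, 1e-6)

print("weights:", weights, "means:", means, "variances:", variances)
```

The point of the sketch is the decomposition: once a region is summarized by its sufficient statistics, the mixture parameters can still be updated exactly as if the summarized records had been assigned en masse at the region's centroid, which is what allows the method to stay within a fixed memory buffer and a single scan.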