Large-Scale Algorithms

In Semi-Supervised Learning, MIT Press, 2006

In Chapter 11, it is shown how a number of graph-based semi-supervised learning algorithms can be seen as the minimization of a specific quadratic cost function, leading to a linear system with n equations and unknowns (n being the total number of labeled and unlabeled examples). Solving such a linear system in general requires O(kn²) time and O(kn) memory (for a sparse graph in which each data point has k neighbors), which can be prohibitive on large datasets, especially when k = n, i.e., when the graph is dense. In this chapter we present a subset selection method that reduces the original system to one of size m ≪ n. The idea is to solve for the labels of a subset S ⊂ X of only m points, while still retaining information from the rest of the data by approximating the labels of the remaining points with a linear combination of the labels in S (using the induction formula presented in Chapter 11). This leads to an algorithm whose computational requirements scale as O(m²n) and whose memory requirements scale as O(m²), thus allowing one to take advantage of significantly larger unlabeled datasets than the original algorithms can handle.
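To make the reduction concrete, here is a minimal NumPy sketch of one way the reduced m × m system can be assembled. It is written under stated assumptions rather than as the chapter's exact procedure: a Gaussian affinity W_ij = exp(−||x_i − x_j||² / 2σ²), the quadratic cost Σ_{i∈L}(f_i − y_i)² + μ fᵀLf with L the graph Laplacian, all labeled points placed inside the subset S, and the affinities among points outside S dropped. The function name reduced_system_ssl and the parameters sigma and mu are illustrative choices, not the chapter's notation.

```python
import numpy as np

def gaussian_affinity(A, B, sigma):
    """Gaussian affinities exp(-||a - b||^2 / (2 sigma^2)) between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def reduced_system_ssl(X, y_l, labeled_idx, subset_idx, sigma=1.0, mu=1.0):
    """Label all n points by solving only an m x m system over the subset S.

    The labels of the points R outside S are tied to those of S through the
    induction formula f_R = A f_S, with A_ij = W_ij / sum_k W_ik (i in R,
    j, k in S), and this constraint is substituted into the quadratic cost.
    Assumes the labeled points are all included in S.
    """
    n = X.shape[0]
    S = np.asarray(subset_idx)
    R = np.setdiff1d(np.arange(n), S)
    m = len(S)

    W_SS = gaussian_affinity(X[S], X[S], sigma)   # m x m
    W_RS = gaussian_affinity(X[R], X[S], sigma)   # (n - m) x m
    A = W_RS / W_RS.sum(axis=1, keepdims=True)    # induction weights, rows sum to 1

    # Smoothness term f^T L f with f = [f_S; A f_S], dropping the affinities
    # among points of R: the m x m matrix M^T L M = M^T D M - M^T W M,
    # where M stacks the identity on top of A; built in O(m^2 n) time.
    d_S = W_SS.sum(axis=1) + W_RS.sum(axis=0)     # degrees of subset points
    d_R = W_RS.sum(axis=1)                        # degrees of the other points
    MtLM = (np.diag(d_S) + (A.T * d_R) @ A        #   M^T D M
            - (W_SS + W_RS.T @ A + A.T @ W_RS))   # - M^T W M

    # Minimizing sum_{i in L} (f_i - y_i)^2 + mu f^T L f over f_S gives the
    # m x m system (P^T P + mu M^T L M) f_S = P^T y_l, with P selecting the
    # labeled positions within S.
    pos = np.array([np.flatnonzero(S == i)[0] for i in labeled_idx])
    H = mu * MtLM
    H[pos, pos] += 1.0
    b = np.zeros(m)
    b[pos] = y_l
    f_S = np.linalg.solve(H, b)

    f = np.empty(n)
    f[S] = f_S
    f[R] = A @ f_S          # induce the labels of the points outside S
    return f
```

For clarity the sketch materializes the full (n − m) × m block W_RS at once, which costs O(mn) memory; processing the points outside S in fixed-size blocks when accumulating the m × m matrices brings this down to the O(m²) memory requirement quoted above.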