Segmentation |
|
Given a series of data, [d0, d1, ..., dn-1], does it consist of one, two [[d0, ..., di], [di+1, ..., dn-1]], or more segments, where each segment is homogeneous in some way? This problem is a special case of unsupervised classification because here the members of a segment ("class") must be contiguous. The number of segments and their boundaries must be learned. Given an estimator, est, we can estimate a model for each (hypothetical) segment, calculate the message-length of each segment and its model, include the encoding of the segment "cut-points", and thus cost a segmentation. The best segmentation is the one having the shortest total message length. W.D.Fisher (1958) gave an O(K.n2)-time algorithm to find the best segmentation having exactly K segments, but gave no "stopping criterion" to determine the best K. If we can break the total segmentation cost into the sum of the cost of each segment, [di+1, ..., dj] where j>i, we can use the 1-dimensional [dynamic programming algorithm] (DPA). Such a break-down requires stating: the length of each segment, the parameters of each segment (to appropriate precision), and the segment's members (the data) given those parameter estimates. This break-down is an approximation to the true cost of the segmentation because it omits the number of segments (small), and states the lengths of all segments which is redundant if the length of the series is common knowledge[*]. However these discrepancies are small. (And they can be removed by using Fisher's algorithm with our costs but then we have to try K=1, 2, ..., Kmax.)
The natural null-hypothesis is to apply the estimator to fit a single model to all of the data in the whole series. The segmentation algorithm can be applied to any type of data -- univariate, multivariate, discrete, continuous, mixed, correlated or not -- provided only that the estimator, est, and the dataSet agree about that type. The DPA takes O(n2)-time if the cost of each segment can be calculated in constant time. Also see
[*] A way to "balance the books" is for the null-hypothesis to include the cost of stating the length of the series. |
|
|