A tutorial of clustering algorithm
Original article: http://www.elet.polimi.it/upload/matteucc/clustering/tutorial_html/hierachical.html
Auxiliary information: http://149.170.199.144/multivar/ca.htm
Level clustering method
Basic working principle
The object of the N * N is a distance matrix (or a similarity matrix) of the N * N is given, the basic steps of the hierarchical cluster method (see S.c. Johnson in 1967) are as follows:
1. Make each object into a group, a total of N groups, each group contains only an object. The distance between the group and the group is the distance between the objects they contain.
2. The nearest combination is combined into a group, so the total number of groups is less.
3. Recalculate the distance between the new group and all the old groups.
4. Repeat steps 2 and 3 until it is finally merged into one group (this group contains N objects).
Depending on step 3, the layer clustering method can be divided into several categories: Single-Linkage, Complete-Linkage, and Average-Linkage clustering methods.
SINGLE-LINKAGE clustering (also known as Connectedness or Minimum Method): The minimum distance between the group is equal to between the two sets of objects. Complete-Linkage Clustering (also known as Diameter or Maximum Method): The multi-phase distance is equal to the maximum distance between the two groups. Average-Linkage Clustering: The inter-group distance is equal to the average distance between the two groups. Avertage-link clustering is a UCLUS method of R. D'Andrade (1978). It uses Median distance, which is better than the average distance performance in terms of abnormal data objects.
Single connection clustering algorithm
The following we use a single connection clustering algorithm as an example of the principle of the Johnson algorithm.
A single connection algorithm belongs to the agglomeration algorithm, that is, two old combinations into a new group, until it is finally merged into one group. Each merges once, the corresponding row and column are removed in the distance matrix.
First introduce some symbols: The N objects to be clustered are numbered 0, 1, ..., (N-1), D = [D (i, j)] to represent the corresponding N * N Distance matrix. The symbol L (k) represents the level of the kth group, and the group (M) composed of the object m is (m), the group (R) and the group (S) are remembected as D [(R), (s)] .
The single connection clustering algorithm is as follows:
1. There are N groups in the initial time, and each group consists of an object. Order sequence number m = 0, L (m) = 0.
2. Look for minimum distance D [(r), (s)] = min d [(i), (j)].
3. Combine two groups (R) and (S) into a new group (R, S); order m = m 1, L (m) = d [(r), (s)]
4. Update the distance matrix D: Remove the ranks of the group (R) and group (s), simultaneously join the ranks representing the new group (R, S); simultaneously defines the new group (R, S) and the old group (K The distance is d [(k), (r, s)] = min D [(k), (r)], d [(k), (s)]
5. Repeat steps 2-4 until all objects are merged into one group.