
As the name suggests this form of clustering has its historic roots in the field of statistical analysis. This simple algorithm produces a class hierarchy based on numerical similarity within object descriptions.
Each object is described by a number of attributes, the values of these attributes are numerical values. A typical object description for a person may be as follows :
| Attribute | Value |
| Height (M) | 1.85 |
| Weight (Kg) | 180.0 |
| IQ | 100 |
A set of object descriptions may be represented as a set of vectors as follows :
Object 1 |
(1.85 180.0 100) |
Object 2 |
(1.75 195.5 80) |
Object 3 |
(1.45 135.0 195) |
.... |
.... |
A means of measuring the similarity or distance between the object descriptions is required. Any metric can be used for example the Euclidean distance or city block distance.
Where xi and yi are the elements of two object description vectors X and Y.
The distance measure employed has to initially measure the distance between single vectors but will also have to measure the distance between a vector and a cluster (a collection of vectors) and between two clusters. Various approaches are possible such as the distance to the nearest point within the cluster, the furthest point in the cluster or to a point in some sense representing the centre of a cluster.
Two algorithms for performing statistical clustering are presented. Both of these algorithms perform bottom-up clustering. The algorithms start with a few small clusters and enlarge them to create new clusters. (An alternative to this is top-down clustering where the clustering system starts with all data in one large cluster and then subdivides this set into smaller clusters, top-down clustering tends to be computationally expensive). Both of these statistical clustering algorithms are non-incremental and simple to implement.
The following algorithm is from (Thornton, 1992). A hierarchical classification is produced. All input data is required prior to clustering to allow a similarity matrix to be constructed.
|
|
The second algorithm from (Hutchinson, 1994) uses the notion of a cluster diameter. This is the largest distance between any two points in a cluster. A point is said to be nearly central if the distance from it to every other point in the cluster is less than 2/3 of the cluster diameter. The algorithm, which produces a partition of the input data rather than a hierarchy, follows. The sort step requires all data to be available prior to clustering. Should the distance between any two input pairs be the same, the location within the sorted list will be arbitrary and could lead to different classifications being produced.
|
|
|
|
Authored by Serengul Smith
E-mail to:
serengul1@mdx.ac.uk
School of Computing Science Middlesex University
Revised: September 1998