Support Vector Clustering (SVC) is the first clustering technique based on the SVM formalism; it was introduced by Ben-Hur et al. It is still the subject of research and improvement (my master’s thesis contains a nearly comprehensive review).
The software status
The software available for download at the bottom of this page was developed to run the experiments included in my master’s thesis. It is neither mature nor stable, and it is not complete either. Consider it pre-alpha software.
What it implements
- The Cluster Description stage by One Class SVM (which is equivalent to SVDD under certain conditions; see Appendix A of my master’s thesis), as implemented in LIBSVM.
- Two different Cluster Labeling algorithms: the original graph-based one proposed by Ben-Hur et al., and Cone Cluster Labeling by Lee and Daniels. Note that the original one performs poorly, probably because of mistakes in the implementation.
- The Secant-like Kernel Width Selector, originally proposed by Lee and Daniels and improved by myself as described in my master’s thesis.
- My heuristics for the selection of the Soft Margin parameter. See my master’s thesis for further details.
- Three kernels: Gaussian, Laplace, Exponential. The first two are “official” and are listed in the command-line help (shown when you run svc without parameters). The third one is hidden, but you can use it by passing “7” as the kernel number.
- Stopping criteria: this is the weak part. There is only the original stopping criterion, based on the fraction of support vectors, plus a maximum number of iterations. In addition, to make experiments on labeled, well-known datasets easier, a third stopping criterion was added: it stops the run when the accuracy starts decreasing after having increased for a number of iterations.
- Minimum cluster cardinality: you can specify a minimum cardinality below which clusters are discarded.
- L1 distance: you can replace the L2 distance with the L1 distance for measuring distances among points in a cluster labeling algorithm. This feature (still) lacks a strong theoretical foundation, but experimentally it provides some benefit when the data are scaled/normalized.
- Alternative policy for clustering BSVs: bounded support vectors (BSVs) can be clustered at the end of the procedure instead of together with the SVs. In the software this is the default behavior, contrary to the literature.
- LIBSVM file format: the input file format is the same as the one used by LIBSVM.
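Three of the items above (the LIBSVM input format, the L1/L2 distance switch, and the minimum cluster cardinality) are easy to illustrate in a few lines. What follows is only a minimal Python sketch, not code taken from the software: the function names (`parse_libsvm_line`, `distance`, `drop_small_clusters`) and the convention of marking discarded points with `-1` are assumptions made for the example. The only thing taken from LIBSVM is its sparse text format, `<label> <index>:<value> ...`, where omitted indices are zero.

```python
from collections import Counter
from typing import Dict, List, Tuple

SparseVector = Dict[int, float]


def parse_libsvm_line(line: str) -> Tuple[float, SparseVector]:
    """Parse one line of the LIBSVM sparse format: '<label> <index>:<value> ...'."""
    tokens = line.split()
    label = float(tokens[0])
    features: SparseVector = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)  # indices not listed are implicitly 0
    return label, features


def distance(a: SparseVector, b: SparseVector, norm: str = "l2") -> float:
    """L1 or L2 distance between two sparse vectors (hypothetical helper)."""
    diffs = [abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b)]
    if norm == "l1":
        return sum(diffs)
    return sum(d * d for d in diffs) ** 0.5


def drop_small_clusters(labels: List[int], min_size: int) -> List[int]:
    """Mark points of clusters with fewer than min_size members as -1 (assumed convention)."""
    sizes = Counter(labels)
    return [lab if sizes[lab] >= min_size else -1 for lab in labels]


# Usage: two points written in LIBSVM format, compared with both distances.
_, p = parse_libsvm_line("1 1:3.0 3:4.0")
_, q = parse_libsvm_line("2 1:0.0")
print(distance(p, q, "l2"))  # 5.0
print(distance(p, q, "l1"))  # 7.0

# Clusters 0 and 2 survive; the singleton cluster 1 is discarded.
print(drop_small_clusters([0, 0, 0, 1, 2, 2], min_size=2))  # [0, 0, 0, -1, 2, 2]
```

In a clustering setting the label column of the input file is not used for training; on labeled benchmark datasets it would only serve for evaluation, as with the accuracy-based stopping criterion mentioned above.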
TODO and Future Work
I am developing a completely re-engineered version of this software, which will be based on LIBSVM Plus.
This next-generation software will be a library implementing a more stable, complete, and user-ready version of Support Vector Clustering. It will also provide other support vector methods for clustering.
More details will be provided when the software is released.
The software includes a hacked version of LIBSVM for the cluster description stage. Moreover, it includes the Boost Graph Library for graph operations.
LIBSVM is released under a modified BSD license, and Boost under the Boost Software License. My own code is released under the GPL.
Further useful information
A useful discussion between a student of the City University of Hong Kong and me can be found in the comments of this post.
(It compiles and runs only on Linux and Mac OS X.)