This package contains the code for computing 15 external clustering validity indices in Spark, including Chi Index. A Python implementation of Chi Index can be found in this repository: Chi Index Python. This package includes the following indices in Scala:
Chi Index takes a value in [0, 2], where 0 corresponds to the worst clustering solution and 2 is the best value that Chi Index can achieve.
The indices can be computed on clusterings produced by K-means and Bisecting K-means from Spark MLlib, and by the Linkage method.
Please cite as: Luna-Romera JM, Martínez-Ballesteros M, García-Gutiérrez J, Riquelme JC. External clustering validity index based on chi-squared statistical test. Information Sciences (2019) 487: 1-17. https://doi.org/10.1016/j.ins.2019.02.046. (http://www.sciencedirect.com/science/article/pii/S0020025519301550)
The package is ready to use. You only have to download it and import it into your workspace. The package includes the iris dataset as an example, which can be run with the main test class.
The MainExternalTest class has been configured to run on a laptop. You can execute this class directly, but there are some variables you should be aware of:
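import org.apache.spark.sql.SparkSession

// Local Spark session that uses all available cores
val spark = SparkSession
  .builder()
  .appName("Featuring Clusters")
  .master("local[*]")
  .getOrCreate()

val irisFile = "iris.data"                  // path to the input dataset
val numIterations = 1000                    // maximum number of iterations of the clustering algorithm
var minClusters = 2                         // lower bound of the range of cluster counts
var maxClusters = 10                        // upper bound of the range of cluster counts
val origen = irisFile                       // input file ("origen" = source)
val destino: String = Utils.whatTimeIsIt()  // output name ("destino" = destination), built by the package's Utils
var idIndex = "-1"                          // column index of the instance id
val classIndex = 4                          // zero-based column index of the class label
val delimiter = ","                         // field delimiter of the dataset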
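As a rough illustration of how `delimiter`, `classIndex` and `idIndex` relate to a row of `iris.data`, the sketch below splits one line into numeric features and a class label, then walks the range from `minClusters` to `maxClusters`. `ConfigSketch`, `Instance` and `parseLine` are hypothetical names that are not part of this package, and the sketch assumes that an `idIndex` of -1 means the dataset has no id column; the package's own loader may work differently.

```scala
object ConfigSketch {
  import java.util.regex.Pattern

  // Hypothetical container for one parsed row: numeric features plus the class label.
  final case class Instance(features: Vector[Double], label: String)

  // Splits a line on the delimiter, takes the label from classIndex and
  // treats every other column (except idIndex) as a numeric feature.
  // Assumption: idIndex = -1 means there is no id column to skip.
  def parseLine(line: String, delimiter: String, classIndex: Int, idIndex: Int): Instance = {
    val fields = line.split(Pattern.quote(delimiter)).map(_.trim)
    val features = fields.indices
      .filter(i => i != classIndex && i != idIndex)
      .map(i => fields(i).toDouble)
      .toVector
    Instance(features, fields(classIndex))
  }

  def main(args: Array[String]): Unit = {
    val instance = parseLine("5.1,3.5,1.4,0.2,Iris-setosa", ",", classIndex = 4, idIndex = -1)
    println(instance)

    // One clustering run per candidate number of clusters.
    val minClusters = 2
    val maxClusters = 10
    for (numClusters <- minClusters to maxClusters)
      println(s"would run the clustering with k = $numClusters")
  }
}
```

For a dataset with a different layout, only `classIndex`, `idIndex` and `delimiter` would need to change in this sketch.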
This application can be used with KMeans, GaussianMixture, LDA and BisectingKMeans from Spark ML. Comment out the methods that you are not going to use:
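import org.apache.spark.ml.clustering.{BisectingKMeans, GaussianMixture, KMeans, LDA}

// Keep exactly one constructor uncommented; the setters below exist on all four estimators
val clusteringResult = new KMeans()
//val clusteringResult = new GaussianMixture()
//val clusteringResult = new LDA()
//val clusteringResult = new BisectingKMeans()
  .setK(numClusters)          // number of clusters for the current run
  .setSeed(1L)                // fixed seed for reproducible results
  .setMaxIter(numIterations)  // maximum number of iterations
  .setFeaturesCol("features") // name of the feature vector column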
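An external validity index compares the cluster assigned to each instance with its known class label. The sketch below shows that idea on plain Scala collections by building a contingency table and computing purity from it. Purity is used only because it is short; it is not claimed to be one of the indices implemented in this package, and `ExternalIndexSketch`, `contingency` and `purity` are hypothetical names. In a Spark run, the (cluster, class) pairs would presumably come from the predictions of the fitted model.

```scala
object ExternalIndexSketch {

  // Counts how many instances fall into each (cluster, class) cell.
  def contingency(pairs: Seq[(Int, String)]): Map[(Int, String), Int] =
    pairs.groupBy(identity).map { case (cell, members) => cell -> members.size }

  // Purity: for every cluster take the size of its majority class,
  // sum those sizes and divide by the total number of instances.
  def purity(pairs: Seq[(Int, String)]): Double =
    if (pairs.isEmpty) 0.0
    else {
      val table = contingency(pairs)
      val majoritySum = table
        .groupBy { case ((cluster, _), _) => cluster }
        .values
        .map(cells => cells.values.max)
        .sum
      majoritySum.toDouble / pairs.size
    }

  def main(args: Array[String]): Unit = {
    // Toy data: (cluster id, class label) for six instances.
    val pairs = Seq((0, "a"), (0, "a"), (0, "b"), (1, "b"), (1, "b"), (1, "a"))
    println(contingency(pairs))
    println(purity(pairs))
  }
}
```

Working from the contingency table rather than from the raw pairs keeps the sketch close to how external indices are usually defined, since most of them are functions of those cell counts.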
By default, the results are saved in the same folder as the dataset. The results are saved in several folders:
This data can be copied and pasted directly into a spreadsheet for visualization.
For the iris dataset, the results are the following:
BibTeX entry for the reference given above:
@article{LUNAROMERA20191,
  title = {External clustering validity index based on chi-squared statistical test},
  journal = {Information Sciences},
  volume = {487},
  pages = {1-17},
  year = {2019},
  issn = {0020-0255},
  doi = {https://doi.org/10.1016/j.ins.2019.02.046},
  url = {https://www.sciencedirect.com/science/article/pii/S0020025519301550},
  author = {José María Luna-Romera and María Martínez-Ballesteros and Jorge García-Gutiérrez and José C. Riquelme},
  keywords = {Clustering analysis, External validity indices, Comparing clusters, Big data}
}