Project author: josemarialuna

Project description:
This package contains the code for executing clustering validity indices in Java by using K-means from Weka. The package includes the following clustering validity indices: Silhouette, Dunn, BD-Silhouette, BD-Dunn, Davies-Bouldin, Calinski-Harabasz, MaximumDiameter, SquaredDistance, AverageDistance, AverageBetweenClusterDistance, MinimumDistance.
Language: Java
Project URL: git://github.com/josemarialuna/smallDataIndex.git
Created: 2016-11-03T17:48:08Z
Project community: https://github.com/josemarialuna/smallDataIndex



smallDataIndex

This package runs Weka's K-means clustering algorithm together with a set of clustering validity indices that help you decide the optimal number of clusters for a dataset. The package includes the following indices:

  • Silhouette
  • Dunn
  • BD-Silhouette [1]
  • BD-Dunn [1]
  • Davies-Bouldin
  • Calinski-Harabasz
  • MaximumDiameter
  • SquaredDistance
  • AverageDistance
  • AverageBetweenClusterDistance
  • MinimumDistance
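
To make two of the listed indices concrete, here is a minimal, self-contained sketch of the classic Silhouette and Dunn definitions in plain Java. It is not the package's own implementation: the class name ValidityIndexSketch, the Euclidean distance, the single-linkage form of Dunn (smallest between-cluster point distance divided by the largest cluster diameter) and the toy data are all assumptions made for illustration.

```java
final class ValidityIndexSketch {

    // Euclidean distance between two points of equal dimension (assumed metric).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Mean silhouette over all points. labels[i] must lie in [0, k) and
    // at least two clusters must be non-empty.
    static double silhouette(double[][] points, int[] labels, int k) {
        int n = points.length;
        int[] sizes = new int[k];
        for (int label : labels) sizes[label]++;
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            if (sizes[labels[i]] == 1) continue; // singleton cluster: s(i) = 0 by convention
            double[] sumDist = new double[k];    // summed distance from point i to each cluster
            for (int j = 0; j < n; j++) {
                if (j != i) sumDist[labels[j]] += distance(points[i], points[j]);
            }
            // a: mean distance to the other members of the point's own cluster
            double a = sumDist[labels[i]] / (sizes[labels[i]] - 1);
            // b: smallest mean distance to any other non-empty cluster
            double b = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                if (c != labels[i] && sizes[c] > 0) b = Math.min(b, sumDist[c] / sizes[c]);
            }
            double max = Math.max(a, b);
            if (max > 0.0) total += (b - a) / max;
        }
        return total / n;
    }

    // Dunn index: minimum distance between points of different clusters,
    // divided by the maximum distance between points of the same cluster.
    static double dunn(double[][] points, int[] labels) {
        double minBetween = Double.POSITIVE_INFINITY;
        double maxDiameter = 0.0;
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                double d = distance(points[i], points[j]);
                if (labels[i] == labels[j]) maxDiameter = Math.max(maxDiameter, d);
                else minBetween = Math.min(minBetween, d);
            }
        }
        return minBetween / maxDiameter;
    }

    public static void main(String[] args) {
        // Toy data: two tight, well-separated clusters (made up for this sketch).
        double[][] points = {{0, 0}, {0, 1}, {4, 0}, {4, 1}};
        int[] labels = {0, 0, 1, 1};
        System.out.println("Silhouette = " + silhouette(points, labels, 2));
        System.out.println("Dunn = " + dunn(points, labels)); // Dunn = 4.0
    }
}
```

On this toy data the mean silhouette is roughly 0.754. Both indices are read as "higher is better", so well-separated, compact clusters score high on each.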

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

This is a Maven project that depends on OpenCSV 2.4 and Weka 3.8.0. Both dependencies are declared in the pom.xml file in the repository:

    <dependency>
        <groupId>au.com.bytecode</groupId>
        <artifactId>opencsv</artifactId>
        <version>2.4</version>
    </dependency>
    <dependency>
        <groupId>nz.ac.waikato.cms.weka</groupId>
        <artifactId>weka-stable</artifactId>
        <version>3.8.0</version>
    </dependency>

Running

WekaCluster is the main class. It takes five arguments, which can be set in the code or passed on the command line when you execute the jar file:

  • Argument 0: minNumCluster: the minimum number of clusters to test on the dataset.
  • Argument 1: maxNumCluster: the maximum number of clusters to test on the dataset.
  • Argument 2: pathFile: the path of the input dataset. It must include the complete file path.
  • Argument 3: outFile: the path of the output result file. It must include the complete file path.
  • Argument 4: selector: one of SMALLDATA, BIGDATA or ALL. SMALLDATA executes only the small-data indices. BIGDATA executes only the BD-Silhouette and BD-Dunn indices from [1]. ALL executes both the SMALLDATA and BIGDATA indices.

    int minNumCluster = 2;
    int maxNumCluster = 10;
    int selector = SMALLDATA;
    String fileName = "SmallDataset.csv";
    String folderFile = "C:\\datasets\\";
    String pathFile = folderFile + fileName;
    String outFile = getFileNameOutput(selector, fileName);

With this configuration, the application loads the file SmallDataset.csv from “C:/datasets” and saves the result file as “Results-SmallDataset.csv” in the application folder.
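
Once an index has been evaluated for every cluster number between minNumCluster and maxNumCluster, picking a candidate usually comes down to finding the best score in that range. The sketch below assumes one score per cluster number and is not taken from the package. The class BestKSketch, the method bestK and the score values are hypothetical placeholders, not real results. Silhouette, Dunn and Calinski-Harabasz are read as higher is better, while Davies-Bouldin is read as lower is better.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BestKSketch {

    // Returns the cluster number with the best score.
    // higherIsBetter = true  for Silhouette, Dunn, Calinski-Harabasz;
    // higherIsBetter = false for Davies-Bouldin.
    // On ties, the first cluster number in iteration order wins.
    static int bestK(Map<Integer, Double> scoreByK, boolean higherIsBetter) {
        if (scoreByK.isEmpty()) {
            throw new IllegalArgumentException("No scores to compare");
        }
        Integer bestK = null;
        double bestScore = 0.0;
        for (Map.Entry<Integer, Double> entry : scoreByK.entrySet()) {
            double score = entry.getValue();
            boolean better = bestK == null
                    || (higherIsBetter ? score > bestScore : score < bestScore);
            if (better) {
                bestK = entry.getKey();
                bestScore = score;
            }
        }
        return bestK;
    }

    public static void main(String[] args) {
        // Placeholder scores for k = 2..5, invented purely for illustration.
        Map<Integer, Double> silhouetteByK = new LinkedHashMap<>();
        silhouetteByK.put(2, 0.41);
        silhouetteByK.put(3, 0.63);
        silhouetteByK.put(4, 0.52);
        silhouetteByK.put(5, 0.47);
        System.out.println("Best k by Silhouette: " + bestK(silhouetteByK, true));
    }
}
```

In practice, different indices may disagree, so it can help to compare the cluster number each index suggests rather than rely on a single one.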

Execution example

If you prefer to run it from a terminal with java, use a command such as one of the following:

    java -jar smallDataIndices.jar 2 10 C:/datasets/SmallDataset.csv Results.csv ALL
    java -jar smallDataIndices.jar 10 20 datasets/dataset.csv results.csv SMALLDATA
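
The five positional arguments in the commands above could be read and validated along these lines. This is a minimal sketch, not the code of WekaCluster: the class ClusterArgsSketch, the Selector enum (the original snippet uses an int constant) and the validation rules (at least 2 clusters, maximum not below minimum) are assumptions made for illustration.

```java
import java.util.Locale;

final class ClusterArgsSketch {

    // Mirrors the three selector values described for argument 4 (enum is an assumption).
    enum Selector { SMALLDATA, BIGDATA, ALL }

    final int minNumCluster;
    final int maxNumCluster;
    final String pathFile;
    final String outFile;
    final Selector selector;

    ClusterArgsSketch(int minNumCluster, int maxNumCluster,
                      String pathFile, String outFile, Selector selector) {
        this.minNumCluster = minNumCluster;
        this.maxNumCluster = maxNumCluster;
        this.pathFile = pathFile;
        this.outFile = outFile;
        this.selector = selector;
    }

    // Parses: minNumCluster maxNumCluster pathFile outFile selector
    static ClusterArgsSketch parse(String[] args) {
        if (args.length != 5) {
            throw new IllegalArgumentException(
                    "Expected 5 arguments: minNumCluster maxNumCluster pathFile outFile selector");
        }
        int min = Integer.parseInt(args[0]);
        int max = Integer.parseInt(args[1]);
        // Assumed sanity checks: a clustering needs at least 2 clusters.
        if (min < 2 || max < min) {
            throw new IllegalArgumentException("Require 2 <= minNumCluster <= maxNumCluster");
        }
        // valueOf throws IllegalArgumentException for an unknown selector name.
        Selector selector = Selector.valueOf(args[4].toUpperCase(Locale.ROOT));
        return new ClusterArgsSketch(min, max, args[2], args[3], selector);
    }

    public static void main(String[] args) {
        ClusterArgsSketch cfg = parse(new String[] {
                "2", "10", "C:/datasets/SmallDataset.csv", "Results.csv", "ALL"});
        System.out.println(cfg.minNumCluster + ".." + cfg.maxNumCluster
                + " " + cfg.pathFile + " -> " + cfg.outFile + " [" + cfg.selector + "]");
    }
}
```

Failing fast on a wrong argument count or an unknown selector gives a clearer error than letting the run fail later while reading the dataset.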

Built With

Authors

Acknowledgments

  • University of Waikato

References

[1] Luna-Romera, J.M., García-Gutiérrez, J., Martínez-Ballesteros, M. et al. Prog Artif Intell (2017). https://doi.org/10.1007/s13748-017-0135-3

Please cite as: Luna-Romera, J.M., García-Gutiérrez, J., Martínez-Ballesteros, M. et al. Prog Artif Intell (2017). https://doi.org/10.1007/s13748-017-0135-3 (https://link.springer.com/article/10.1007%2Fs13748-017-0135-3)