Machine Learning Datasets (mlds) is a repo for downloading, preprocessing,
and numpy-ifying popular machine learning datasets. Across projects, I commonly
found myself rewriting the same lines of code to standardize, normalize, or
other-ize data, encode categorical variables, parse out subsets of features,
among other transformations. This repo provides a simple interface to download,
parse, transform, and clean datasets via flexible command-line arguments. All
of the information you need to start using this repo is contained within this
one README, ordered by complexity (no need to parse through any ReadTheDocs
documentation).
Integrating a data source into mlds requires writing a simple module within the
mlds.downloaders package to download the dataset (by implementing a retrieve
function). Afterwards, add the module to the package initializer and the
dataset is ready to be processed by mlds. Notably, the only requirement is that
retrieve should return a dict containing dataset partitions as keys and pandas
DataFrames as values. At this time, the following datasets are supported:
This repo was designed to be interoperable with the following models and
attacks repos. I recommend installing an editable version of this repo via
pip install -e. Afterwards, you can download, rescale, and save MNIST so that
features are in [0, 1] (via a custom UniformScaler transformer, which scales
all features based on the maximum and minimum feature values observed across
the training set) with the dataset referenced as mnist via:
mlds mnist -f all UniformScaler --filename mnist
You can download, rescale, and save the NSL-KDD so that categorical features
are one-hot encoded (via the scikit-learn OneHotEncoder) while the remaining
features (all can be used as an alias to select all features except those that
are to be one-hot encoded) are individually rescaled to be in [0, 1] (via the
scikit-learn MinMaxScaler), with numeric label encoding (via the scikit-learn
LabelEncoder), with the dataset referenced as
nslkdd_OneHotEncoder_MinMaxScaler_LabelEncoder (when filename is not
specified, the file name defaults to the name of the dataset concatenated with
the transformations) via:
mlds nslkdd -f protocol_type,service,flag OneHotEncoder -f all MinMaxScaler -l LabelEncoder
Afterwards, import the dataset filename to load it:
>>> import mlds
>>> from mlds import mnist
>>> mnist
mnist(samples=(60000, 10000), features=(784, 784), classes=(10, 10), partitions=(train, test), transformations=(UniformScaler), version=c88b3d6)
>>> mnist.train
train(samples=60000, features=784, classes=10)
>>> mnist.train.data
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
>>> mnist.train.labels
array([4., 1., 0., ..., 6., 1., 5.], dtype=float32)
Other uses can be found in the examples directory.
Below are descriptions of some of the more subtle controls within this repo and
complex use cases.
API: Should you wish to interface with mlds outside of the command line, the
main entry point for the repo is the process function within the datasets
module. The docstring contains all of the necessary details surrounding
arguments and their required types. An example of interfacing with mlds in
this way can be found in the dnn_prepare example script.
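As a rough sketch (the keyword arguments below are hypothetical, modeled on the
command-line flags rather than the actual signature, so consult the docstring
for the real interface), a call mirroring the NSL-KDD example above might look
like:

import mlds.datasets

# hypothetical sketch only: the keyword names below mirror the command-line
# flags and are assumptions, not the real signature (see the process docstring)
mlds.datasets.process(
    dataset="nslkdd",
    features=(("protocol_type", "service", "flag"), ("all",)),  # feature subsets to transform
    schemes=(("OneHotEncoder",), ("MinMaxScaler",)),  # transformations per feature subset
    labels=("LabelEncoder",),  # label transformation
    filename="nslkdd",  # name of the saved dataset file
)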
Cleaning: When a dataset is to be processed from the command line, it can be
optionally cleaned by passing --destupefy as an argument. This applies a
transformation, after all other transformations, via the Destupefier. The
Destupefier is an experimental transformer that removes duplicate columns,
duplicate rows, and single-valued columns. This is particularly useful when
exploring data reduction techniques that may produce a large number of
irrelevant features or duplicate samples.
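For example, the MNIST command from earlier could be extended with cleaning as
follows (only --destupefy itself is documented above; its placement within the
command is an assumption):

mlds mnist -f all UniformScaler --destupefy --filename mnist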
Downloading: If you wish to add a dataset that is not easily retrievable from
another library (e.g., TensorFlow Datasets, Torchvision Datasets,
pandas.read_csv, etc.), then you may find the download function within
mlds.downloaders helpful. Specifically, given a tuple of URLs,
mlds.downloaders.download leverages the Requests library to return a
dictionary containing the requested resources as bytes objects.
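Tying this together with the retrieve convention described earlier, a minimal
downloader module might look like the following sketch (the module name, URLs,
and parsing are hypothetical, and it is assumed here that the returned
dictionary is keyed by URL):

# hypothetical downloader module (e.g., mlds/downloaders/mydataset.py); the
# URLs and CSV parsing are placeholders for illustration only
import io

import pandas

import mlds.downloaders


def retrieve():
    """Downloads and parses a fictional CSV-based dataset."""
    urls = (
        "https://example.com/mydataset/train.csv",
        "https://example.com/mydataset/test.csv",
    )

    # download fetches each URL and returns the resources as bytes objects
    # (assumed here to be keyed by URL)
    resources = mlds.downloaders.download(urls)

    # partition names are the keys; a "train" key enables fit-on-train behavior
    return {
        "train": pandas.read_csv(io.BytesIO(resources[urls[0]])),
        "test": pandas.read_csv(io.BytesIO(resources[urls[1]])),
    }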
Partitions: Following best practices, transformations are fit to the training
set and applied to all other partitions. A partition is considered to be a
training set if its key is “train” in the data dictionary returned by dataset
downloaders (this is commonly the case when downloading datasets through other
frameworks such as Torchvision Datasets and TensorFlow Datasets). If the data
dictionary does not contain this key, then all partitions are fitted and
transformed separately.
Transformations: Currently, the following data transformations are supported:
0-1 scaling (across all features): UniformScaler
0-1 scaling (per feature): MinMaxScaler (scikit-learn)
One-hot encoding (for categorical features): OneHotEncoder (scikit-learn)
Numeric label encoding: LabelEncoder (scikit-learn)
Adding custom transformations requires inheriting sklearn.base.BaseEstimator
and sklearn.base.TransformerMixin and implementing fit, set_output, and
transform methods (a minimal sketch is shown below). Alternatively,
transformations from other libraries can be simply aliased into the
transformations module.
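The sketch below shows the shape of such a custom transformer; the class and
its pass-through behavior are purely illustrative and are not one of the
shipped transformations:

import numpy
import sklearn.base


class IdentityScaler(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    """Illustrative transformer that passes features through unchanged."""

    def fit(self, x, y=None):
        # nothing to learn here; real transformers compute statistics from x
        return self

    def set_output(self, *, transform=None):
        # record the requested output container (per the scikit-learn API) and
        # return self so calls can be chained
        self.output_container = transform
        return self

    def transform(self, x):
        # return features unchanged; real transformers modify x here
        return numpy.asarray(x)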
This repo was designed to make it easy to download and manipulate data from a
variety of different sources (which may have many deficiencies) with a suite of
transformations. The vast majority of the legwork is done within the
mlds.datasets.process function; once datasets are downloaded via a downloader,
data transformations are composed and applied via ColumnTransformer. As
described above, if “train” is a key in the dictionary representing the
dataset, then the transformers are fit only to this partition.
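As a standalone illustration of that composition (this is generic scikit-learn
usage with placeholder column names and values, not the repo's internal code):

import pandas
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# toy train/test partitions with one categorical and one numeric feature
train = pandas.DataFrame({"protocol_type": ["tcp", "udp"], "duration": [0.0, 2.0]})
test = pandas.DataFrame({"protocol_type": ["tcp"], "duration": [1.0]})

# compose per-column transformations, mirroring the NSL-KDD example above
transformer = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(), ["protocol_type"]),
        ("minmax", MinMaxScaler(), ["duration"]),
    ]
)
x_train = transformer.fit_transform(train)  # fit only on the training partition
x_test = transformer.transform(test)  # apply the fitted transformer elsewhere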
Notably, the order of features is preserved, post-transformation (including
expanding categorical features with one-hot vectors), and a set of metadata is
saved with the transformed dataset, including: the names of the partitions, the
transformations that were applied, the current abbreviated commit hash of the
repo, the number of samples, the number of features, the number of classes,
one-hot mappings (if applicable), and class mappings (if a label encoding
scheme was applied). Metadata that pertains to the entire dataset (such as the
transformations that were applied) is saved as a dictionary attribute for
Dataset objects, while partition-specific information (such as the number of
samples) is saved as an attribute for Partition objects (which are set as
attributes in Dataset objects).