Machine Learning Datasets (mlds) is a repo for downloading, preprocessing,
and numpy-ifying popular machine learning datasets. Across projects, I commonly
found myself rewriting the same lines of code to standardize, normalize, or
other-ize data, encode categorical variables, parse out subsets of features,
among other transformations. This repo provides a simple interface to download,
parse, transform, and clean datasets via flexible command-line arguments. All
of the information you need to start using this repo is contained within this
one README, ordered by complexity (no need to parse through any ReadTheDocs
documentation).
Integrating a data source into mlds requires writing a simple module within the
mlds.downloaders package to download the dataset (by implementing a retrieve
function). Afterwards, add the module to the package initializer and the
dataset is ready to be processed by mlds. Notably, the only requirement is that
retrieve should return a dict containing dataset partitions as keys and pandas
DataFrames as values. At this time, the following datasets are supported:
This repo was designed to be interoperable with the following models and
attacks repos. I recommend installing an editable version of this repo via
pip install -e. Afterwards, you can download, rescale, and save MNIST so that
features are in [0, 1] (via a custom UniformScaler transformer, which scales
all features based on the maximum and minimum feature values observed across
the training set) with the dataset referenced as mnist via:
mlds mnist -f all UniformScaler --filename mnist
You can download, rescale, and save the NSL-KDD so that categorical features
are one-hot encoded (via the scikit-learn OneHotEncoder) while the remaining
features (all can be used as an alias to select all features except those that
are to be one-hot encoded) are individually rescaled to be in [0, 1] (via the
scikit-learn MinMaxScaler), with numeric label encoding (via the scikit-learn
LabelEncoder), with the dataset referenced as
nslkdd_OneHotEncoder_MinMaxScaler_LabelEncoder (when filename is not
specified, the file name defaults to the name of the dataset concatenated with
the transformations) via:
mlds nslkdd -f protocol_type,service,flag OneHotEncoder -f all MinMaxScaler -l LabelEncoder
Afterwards, import the dataset filename to load it:
>>> import mlds
>>> from mlds import mnist
>>> mnist
mnist(samples=(60000, 10000), features=(784, 784), classes=(10, 10), partitions=(train, test), transformations=(UniformScaler), version=c88b3d6)
>>> mnist.train
train(samples=60000, features=784, classes=10)
>>> mnist.train.data
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
>>> mnist.train.labels
array([4., 1., 0., ..., 6., 1., 5.], dtype=float32)
Other uses can be found in the examples directory.
Below are descriptions of some of the more subtle controls within this repo and
complex use cases.
API: Should you wish to interface with mlds outside of the command line, the
main entry point for the repo is the process function within the datasets
module. The docstring contains all of the necessary details surrounding
arguments and their required types. An example of interfacing with mlds in
this way can be found in the dnn_prepare example script.
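As a rough sketch (the keyword arguments below are hypothetical, modeled on the
command-line flags rather than the actual signature, so consult the docstring
for the real interface), a call mirroring the NSL-KDD example above might look
like:

import mlds.datasets

# hypothetical sketch only: the keyword names below mirror the command-line
# flags and are assumptions, not the real signature (see the process docstring)
mlds.datasets.process(
    dataset="nslkdd",
    features=(("protocol_type", "service", "flag"), ("all",)),  # feature subsets to transform
    schemes=(("OneHotEncoder",), ("MinMaxScaler",)),  # transformations per feature subset
    labels=("LabelEncoder",),  # label transformation
    filename="nslkdd",  # name of the saved dataset file
)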
Cleaning: When a dataset is to be processed from the command line, it can be
optionally cleaned by passing --destupefy as an argument. This applies a
transformation, after all other transformations, via the Destupefier. The
Destupefier is an experimental transformer that removes duplicate columns,
duplicate rows, and single-valued columns. This is particularly useful when
exploring data reduction techniques that may produce a large number of
irrelevant features or duplicate samples.
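For example, the MNIST command from earlier could be extended with cleaning as
follows (only --destupefy itself is documented above; its placement within the
command is an assumption):

mlds mnist -f all UniformScaler --destupefy --filename mnist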
Downloading: If you wish to add a dataset that is not easily retrievable from
another library (e.g., TensorFlow Datasets, Torchvision Datasets,
pandas.read_csv, etc.), then you may find the download function within
mlds.downloaders helpful. Specifically, given a tuple of URLs,
mlds.downloaders.download leverages the Requests library to return a
dictionary containing the requested resources as bytes objects.
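Tying this together with the retrieve convention described earlier, a minimal
downloader module might look like the following sketch (the module name, URLs,
and parsing are hypothetical, and it is assumed here that the returned
dictionary is keyed by URL):

# hypothetical downloader module (e.g., mlds/downloaders/mydataset.py); the
# URLs and CSV parsing are placeholders for illustration only
import io

import pandas

import mlds.downloaders


def retrieve():
    """Downloads and parses a fictional CSV-based dataset."""
    urls = (
        "https://example.com/mydataset/train.csv",
        "https://example.com/mydataset/test.csv",
    )

    # download fetches each URL and returns the resources as bytes objects
    # (assumed here to be keyed by URL)
    resources = mlds.downloaders.download(urls)

    # partition names are the keys; a "train" key enables fit-on-train behavior
    return {
        "train": pandas.read_csv(io.BytesIO(resources[urls[0]])),
        "test": pandas.read_csv(io.BytesIO(resources[urls[1]])),
    }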
Partitions: Following best practices, transformations are fit to the training
set and applied to all other partitions. A partition is considered to be a
training set if its key is “train” in the data dictionary returned by dataset
downloaders (this is commonly the case when downloading datasets through other
frameworks such as Torchvision Datasets and TensorFlow Datasets). If the data
dictionary does not contain this key, then all partitions are fitted and
transformed separately.
Transformations: Currently, the following data transformations are supported:
0-1 scaling (across all features): UniformScaler
0-1 scaling (per feature): MinMaxScaler (scikit-learn)
One-hot encoding (for categorical features): OneHotEncoder (scikit-learn)
Numeric label encoding: LabelEncoder (scikit-learn)
Adding custom transformations requires inheriting sklearn.base.BaseEstimator
and sklearn.base.TransformerMixin and implementing fit, set_output, and
transform methods (a minimal sketch is shown below). Alternatively,
transformations from other libraries can be simply aliased into the
transformations module.
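The sketch below shows the shape of such a custom transformer; the class and
its pass-through behavior are purely illustrative and are not one of the
shipped transformations:

import numpy
import sklearn.base


class IdentityScaler(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    """Illustrative transformer that passes features through unchanged."""

    def fit(self, x, y=None):
        # nothing to learn here; real transformers compute statistics from x
        return self

    def set_output(self, *, transform=None):
        # record the requested output container (per the scikit-learn API) and
        # return self so calls can be chained
        self.output_container = transform
        return self

    def transform(self, x):
        # return features unchanged; real transformers modify x here
        return numpy.asarray(x)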
This repo was designed to make it easy to download and manipulate data from a
variety of different sources (which may have many deficiencies) with a suite of
transformations. The vast majority of the legwork is done within the
mlds.datasets.process function; once datasets are downloaded via a downloader,
data transformations are composed and applied via ColumnTransformer. As
described above, if “train” is a key in the dictionary representing the
dataset, then the transformers are fit only to this partition.
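As a standalone illustration of that composition (this is generic scikit-learn
usage with placeholder column names and values, not the repo's internal code):

import pandas
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# toy train/test partitions with one categorical and one numeric feature
train = pandas.DataFrame({"protocol_type": ["tcp", "udp"], "duration": [0.0, 2.0]})
test = pandas.DataFrame({"protocol_type": ["tcp"], "duration": [1.0]})

# compose per-column transformations, mirroring the NSL-KDD example above
transformer = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(), ["protocol_type"]),
        ("minmax", MinMaxScaler(), ["duration"]),
    ]
)
x_train = transformer.fit_transform(train)  # fit only on the training partition
x_test = transformer.transform(test)  # apply the fitted transformer elsewhere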
Notably, the order of features is preserved, post-transformation (including
expanding categorical features with one-hot vectors), and a set of metadata is
saved with the transformed dataset, including: the names of the partitions, the
transformations that were applied, the current abbreviated commit hash of the
repo, the number of samples, the number of features, the number of classes,
one-hot mappings (if applicable), and class mappings (if a label encoding
scheme was applied). Metadata that pertains to the entire dataset (such as the
transformations that were applied) is saved as a dictionary attribute for
Dataset objects, while partition-specific information (such as the number of
samples) is saved as an attribute for Partition objects (which are set as
attributes in Dataset objects).