Project author: sheatsley

Project description:
Scripts for downloading, preprocessing, and numpy-ifying popular machine learning datasets
Language: Python
Repository: git://github.com/sheatsley/datasets.git
Created: 2019-06-12T19:16:06Z
Community: https://github.com/sheatsley/datasets

License: BSD 3-Clause "New" or "Revised" License


Machine Learning Datasets

Machine Learning Datasets (mlds) is a repo for downloading, preprocessing,
and numpy-ifying popular machine learning datasets. Across projects, I commonly
found myself rewriting the same lines of code to standardize, normalize, or
other-ize data, encode categorical variables, parse out subsets of features,
among other transformations. This repo provides a simple interface to download,
parse, transform, and clean datasets via flexible command-line arguments. All
of the information you need to start using this repo is contained within this
one README, ordered by complexity (no need to parse through any ReadTheDocs
documentation).


Datasets

Integrating a data source into mlds requires writing a simple module within
the mlds.downloaders package to download the dataset (by implementing a
retrieve function). Afterwards, add the module to the package initializer
and the dataset is ready to be processed by mlds. Notably, the only
requirement is that retrieve should return a dict containing dataset
partitions as keys and pandas DataFrames as values. At this time, the
following datasets are supported:
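To make the downloader contract concrete, here is a minimal sketch of a hypothetical downloader module (the dataset, column names, and values are invented for illustration; a real module would download and parse files instead of fabricating frames):

```python
# A hypothetical mlds downloader module: the only contract is that
# retrieve() returns a dict mapping partition names ("train", "test",
# etc.) to pandas DataFrames.
import pandas as pd


def retrieve(directory="/tmp/mydataset"):
    """Return dataset partitions as pandas DataFrames."""
    # A real implementation would fetch and parse files from `directory`
    # or a remote source; these tiny frames only show the expected shape.
    train = pd.DataFrame(
        {"feature_a": [0.1, 0.9], "feature_b": [1.0, 0.0], "label": ["cat", "dog"]}
    )
    test = pd.DataFrame(
        {"feature_a": [0.5], "feature_b": [0.5], "label": ["cat"]}
    )
    return {"train": train, "test": test}
```

A module shaped like this, once added to the package initializer, would be picked up like any other dataset.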

Quick Start

This repo was designed to be interoperable with the following models and
attacks repos. I recommend installing an editable version of this repo via
pip install -e. Afterwards, you can download, rescale, and save MNIST so
that features are in [0, 1] (via a custom UniformScaler transformer, which
scales all features based on the maximum and minimum feature values observed
across the training set) with the dataset referenced as mnist via:

  mlds mnist -f all UniformScaler --filename mnist

You can download, rescale, and save the NSL-KDD so that categorical features
are one-hot encoded (via the scikit-learn OneHotEncoder) while the remaining
features (all can be used as an alias to select all features except those
that are to be one-hot encoded) are individually rescaled to be in [0, 1]
(via the scikit-learn MinMaxScaler), with numeric label encoding (via the
scikit-learn LabelEncoder), with the dataset referenced as
nslkdd_OneHotEncoder_MinMaxScaler_LabelEncoder (when filename is not
specified, the file name defaults to the name of the dataset concatenated
with the transformations) via:

  mlds nslkdd -f protocol_type,service,flag OneHotEncoder -f all MinMaxScaler -l LabelEncoder

Afterwards, import the dataset filename to load it:

  >>> import mlds
  >>> from mlds import mnist
  >>> mnist
  mnist(samples=(60000, 10000), features=(784, 784), classes=(10, 10), partitions=(train, test), transformations=(UniformScaler), version=c88b3d6)
  >>> mnist.train
  train(samples=60000, features=784, classes=10)
  >>> mnist.train.data
  array([[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         ...,
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
  >>> mnist.train.labels
  array([4., 1., 0., ..., 6., 1., 5.], dtype=float32)

Other uses can be found in the
examples
directory.

Advanced Usage

Below are descriptions of some of the more subtle controls within this repo and
complex use cases.

  • API: Should you wish to interface with mlds outside of the command line, the
    main entry point for the repo is the
    process
    function within the datasets module. The docstring contains all of the
    necessary details surrounding arguments and their required types. An
    example of interfacing with mlds in this way can be found in the
    dnn_prepare
    example script.

  • Cleaning: When a dataset is to be processed from the command line, it can
    optionally be cleaned by passing --destupefy as an argument. This applies a
    transformation, after all other transformations, via the Destupefier.
    The Destupefier is an experimental transformer that removes duplicate
    columns, duplicate rows, and single-valued columns. This is particularly
    useful when exploring data reduction techniques that may produce a large
    number of irrelevant features or duplicate samples.

  • Downloading: If you wish to add a dataset that is not easily retrievable
    from another library (e.g., TensorFlow Datasets, Torchvision Datasets,
    pandas.read_csv, etc.), then you may find the download function within
    mlds.downloaders helpful. Specifically, given a tuple of URLs,
    mlds.downloaders.download leverages the Requests library to return a
    dictionary containing the requested resource as a bytes object.

  • Partitions: Following best practices, transformations are fit to the
    training set and applied to all other partitions. A partition is considered
    to be a training set if its key is "train" in the data dictionary returned
    by dataset downloaders (this is commonly the case when downloading datasets
    through other frameworks such as Torchvision Datasets and TensorFlow
    Datasets). If the data dictionary does not contain this key, then all
    partitions are fitted and transformed separately.

  • Transformations: Currently, the following data transformations are supported:

    • 0-1 scaling (per feature): MinMaxScaler
    • One-hot encoding: OneHotEncoder
    • Categorical to integer encoding: OrdinalEncoder
    • Outlier-aware scaling: RobustScaler
    • Remove mean and scale to unit variance: StandardScaler
    • 0-1 scaling (across all features): UniformScaler

      Adding custom transformations requires inheriting
      sklearn.base.BaseEstimator and sklearn.base.TransformerMixin and
      implementing fit, set_output, and transform methods. Alternatively,
      transformations from other libraries can be simply aliased into the
      transformations module.
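As a minimal sketch of the custom-transformation interface described above, the following subclasses sklearn.base.BaseEstimator and sklearn.base.TransformerMixin and implements fit, set_output, and transform. UnitNormScaler is a hypothetical name invented for this example, not a transformer shipped with mlds:

```python
# Hypothetical custom transformation following the interface described
# above: subclass BaseEstimator and TransformerMixin, then implement
# fit, set_output, and transform.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class UnitNormScaler(BaseEstimator, TransformerMixin):
    """Scale each sample to unit L2 norm."""

    def fit(self, x, y=None):
        # This transformation learns nothing from the training set.
        return self

    def set_output(self, transform=None):
        # Supports the set_output API; this sketch always returns
        # numpy arrays, so the method is a no-op.
        return self

    def transform(self, x):
        x = np.asarray(x, dtype=float)
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        # Guard against all-zero rows to avoid dividing by zero.
        return x / np.where(norms == 0.0, 1.0, norms)
```

Once defined (or aliased into the transformations module), such a class can be selected from the command line like any built-in transformation.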

Repo Overview

This repo was designed to make it easy to download and manipulate data from a
variety of different sources (which may have many deficiencies) with a suite of
transformations. The vast majority of the legwork is done within the
mlds.datasets.process method; once datasets are downloaded via a downloader,
then data transformations are composed and applied via
ColumnTransformer.
As described above, if “train” is a key in the dictionary representing the
dataset, then the transformers are fit only to this partition.
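An illustrative sketch of this composition pattern (not mlds's actual code; the toy DataFrames are invented, mirroring the NSL-KDD example above) fits a scikit-learn ColumnTransformer on the training partition only and then applies it to the other partitions:

```python
# Sketch: compose per-feature-subset transformations with
# ColumnTransformer, fitting only to the "train" partition.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

train = pd.DataFrame({"protocol_type": ["tcp", "udp"], "duration": [0.0, 10.0]})
test = pd.DataFrame({"protocol_type": ["tcp"], "duration": [5.0]})

transformer = ColumnTransformer(
    [
        # One-hot encode the categorical feature...
        ("onehot", OneHotEncoder(), ["protocol_type"]),
        # ...and rescale the numeric feature to [0, 1].
        ("minmax", MinMaxScaler(), ["duration"]),
    ]
)
# Fit to the training set only, then apply to every partition.
train_arr = transformer.fit_transform(train)
test_arr = transformer.transform(test)
```

The one-hot expansion turns the two features into three columns while the relative column order is preserved, matching the feature-order guarantee described below.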

Notably, the order of features is preserved, post-transformation (including
expanding categorical features with one-hot vectors), and a set of metadata is
saved with the transformed dataset, including: the names of the partitions, the
transformations that were applied, the current abbreviated commit hash of the
repo, the number of samples, the number of features, the number of classes,
one-hot mappings (if applicable), and class mappings (if a label encoding
scheme was applied). Metadata that pertains to the entire dataset (such as the
transformations that were applied) is saved as a dictionary attribute for
Dataset objects, while partition-specific information (such as the number of
samples) is saved as an attribute for Partition objects (which are set as
attributes in Dataset objects).