Project author: mlr-org

Project description: Connect mlr3 with OpenML
Language: R
Repository: git://github.com/mlr-org/mlr3oml.git
Created: 2019-10-13T12:05:23Z
Project community: https://github.com/mlr-org/mlr3oml

License: Other

mlr3oml

Package website: release | dev

OpenML integration to the mlr3 ecosystem.

What is mlr3oml?

OpenML is an open-source platform that
facilitates the sharing and dissemination of machine learning research
data. All entities on the platform have unique identifiers and
standardized (meta)data that can be accessed via an open-access REST API
or the web interface. mlr3oml lets you work with the REST API from
R and integrates OpenML with the mlr3
ecosystem. Note that some upload options are currently not supported;
use the OpenML package for these.

As a brief demo, we show how to access an OpenML task, convert it to an
mlr3::Task and associated mlr3::Resampling, and conduct a simple
resample experiment.

    library(mlr3oml)
    library(mlr3)
    # Download and print the OpenML task with ID 145953
    oml_task = otsk(145953)
    oml_task
    ## <OMLTask:145953>
    ## * Type: Supervised Classification
    ## * Data: kr-vs-kp (id: 3; dim: 3196x37)
    ## * Target: class
    ## * Estimation: crossvalidation (id: 1; repeats: 1, folds: 10)

    # Access the OpenML data object on which the task is built
    oml_task$data
    ## <OMLData:3:kr-vs-kp> (3196x37)
    ## * Default target: class

    # Convert the OpenML task to an mlr3 task and resampling
    task = as_task(oml_task)
    resampling = as_resampling(oml_task)
    # Conduct a simple resample experiment
    rr = resample(task, lrn("classif.rpart"), resampling)
    rr$aggregate()
    ## classif.ce
    ## 0.0319181

Besides working with objects with known IDs, data of interest can also
be queried using listing functions. Below, we search for datasets with
10 to 20 features, 100 to 10000 observations, and 2 classes.

    odatasets = list_oml_data(
      number_features = c(10, 20),
      number_instances = c(100, 10000),
      number_classes = 2
    )
    head(odatasets[, c("data_id", "name")])
    ##    data_id            name
    ## 1:      13   breast-cancer
    ## 2:      15        breast-w
    ## 3:      29 credit-approval
    ## 4:      49         heart-c
    ## 5:      50     tic-tac-toe
    ## 6:      51         heart-h

To retrieve an individual dataset, use odt(). You can then either
construct a new Task object from it using as_task() or work with it in
data.table format.

    odataset = odt(29)
    # Dataset as data.table
    str(odataset$data)
    ## Classes 'data.table' and 'data.frame': 690 obs. of 16 variables:
    ## $ A1 : Factor w/ 2 levels "b","a": 1 2 2 1 1 1 1 2 1 1 ...
    ## $ A2 : num 30.8 58.7 24.5 27.8 20.2 ...
    ## $ A3 : num 0 4.46 0.5 1.54 5.62 ...
    ## $ A4 : Factor w/ 4 levels "u","y","l","t": 1 1 1 1 1 1 1 1 2 2 ...
    ## $ A5 : Factor w/ 3 levels "g","p","gg": 1 1 1 1 1 1 1 1 2 2 ...
    ## $ A6 : Factor w/ 14 levels "c","d","cc","i",..: 10 9 9 10 10 7 8 3 6 10 ...
    ## $ A7 : Factor w/ 9 levels "v","h","bb","j",..: 1 2 2 1 1 1 2 1 2 1 ...
    ## $ A8 : num 1.25 3.04 1.5 3.75 1.71 ...
    ## $ A9 : Factor w/ 2 levels "t","f": 1 1 1 1 1 1 1 1 1 1 ...
    ## $ A10 : Factor w/ 2 levels "t","f": 1 1 2 1 2 2 2 2 2 2 ...
    ## $ A11 : int 1 6 0 5 0 0 0 0 0 0 ...
    ## $ A12 : Factor w/ 2 levels "t","f": 2 2 2 1 2 1 1 2 2 1 ...
    ## $ A13 : Factor w/ 3 levels "g","p","s": 1 1 1 1 3 1 1 1 1 1 ...
    ## $ A14 : int 202 43 280 100 120 360 164 80 180 52 ...
    ## $ A15 : int 0 560 824 3 0 0 31285 1349 314 1442 ...
    ## $ class: Factor w/ 2 levels "+","-": 1 1 1 1 1 1 1 1 1 1 ...
    ## - attr(*, ".internal.selfref")=<externalptr>

    # Creating a new task
    otask = as_task(odataset)
    otask
    ## <TaskClassif:credit-approval> (690 x 16)
    ## * Target: class
    ## * Properties: twoclass
    ## * Features (15):
    ## - fct (9): A1, A10, A12, A13, A4, A5, A6, A7, A9
    ## - int (3): A11, A14, A15
    ## - dbl (3): A2, A3, A8

Feature Overview

  • Datasets, tasks, flows, runs, and collections can be downloaded from
    OpenML and are represented as R6 classes.
  • OpenML objects can be easily converted to the corresponding mlr3
    counterpart.
  • Filtering of OpenML objects can be achieved using listing functions.
  • Downloaded objects can be cached by setting the mlr3oml.cache
    option.
  • Both the ARFF and Parquet file formats are supported for datasets.
  • You can upload datasets, tasks, and collections to OpenML.
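
The caching and file-format features listed above are controlled through R options. The following is a minimal configuration sketch, not a definitive setup: the mlr3oml.cache option is named in the list above, while the mlr3oml.parquet option and the exact accepted values are assumptions based on our reading of the package documentation, so check the package reference before relying on them.

```r
# Enable caching of downloaded OpenML objects for this session.
# Assumption: TRUE selects a default cache directory, while a character
# path would select a custom directory instead.
options(mlr3oml.cache = TRUE)

# Assumption: this option makes Parquet the default dataset file format
# instead of ARFF.
options(mlr3oml.parquet = TRUE)

# Inspect the values that are currently set.
getOption("mlr3oml.cache")
getOption("mlr3oml.parquet")
```

Placing these calls in a project-level .Rprofile would apply them to every session, so that repeated calls such as odt(29) could reuse the cached copy rather than downloading the dataset again.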

Documentation

Bugs, Questions, Feedback

mlr3oml is a free and open source software project that encourages
participation and feedback. If you have any issues, questions,
suggestions or feedback, please do not hesitate to open an “issue” about
it on the GitHub page!

In case of problems / bugs, it is often helpful if you provide a
“minimum working example” that showcases the behaviour (but don’t worry
about this if the bug is obvious).