relational-datasets
A small library for loading and downloading relational datasets.
pip install relational-datasets
Beta Release
This API and the datasets at https://github.com/srlearn/datasets/ are currently being experimented with.
- Prefer Julia? Check out RelationalDatasets.jl.
Open enhancements and bugs are tracked here:
But here is a short-term Roadmap:
- Modes: srlearn/datasets: Issue 11
- Converting propositional->relational
  - Problem Settings
    - Binary Classification
      - Classification: (0, 1)
      - Classification: (-1, 1)
      - Classification: maybe recommend sklearn.preprocessing.LabelBinarizer
    - Regression
      - Regression: y ∈ float
    - Multiclass Classification: When target is int and in [0, 1, 2, ...]
  - Categorical datatype support in X matrix.
  - Dataframes: pandas
Use Case 1: Fetching Zipfiles
Running the fetch function downloads a version of a dataset to your local cache:
import relational_datasets
relational_datasets.fetch("toy_cancer")
relational_datasets.fetch("toy_father", "v0.0.3")
relational_datasets.fetch("cora")
Resulting in:
~/relational_datasets/
├── toy_cancer_v0.0.4.zip <--- latest
├── toy_father_v0.0.3.zip <--- specific version
└── cora_v0.0.4.zip <--- latest
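To check what is already cached, the file names above can be split back into a dataset name and a version. This is a minimal sketch using only the standard library. The helpers parse_cache_name and list_cached are hypothetical (they are not part of relational_datasets), and the `<name>_<version>.zip` layout under ~/relational_datasets/ is assumed from the listing above.

```python
from pathlib import Path


def parse_cache_name(filename):
    """Split '<name>_<version>.zip' into (name, version).

    Assumption: the version is everything after the last underscore,
    as in the listing above. Not an API of relational_datasets.
    """
    stem = Path(filename).stem  # drops the trailing '.zip'
    name, _, version = stem.rpartition("_")
    return name, version


def list_cached(cache_dir=None):
    """Return sorted (name, version) pairs for zip files in the cache directory."""
    if cache_dir is None:
        # Assumed default location, taken from the listing above.
        cache_dir = Path.home() / "relational_datasets"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(parse_cache_name(p.name) for p in cache_dir.glob("*.zip"))


print(parse_cache_name("toy_cancer_v0.0.4.zip"))
# ('toy_cancer', 'v0.0.4')
```

Splitting on the last underscore keeps dataset names that themselves contain underscores (such as toy_cancer) intact.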
Use Case 2: Loading Data
The load method returns train and test folds, each with pos, neg, and facts. Internally it uses fetch, so it will automatically download a dataset if it is not available.
For example: "Load fold-2 of webkb"
from relational_datasets import load
train, test = load("webkb", "v0.0.4", fold=2)
len(train.facts)
# 1344
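A quick way to sanity-check a fold is to count its examples. The sketch below assumes only what is described above: each returned object exposes pos, neg, and facts as sequences. The summarize helper is hypothetical, and a SimpleNamespace with made-up toy strings stands in for a real call to load so that the example runs offline.

```python
from types import SimpleNamespace


def summarize(dataset):
    """Count the positive examples, negative examples, and facts in one fold."""
    return {
        "pos": len(dataset.pos),
        "neg": len(dataset.neg),
        "facts": len(dataset.facts),
    }


# Stand-in for: train, test = load("webkb", "v0.0.4", fold=2)
# The strings below are invented for illustration, not real dataset contents.
train = SimpleNamespace(
    pos=["father(a,b)."],
    neg=["father(b,a)."],
    facts=["male(a).", "child(b,a)."],
)

print(summarize(train))
# {'pos': 1, 'neg': 1, 'facts': 2}
```

With the library installed, the same helper should work on the objects returned by load, provided they expose those three attributes.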
Use Case 3: Working with Standard (Vector-based) Machine Learning Datasets
The relational_datasets.convert module has functions for converting vector-based datasets into relational/ILP-style datasets:
Binary Classification
When y is a vector of 0/1:
from relational_datasets.convert import from_numpy
import numpy as np
data, modes = from_numpy(
np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
np.array([0, 0, 1]),
)
data, modes
(RelationalDataset(pos=['v4(id3).'], neg=['v4(id1).', 'v4(id2).'], facts=['v1(id1,0).', 'v1(id2,0).', 'v1(id3,1).', 'v2(id1,1).', 'v2(id2,1).', 'v2(id3,2).', 'v3(id1,1).', 'v3(id2,2).', 'v3(id3,2).']),
['v1(+id,#varv1).', 'v2(+id,#varv2).', 'v3(+id,#varv3).', 'v4(+id).'])
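To make the mapping in that output easier to read, here is a pure-Python sketch that reproduces it for this small input. It is an illustration inferred from the printed result, not the library's implementation; to_relational_sketch is a hypothetical name, and the naming scheme (v1..vN for columns, vN+1 for the label, id1..idM for rows) is taken from the output above.

```python
def to_relational_sketch(X, y):
    """Illustrative re-creation of the output shown above (not the library's code).

    X is a list of rows of ints; y is a list of 0/1 labels.
    """
    n_rows = len(X)
    n_features = len(X[0])
    target = f"v{n_features + 1}"  # the label becomes the last predicate

    # One fact per cell, grouped by column: v<col>(id<row>,<value>).
    facts = [
        f"v{col + 1}(id{row + 1},{X[row][col]})."
        for col in range(n_features)
        for row in range(n_rows)
    ]

    # Rows labelled 1 become positive examples, rows labelled 0 negative ones.
    pos = [f"{target}(id{i + 1})." for i, label in enumerate(y) if label == 1]
    neg = [f"{target}(id{i + 1})." for i, label in enumerate(y) if label == 0]

    # One mode per feature column, plus one for the target predicate.
    modes = [f"v{c + 1}(+id,#varv{c + 1})." for c in range(n_features)]
    modes.append(f"{target}(+id).")
    return pos, neg, facts, modes


pos, neg, facts, modes = to_relational_sketch(
    [[0, 1, 1], [0, 1, 2], [1, 2, 2]],
    [0, 0, 1],
)
print(pos, neg)
# ['v4(id3).'] ['v4(id1).', 'v4(id2).']
```

Each row becomes an identifier, each column a binary predicate over (identifier, value), and the label a unary predicate whose positive and negative examples are the rows labelled 1 and 0.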
Regression
When y is a vector of floats:
from relational_datasets.convert import from_numpy
import numpy as np
data, modes = from_numpy(
np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
np.array([1.1, 0.9, 2.5]),
)
data, modes
(RelationalDataset(pos=['regressionExample(v4(id1),1.1).', 'regressionExample(v4(id2),0.9).', 'regressionExample(v4(id3),2.5).'], neg=[], facts=['v1(id1,0).', 'v1(id2,0).', 'v1(id3,1).', 'v2(id1,1).', 'v2(id2,1).', 'v2(id3,2).', 'v3(id1,1).', 'v3(id2,2).', 'v3(id3,2).']),
['v1(+id,#varv1).', 'v2(+id,#varv2).', 'v3(+id,#varv3).', 'v4(+id).'])
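In the regression case every example lands in pos and carries its target value inside a regressionExample(...) wrapper. If the numeric targets are needed again, the strings can be parsed back. This is a hedged sketch: parse_regression_example is a hypothetical helper, and the pattern only covers the plain decimal format seen in the output above.

```python
import re

# Matches strings such as 'regressionExample(v4(id1),1.1).'
# Assumption: values are plain decimals (no scientific notation).
_REGRESSION = re.compile(
    r"regressionExample\((\w+)\((\w+)\),(-?\d+(?:\.\d+)?)\)\."
)


def parse_regression_example(example):
    """Return (predicate, identifier, value) from one regressionExample string."""
    match = _REGRESSION.fullmatch(example)
    if match is None:
        raise ValueError(f"not a regression example: {example!r}")
    predicate, identifier, value = match.groups()
    return predicate, identifier, float(value)


examples = [
    "regressionExample(v4(id1),1.1).",
    "regressionExample(v4(id2),0.9).",
    "regressionExample(v4(id3),2.5).",
]

# Map each row identifier to its numeric target.
targets = {
    identifier: value
    for _, identifier, value in map(parse_regression_example, examples)
}
print(targets)
# {'id1': 1.1, 'id2': 0.9, 'id3': 2.5}
```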
load_breast_cancer
This example preprocesses the dataset returned by scikit-learn's load_breast_cancer, which is based on the Breast Cancer Wisconsin dataset.
Here we: (1) load the data and class labels, (2) split into training and test sets, (3) bin the continuous features to discrete, and (4) convert to the relational format.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
# (1) Load
X, y = load_breast_cancer(return_X_y=True)
# (2) Split
X_train, X_test, y_train, y_test = train_test_split(X, y)
# (3) Discretize
disc = KBinsDiscretizer(n_bins=5, encode="ordinal")
X_train = disc.fit_transform(X_train)
X_test = disc.transform(X_test)
X_train = X_train.astype(int)
X_test = X_test.astype(int)
# (4) Convert
from relational_datasets.convert import from_numpy
train, modes = from_numpy(X_train, y_train)
test, _ = from_numpy(X_test, y_test)
Install
From PyPI
pip install relational-datasets
From GitHub Source
git clone https://github.com/srlearn/relational-datasets.git
cd relational-datasets
pip install -e .
Contributions
- Alexander Hayes - Indiana University, Bloomington
This package was partially based on datasets from the Starling Lab Datasets Collection, which included specific contributions by Harsha Kokel and Devendra Singh Dhami. Tushar Khot converted many to the ILP format from Alchemy 2 format, but that occurred before versions were tracked. Some inspiration was drawn from the "RelationalDatasets" list that Jonas Schouterden collected.