Commit

Add a benchmark on cmc dataset and Update README.md

Alcoholrithm committed Apr 16, 2024
1 parent 3b158c6 commit c236588
Showing 4 changed files with 132 additions and 97 deletions.
188 changes: 94 additions & 94 deletions README.md
[**Overview**](#tabulars3l)
| [**Installation**](#installation)
| [**Available Models with Quick Start Guides**](#available-models-with-quick-start)
| [**Benchmark**](#benchmark)
| [**To DO**](#to-do)
| [**Contributing**](#contributing)
| [**Credit**](#credit)

We provide a Python package ts3l of TabularS3L for users who want to use semi- and self-supervised learning on tabular data.

## Installation

```
pip install ts3l
```

## Available Models with Quick Start

TabularS3L employs a two-phase learning approach, where the learning strategies differ between phases. Below is an overview of the models available within TabularS3L, highlighting the learning strategies employed in each phase. The abbreviations 'Self-SL', 'Semi-SL', and 'SL' represent self-supervised learning, semi-supervised learning, and supervised learning, respectively.

#### VIME

VIME enhances tabular data learning through a dual approach. In its first phase, it uses a pretext task of estimating mask vectors from corrupted tabular data, along with a feature reconstruction pretext task, for self-supervised learning. In its second phase, it applies consistency regularization on augmented views of unlabeled data for semi-supervised learning.

<details>
<summary>Quick Start</summary>

```python

# Assume that we have X_train, X_valid, X_test, y_train, y_valid, y_test, category_cols, and continuous_cols

# Prepare the VIMELightning Module
from ts3l.pl_modules import VIMELightning
from ts3l.utils.vime_utils import VIMEDataset
from ts3l.utils import TS3LDataModule
from ts3l.utils.vime_utils import VIMEConfig
from pytorch_lightning import Trainer
from sklearn.model_selection import train_test_split

metric = "accuracy_score"
input_dim = X_train.shape[1]
hidden_dim = 1024
output_dim = 2
alpha1 = 2.0  # example loss weights; tune for your task
alpha2 = 2.0
beta = 1.0
K = 3
p_m = 0.2

batch_size = 128

X_train, X_unlabeled, y_train, _ = train_test_split(X_train, y_train, train_size = 0.1, random_state=0, stratify=y_train)

config = VIMEConfig( task="classification", loss_fn="CrossEntropyLoss", metric=metric, metric_hparams={},
input_dim=input_dim, hidden_dim=hidden_dim,
output_dim=output_dim, alpha1=alpha1, alpha2=alpha2,
beta=beta, K=K, p_m = p_m,
num_categoricals=len(category_cols), num_continuous=len(continuous_cols)
)

pl_vime = VIMELightning(config)

### First Phase Learning
train_ds = VIMEDataset(X = X_train, unlabeled_data = X_unlabeled, config=config, continuous_cols = continuous_cols, category_cols = category_cols)
valid_ds = VIMEDataset(X = X_valid, config=config, continuous_cols = continuous_cols, category_cols = category_cols)

datamodule = TS3LDataModule(train_ds, valid_ds, batch_size, train_sampler='random')

trainer = Trainer(
accelerator = 'cpu',
max_epochs = 20,
num_sanity_val_steps = 2,
)

trainer.fit(pl_vime, datamodule)

### Second Phase Learning
from ts3l.utils.vime_utils import VIMESemiSLCollateFN

pl_vime.set_second_phase()
train_ds = VIMEDataset(X_train, y_train.values, config, unlabeled_data=X_unlabeled, continuous_cols=continuous_cols, category_cols=category_cols, is_second_phase=True)
valid_ds = VIMEDataset(X_valid, y_valid.values, config, continuous_cols=continuous_cols, category_cols=category_cols, is_second_phase=True)

datamodule = TS3LDataModule(train_ds, valid_ds, batch_size = batch_size, train_sampler="weighted", train_collate_fn=VIMESemiSLCollateFN())

trainer.fit(pl_vime, datamodule)

# Evaluation
from sklearn.metrics import accuracy_score
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, SequentialSampler

test_ds = VIMEDataset(X_test, category_cols=category_cols, continuous_cols=continuous_cols, is_second_phase=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=False, sampler = SequentialSampler(test_ds))

preds = trainer.predict(pl_vime, test_dl)

preds = F.softmax(torch.concat([out.cpu() for out in preds]).squeeze(), dim=1)

accuracy = accuracy_score(y_test, preds.argmax(1))

print("Accuracy %.2f" % accuracy)


print("Accuracy %.2f" % accuracy)
```
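
For intuition about what the first phase learns, here is a minimal NumPy sketch of VIME's mask-and-corrupt procedure, following the VIME paper. `vime_corrupt` is a hypothetical helper shown for illustration only; ts3l builds these pretext inputs internally in `VIMEDataset`.

```python
# Illustrative sketch of VIME's corruption step (not part of ts3l).
import numpy as np

def vime_corrupt(X: np.ndarray, p_m: float, seed: int = 0):
    """Return (mask, corrupted X): each entry is masked with probability p_m,
    and masked entries are replaced by values drawn from that feature's
    empirical (column-wise) distribution."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mask = rng.binomial(1, p_m, size=(n, d))
    # Shuffling each column approximates sampling from the feature's marginal.
    X_bar = np.stack([rng.permutation(X[:, j]) for j in range(d)], axis=1)
    X_tilde = X * (1 - mask) + X_bar * mask
    return mask, X_tilde

X = np.random.default_rng(42).normal(size=(6, 4))
mask, X_tilde = vime_corrupt(X, p_m=0.2)
# The first-phase model receives X_tilde and is trained to predict `mask`
# (mask estimation) and to reconstruct X (feature reconstruction).
```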

</details>

#### SubTab

SubTab turns the task of learning from tabular data into a multi-view representation learning problem by dividing the input features into multiple, partially overlapping subsets. Representations of different subsets of the same sample are trained to agree, which yields robust latent representations for downstream tasks.

<details>
<summary>Quick Start</summary>

```python
# Assume that we have X_train, X_valid, X_test, y_train, y_valid, y_test, category_cols, and continuous_cols

# Prepare the SubTabLightning Module
from ts3l.pl_modules import SubTabLightning
from ts3l.utils.subtab_utils import SubTabDataset, SubTabCollateFN
from ts3l.utils import TS3LDataModule
from ts3l.utils.subtab_utils import SubTabConfig
from pytorch_lightning import Trainer
from sklearn.model_selection import train_test_split

metric = "accuracy_score"
input_dim = X_train.shape[1]
hidden_dim = 1024
output_dim = 2
tau = 1.0  # example temperature for the contrastive loss
use_cosine_similarity = True
use_contrastive = True
use_distance = True
n_subsets = 4
overlap_ratio = 0.75

mask_ratio = 0.1
noise_type = "Swap"
noise_level = 0.1

batch_size = 128
max_epochs = 3

X_train, X_unlabeled, y_train, _ = train_test_split(X_train, y_train, train_size = 0.1, random_state=0, stratify=y_train)

config = SubTabConfig( task="classification", loss_fn="CrossEntropyLoss", metric=metric, metric_hparams={},
input_dim=input_dim, hidden_dim=hidden_dim,
output_dim=output_dim, tau=tau, use_cosine_similarity= use_cosine_similarity, use_contrastive=use_contrastive, use_distance=use_distance,
n_subsets=n_subsets, overlap_ratio=overlap_ratio, mask_ratio=mask_ratio, noise_type=noise_type, noise_level=noise_level
)

pl_subtab = SubTabLightning(config)

### First Phase Learning
train_ds = SubTabDataset(X_train, unlabeled_data=X_unlabeled)
valid_ds = SubTabDataset(X_valid)

datamodule = TS3LDataModule(train_ds, valid_ds, batch_size, train_sampler='random', train_collate_fn=SubTabCollateFN(config), valid_collate_fn=SubTabCollateFN(config), n_jobs = 4)

trainer = Trainer(
accelerator = 'cpu',
max_epochs = max_epochs,
num_sanity_val_steps = 2,
)

trainer.fit(pl_subtab, datamodule)

### Second Phase Learning

pl_subtab.set_second_phase()

train_ds = SubTabDataset(X_train, y_train.values)
valid_ds = SubTabDataset(X_valid, y_valid.values)

datamodule = TS3LDataModule(train_ds, valid_ds, batch_size = batch_size, train_sampler="weighted", train_collate_fn=SubTabCollateFN(config), valid_collate_fn=SubTabCollateFN(config))

trainer.fit(pl_subtab, datamodule)

# Evaluation
from sklearn.metrics import accuracy_score
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, SequentialSampler

test_ds = SubTabDataset(X_test)
test_dl = DataLoader(test_ds, batch_size, shuffle=False, sampler = SequentialSampler(test_ds), num_workers=4, collate_fn=SubTabCollateFN(config))

preds = trainer.predict(pl_subtab, test_dl)

preds = F.softmax(torch.concat([out.cpu() for out in preds]).squeeze(), dim=1)

accuracy = accuracy_score(y_test, preds.argmax(1))

print("Accuracy %.2f" % accuracy)
```
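
To make the multi-view idea concrete, the sketch below shows one way to slice features into overlapping subsets, following the SubTab paper. `subtab_subsets` is a hypothetical helper for illustration; ts3l generates the subsets internally via `SubTabCollateFN`, and its exact slicing may differ.

```python
# Illustrative sketch of SubTab-style feature subsetting (not part of ts3l).
import numpy as np

def subtab_subsets(X: np.ndarray, n_subsets: int, overlap_ratio: float):
    """Split the feature columns into n_subsets overlapping views; each view
    borrows an extra overlap_ratio fraction of columns from its neighbour."""
    d = X.shape[1]
    base = d // n_subsets
    n_overlap = int(np.ceil(overlap_ratio * base))
    views = []
    for i in range(n_subsets):
        start = max(0, i * base - n_overlap)
        stop = d if i == n_subsets - 1 else (i + 1) * base
        views.append(X[:, start:stop])
    return views

X = np.random.default_rng(0).normal(size=(4, 16))
views = subtab_subsets(X, n_subsets=4, overlap_ratio=0.75)
print([v.shape for v in views])  # [(4, 4), (4, 7), (4, 7), (4, 7)]
# Each view is encoded separately; the shared encoder is trained so that
# views of the same row agree (contrastive / distance / reconstruction losses).
```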

</details>

#### SCARF

SCARF introduces a contrastive learning framework specifically tailored for tabular data. It creates a positive view of each sample by corrupting a random subset of its features with values drawn from their empirical marginal distributions, then trains an encoder with the InfoNCE contrastive loss.

<details>
<summary>Quick Start</summary>

```python
# Assume that we have X_train, X_valid, X_test, y_train, y_valid, y_test, category_cols, and continuous_cols

# Prepare the SCARFLightning Module
from ts3l.pl_modules import SCARFLightning
from ts3l.utils.scarf_utils import SCARFDataset
from ts3l.utils import TS3LDataModule
from ts3l.utils.scarf_utils import SCARFConfig
from pytorch_lightning import Trainer
from sklearn.model_selection import train_test_split

metric = "accuracy_score"
input_dim = X_train.shape[1]
hidden_dim = 1024
output_dim = 2
encoder_depth = 3
head_depth = 1
dropout_rate = 0.04

corruption_rate = 0.6

batch_size = 128
max_epochs = 10

X_train, X_unlabeled, y_train, _ = train_test_split(X_train, y_train, train_size = 0.1, random_state=0, stratify=y_train)

config = SCARFConfig( task="classification", loss_fn="CrossEntropyLoss", metric=metric, metric_hparams={},
input_dim=input_dim, hidden_dim=hidden_dim,
output_dim=output_dim, encoder_depth=encoder_depth, head_depth=head_depth,
dropout_rate=dropout_rate, corruption_rate = corruption_rate
)

pl_scarf = SCARFLightning(config)

### First Phase Learning
train_ds = SCARFDataset(X_train, unlabeled_data=X_unlabeled, config = config)
valid_ds = SCARFDataset(X_valid, config=config)

datamodule = TS3LDataModule(train_ds, valid_ds, batch_size=batch_size, train_sampler="random")

trainer = Trainer(
accelerator = 'cpu',
max_epochs = max_epochs,
num_sanity_val_steps = 2,
)

trainer.fit(pl_scarf, datamodule)

### Second Phase Learning

pl_scarf.set_second_phase()

train_ds = SCARFDataset(X_train, y_train.values, is_second_phase=True)
valid_ds = SCARFDataset(X_valid, y_valid.values, is_second_phase=True)

datamodule = TS3LDataModule(train_ds, valid_ds, batch_size = batch_size, train_sampler="weighted")

trainer.fit(pl_scarf, datamodule)

# Evaluation
from sklearn.metrics import accuracy_score
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, SequentialSampler

test_ds = SCARFDataset(X_test, is_second_phase=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=False, sampler = SequentialSampler(test_ds), num_workers=4)

preds = trainer.predict(pl_scarf, test_dl)

preds = F.softmax(torch.concat([out.cpu() for out in preds]).squeeze(), dim=1)

accuracy = accuracy_score(y_test, preds.argmax(1))

print("Accuracy %.2f" % accuracy)
```
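
For intuition about the first-phase views, here is a minimal NumPy sketch of SCARF-style corruption, following the SCARF paper. `scarf_corrupt` is a hypothetical helper for illustration only; ts3l constructs the positive views internally in `SCARFDataset`.

```python
# Illustrative sketch of SCARF's corruption step (not part of ts3l).
import numpy as np

def scarf_corrupt(X: np.ndarray, corruption_rate: float, seed: int = 0):
    """For each row, replace a random corruption_rate fraction of features
    with values sampled from those features' empirical marginals."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_corrupt = int(corruption_rate * d)
    X_tilde = X.copy()
    for i in range(n):
        cols = rng.choice(d, size=n_corrupt, replace=False)
        rows = rng.integers(0, n, size=n_corrupt)
        X_tilde[i, cols] = X[rows, cols]  # draw each feature from a random row
    return X_tilde

X = np.random.default_rng(1).normal(size=(8, 10))
X_tilde = scarf_corrupt(X, corruption_rate=0.6)
# (x_i, x̃_i) form a positive pair for the InfoNCE contrastive loss;
# the other rows in the batch serve as negatives.
```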

</details>

## Benchmark

We provide a simple benchmark comparing the TabularS3L models against XGBoost. The train-validation-test ratio is 6:2:2, and we tuned each model over 50 trials using Optuna. The results are averaged over random seeds 0 to 4, with the best result per column in bold. 'acc', 'b-acc', and 'mse' denote accuracy, balanced accuracy, and mean squared error, respectively.

Use this benchmark for reference only, as only a small number of random seeds were used.

##### 10% labeled samples

| Model | diabetes (acc) | cmc (b-acc) | abalone (mse) |
|:---:|:---:|:---:|:---:|
| XGBoost | 0.7325 | 0.4763 | **5.5739** |
| VIME | 0.7182 | **0.5087** | 5.6637 |
| SubTab | 0.7312 | 0.4930 | 7.2418 |
| SCARF | **0.7416** | 0.4710 | 5.8888 |

--------

##### 100% labeled samples

| Model | diabetes (acc) | cmc (b-acc) | abalone (mse) |
|:---:|:---:|:---:|:---:|
| XGBoost | 0.7234 | 0.5291 | 4.8377 |
| VIME | **0.7688** | 0.5477 | 4.5804 |
| SubTab | 0.7390 | 0.5432 | 6.3104 |
| SCARF | 0.7442 | **0.5521** | **4.4443** |
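
To make the protocol concrete, here is a minimal sketch of the split-and-tune loop for the XGBoost baseline. The dataset, search space, and metric are illustrative assumptions, not the benchmark code itself (see benchmark/benchmark.py for the actual pipelines).

```python
# Illustrative sketch of the 6:2:2 split + 50-trial Optuna tuning protocol.
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in classification dataset

# 6:2:2 train/validation/test split, as in the benchmark.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0, stratify=y_tmp)

def objective(trial: optuna.Trial) -> float:
    # Assumed search space; the benchmark's actual space may differ.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    }
    model = XGBClassifier(**params).fit(X_train, y_train)
    return accuracy_score(y_valid, model.predict(X_valid))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)  # tuned over 50 trials per model

best_model = XGBClassifier(**study.best_params).fit(X_train, y_train)
print("test acc: %.4f" % accuracy_score(y_test, best_model.predict(X_test)))
```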

## To DO

- [x] Release nn.Module and Dataset of VIME, SubTab, and SCARF
- [x] VIME
6 changes: 4 additions & 2 deletions benchmark/benchmark.py
```diff
@@ -1,12 +1,12 @@
 import argparse
-from datasets import load_diabetes, load_abalone
+from datasets import load_diabetes, load_abalone, load_cmc
 from pipelines import VIMEPipeLine, SubTabPipeLine, SCARFPipeLine, XGBPipeLine

 def main():
     parser = argparse.ArgumentParser(add_help=True)

     parser.add_argument('--model', type=str, choices=["xgb", "vime", "subtab", "scarf"])
-    parser.add_argument('--data', type=str, choices=["diabetes", "abalone"])
+    parser.add_argument('--data', type=str, choices=["diabetes", "abalone", "cmc"])

     parser.add_argument('--labeled_sample_ratio', type=float, default=0.1)
     parser.add_argument('--valid_size', type=float, default=0.2)
@@ -34,6 +34,8 @@ def main():
         load_data = load_diabetes
     elif args.data == "abalone":
         load_data = load_abalone
+    elif args.data == "cmc":
+        load_data = load_cmc

     data, label, continuous_cols, category_cols, output_dim, metric, metric_hparams = load_data()
```
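
With `load_cmc` wired in, the new dataset can be benchmarked the same way as the existing ones, e.g. `python benchmark/benchmark.py --model vime --data cmc --labeled_sample_ratio 0.1` (the exact invocation is inferred from the argparse options above).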