We’ve released a new open dataset on Hugging Face: GARCH Densities, a large-scale benchmark for density estimation, option pricing, and risk modeling in quantitative finance.
Created with Paul Wilmott, this dataset contains simulations from the GJR-GARCH model with Hansen skewed-t innovations. Each row links a parameter set to the inverse-CDF quantiles of terminal returns over multiple maturities.
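
To make the row structure concrete, here is a minimal sketch of how one such row could be produced: simulate GJR-GARCH(1,1) paths with variance recursion var' = omega + (alpha + gamma * 1[eps < 0]) * eps^2 + beta * var, then read off quantiles of the terminal return. This is not the pipeline used to build the dataset. It assumes Gaussian innovations as a stand-in for Hansen skewed-t, an `omega` chosen so the long-run variance is 1, and a zero-mean, unit-variance normalization of returns; the function name and parameter values are hypothetical.

```python
import random
import statistics

def simulate_terminal_returns(alpha, gamma, beta, var0, n_steps, n_paths, seed=0):
    """Simulate GJR-GARCH(1,1) paths and return normalized terminal returns."""
    rng = random.Random(seed)
    # Assumption: omega is set so the long-run variance equals 1
    # (valid for symmetric innovations, where P(z < 0) = 1/2).
    omega = 1.0 - alpha - 0.5 * gamma - beta
    if omega <= 0:
        raise ValueError("parameters are not covariance-stationary")
    totals = []
    for _ in range(n_paths):
        var, total = var0, 0.0
        for _ in range(n_steps):
            z = rng.gauss(0.0, 1.0)  # stand-in for Hansen skewed-t innovations
            eps = var ** 0.5 * z
            total += eps
            # Negative shocks raise next-step variance by an extra gamma * eps^2
            leverage = gamma if eps < 0 else 0.0
            var = omega + (alpha + leverage) * eps * eps + beta * var
        totals.append(total)
    # Assumed normalization: zero mean and unit standard deviation
    mean = statistics.fmean(totals)
    std = statistics.pstdev(totals)
    return [(r - mean) / std for r in totals]

returns = simulate_terminal_returns(
    alpha=0.05, gamma=0.1, beta=0.85, var0=1.0, n_steps=20, n_paths=2000
)

# Empirical inverse CDF at a few probability levels
levels = [0.01, 0.25, 0.5, 0.75, 0.99]
cut_points = statistics.quantiles(returns, n=100)  # 99 cut points: 1%, ..., 99%
summary = {p: cut_points[round(p * 100) - 1] for p in levels}
print(summary)
```

The dataset stores the result of this kind of computation at far higher path counts, so a model can learn the mapping from parameters to quantiles without rerunning the simulation.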

Dataset highlights:
- ~1,000 trillion simulated price paths across a 6D parameter space
- Quantiles of normalized returns at 512 probability levels
- Low-discrepancy Sobol sampling under
- zstd-compressed Parquet shards (~145 MB each, streaming-friendly)
- CC-BY-4.0 licensed — free for academic and commercial use
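
A quantile vector is a compact way to store a distribution, because new samples can be drawn from it by inverse-transform sampling. The sketch below illustrates the idea on a stand-in row. The evenly spaced mid-point probability grid and the standard-normal quantiles are assumptions for illustration, not the dataset's actual levels or values, and `sample_from_quantiles` is a hypothetical helper.

```python
import bisect
import random
from statistics import NormalDist

N_LEVELS = 512
# Assumption: evenly spaced mid-point levels; the dataset's actual grid may differ.
levels = [(k + 0.5) / N_LEVELS for k in range(N_LEVELS)]
# Stand-in for one row's quantile vector: a standard normal inverse CDF.
quantiles = [NormalDist().inv_cdf(p) for p in levels]

def sample_from_quantiles(levels, quantiles, u):
    """Linearly interpolate the inverse CDF at probability u (clamped to the grid)."""
    if u <= levels[0]:
        return quantiles[0]
    if u >= levels[-1]:
        return quantiles[-1]
    i = bisect.bisect_right(levels, u)  # levels[i - 1] <= u < levels[i]
    w = (u - levels[i - 1]) / (levels[i] - levels[i - 1])
    return quantiles[i - 1] + w * (quantiles[i] - quantiles[i - 1])

# Draw samples by pushing uniform random numbers through the inverse CDF
rng = random.Random(0)
draws = [sample_from_quantiles(levels, quantiles, rng.random()) for _ in range(10_000)]
print(min(draws), max(draws))
```

Clamping at the grid edges truncates the extreme tails, so a production sampler would likely need an explicit tail model beyond the first and last stored levels.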

Applications:
- Learning return distributions (pretrained pricing models)
- Fast neural surrogates for Monte Carlo simulation
- Option pricing, VaR/CVaR, and volatility surface modeling
- Benchmarking generative and density estimation models
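
As a rough illustration of the risk and pricing applications, the sketch below computes VaR, CVaR, and a European call price directly from a quantile vector. It assumes equally weighted mid-point probability levels, uses standard-normal quantiles as a stand-in for a dataset row, and maps a normalized return z to a log-return via `vol * sqrt(maturity) * z`. That mapping and all function names are hypothetical choices, not something the dataset prescribes.

```python
import math
from statistics import NormalDist

N_LEVELS = 512
# Assumption: evenly spaced mid-point levels, each carrying probability 1/512.
levels = [(k + 0.5) / N_LEVELS for k in range(N_LEVELS)]
# Stand-in for one row of normalized-return quantiles.
z_quantiles = [NormalDist().inv_cdf(p) for p in levels]

def value_at_risk(quantiles, levels, tail=0.05):
    """VaR as the negated return quantile at the first level >= tail."""
    for p, q in zip(levels, quantiles):
        if p >= tail:
            return -q
    return -quantiles[-1]

def expected_shortfall(quantiles, levels, tail=0.05):
    """CVaR as the negated average of the quantiles at levels <= tail."""
    worst = [q for p, q in zip(levels, quantiles) if p <= tail]
    return -sum(worst) / len(worst)

def call_price(quantiles, spot, strike, vol, rate=0.0, maturity=1.0):
    """Average the discounted call payoff over equally weighted quantiles."""
    scale = vol * math.sqrt(maturity)
    # Hypothetical mapping from normalized return z to price growth exp(scale * z)
    growth = [math.exp(scale * z) for z in quantiles]
    # Rescale so the expected terminal price equals the forward spot * exp(rate * T)
    forward_factor = math.exp(rate * maturity) / (sum(growth) / len(growth))
    payoffs = [max(spot * forward_factor * g - strike, 0.0) for g in growth]
    return math.exp(-rate * maturity) * sum(payoffs) / len(payoffs)

var_5 = value_at_risk(z_quantiles, levels)
cvar_5 = expected_shortfall(z_quantiles, levels)
price = call_price(z_quantiles, spot=100.0, strike=100.0, vol=0.2)
print(f"VaR(5%)={var_5:.3f}  CVaR(5%)={cvar_5:.3f}  call={price:.3f}")
```

Because every quantity is a simple sum over 512 stored values, a surrogate that predicts the quantile vector gives risk measures and prices at negligible cost compared with a fresh Monte Carlo run.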

Example usage:
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
ds = load_dataset("sitmo/garch_densities")
train, test = ds["train"], ds["test"]
print(train)           # row count and column names
print(train.features)  # column types

PyTorch integration:
import torch
from torch.utils.data import DataLoader

# Input columns fed to the model
param_cols = ["alpha", "gamma", "beta", "var0", "eta", "lam", "ti"]
# Return the selected columns as torch tensors; "x" is the target column
train = train.with_format("torch", columns=param_cols + ["x"])
loader = DataLoader(train, batch_size=256, shuffle=True)

batch = next(iter(loader))
# Stack the scalar input columns into a (batch, 7) matrix
params = torch.stack([batch[c] for c in param_cols], dim=1)
targets = batch["x"]
print(params.shape, targets.shape)

Goal: accelerate research on pretrained neural pricing models and scientific foundation models in finance by replacing slow Monte Carlo with fast, learned surrogates.
We’d love to see how researchers use this dataset. If you find it useful, consider sharing it or referencing it in your work.
License: CC-BY-4.0
Authors: Thijs van den Berg & Paul Wilmott
Link: https://huggingface.co/datasets/sitmo/garch_densities