Getting Started with Dream Trainer¶
This guide will help you get up and running with Dream Trainer quickly. We'll cover installation and basic usage, then walk through a complete example.
Table of Contents¶
- Installation
- Basic Usage
- Your First Training Run
- Multi-GPU Training
- Logging and Monitoring
- Next Steps
Installation¶
Basic Installation¶
Install Dream Trainer using pip:
pip install dream-trainer
Optional Dependencies¶
Dream Trainer has several optional dependencies for additional features:
# For WandB logging
pip install dream-trainer[wandb]
# For rich progress bars
pip install dream-trainer[rich]
# For metric tracking
pip install dream-trainer[metrics]
# For FP8 quantization support
pip install dream-trainer[torchao]
# For fault tolerance
pip install dream-trainer[torchft]
# Install all optional dependencies
pip install dream-trainer[all]
Development Installation¶
For development or to run examples:
git clone https://github.com/dream3d/dream-trainer.git
cd dream-trainer
pip install -e ".[dev]"
Basic Usage¶
1. Create Your Trainer¶
First, create a custom trainer by extending DreamTrainer:
from dream_trainer import DreamTrainer, DreamTrainerConfig
from dream_trainer.configs import TrainingParameters, DeviceParameters
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
class MyTrainer(DreamTrainer):
def __init__(self, config: DreamTrainerConfig, model: nn.Module):
super().__init__(config)
self.model = model
def configure_models(self):
"""Configure your model(s) here"""
# Models are automatically moved to the correct device
# and wrapped with distributed training wrappers
pass
def configure_optimizers(self):
"""Configure optimizer(s)"""
self.optimizer = AdamW(
self.model.parameters(),
lr=1e-4,
weight_decay=0.01
)
def configure_dataloaders(self):
"""Configure train and validation dataloaders"""
# Example dummy data
train_data = TensorDataset(
torch.randn(1000, 10),
torch.randint(0, 2, (1000,))
)
val_data = TensorDataset(
torch.randn(100, 10),
torch.randint(0, 2, (100,))
)
train_loader = DataLoader(
train_data,
batch_size=32,
shuffle=True
)
val_loader = DataLoader(
val_data,
batch_size=32,
shuffle=False
)
return train_loader, val_loader
def training_step(self, batch, batch_idx):
"""Define a single training step"""
inputs, targets = batch
# Forward pass
outputs = self.model(inputs)
loss = nn.functional.cross_entropy(outputs, targets)
# Backward pass (routed through the trainer)
self.backward(loss)
# Return metrics to log
return {
"loss": loss,
"lr": self.optimizer.param_groups[0]["lr"]
}
def validation_step(self, batch, batch_idx):
"""Define a single validation step"""
inputs, targets = batch
# Forward pass
outputs = self.model(inputs)
loss = nn.functional.cross_entropy(outputs, targets)
# Calculate accuracy
preds = outputs.argmax(dim=1)
accuracy = (preds == targets).float().mean()
return {
"val_loss": loss,
"val_accuracy": accuracy
}
2. Configure Your Training¶
Create a configuration for your training run:
from dream_trainer.callbacks import (
LoggerCallback,
ProgressBar,
CallbackCollection
)
# Create configuration
config = DreamTrainerConfig(
# Project settings
project="my-ml-project",
group="classification",
experiment="baseline-v1",
# Device settings
device_parameters=DeviceParameters(
# Distributed training settings
data_parallel_size=1, # Number of GPUs for data parallelism
tensor_parallel_size=1, # Tensor parallelism degree
pipeline_parallel_size=1, # Pipeline parallelism degree
# Performance settings
compile_model=True, # Use torch.compile
param_dtype=torch.bfloat16, # Mixed precision
),
# Training settings
training_parameters=TrainingParameters(
n_epochs=10,
train_batch_size=32,
gradient_clip_val=1.0,
checkpoint_activations=False,
# Validation settings
val_frequency=0.5, # Validate every half epoch
num_sanity_val_steps=2, # Sanity check before training
),
# Callbacks
callbacks=CallbackCollection([
LoggerCallback(), # Logs metrics to console/WandB
ProgressBar(), # Shows training progress
])
)
3. Train Your Model¶
# Create model
model = nn.Sequential(
nn.Linear(10, 64),
nn.ReLU(),
nn.Linear(64, 32),
nn.ReLU(),
nn.Linear(32, 2)
)
# Create trainer
trainer = MyTrainer(config, model)
if __name__ == "__main__":
from dream_trainer.utils.entrypoint import entrypoint
entrypoint(trainer.fit)
Multi-GPU Training¶
Dream Trainer makes distributed training simple. To use multiple GPUs:
Single Node, Multiple GPUs¶
# Using torchrun (recommended)
torchrun --nproc_per_node=4 train.py
# Or using the trainer directly with updated config
config = DreamTrainerConfig(
# ... other settings ...
device_parameters=DeviceParameters(
data_parallel_size=4, # Use 4 GPUs
)
)
Multiple Nodes¶
# On node 0
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT train.py
# On node 1
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT train.py
Logging and Monitoring¶
WandB Integration¶
To use Weights & Biases for experiment tracking:
from dream_trainer.configs import WandBParameters
config = DreamTrainerConfig(
# ... other settings ...
wandb_parameters=WandBParameters(
project="my-project",
entity="my-team",
tags=["experiment", "classification"],
notes="Initial baseline run"
)
)
Custom Logging¶
You can create custom logging callbacks:
from dream_trainer.callbacks import Callback
class CustomLogger(Callback):
def on_train_batch_end(self, trainer, outputs, batch, batch_idx):
# Log custom metrics
trainer.log("custom_metric", outputs["custom_metric"])
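To activate the callback, add it to the CallbackCollection in your config next to the built-in callbacks. A minimal sketch, reusing the configuration shown earlier:
config = DreamTrainerConfig(
    # ... other settings ...
    callbacks=CallbackCollection([
        LoggerCallback(),
        ProgressBar(),
        CustomLogger(),  # the custom callback defined above
    ])
)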
Next Steps¶
Now that you have the basics, here are some recommended next steps:
- Read the Configuration Guide to learn about all available options
- Check out the Trainer Guide for advanced trainer customization
- Explore Callbacks to extend functionality
- Try Distributed Training for multi-GPU setups
- Look at Examples for complete working code
Common Issues¶
Installation Problems¶
If you encounter installation issues:
- Make sure you have Python 3.10+ installed
- Ensure PyTorch is installed correctly for your CUDA version (see the quick check after this list)
- Try installing in a fresh virtual environment
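If you're unsure whether PyTorch can see your GPU at all, this quick check (plain PyTorch, independent of Dream Trainer) will tell you before you debug anything else:
import torch

print(torch.__version__)          # Installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built for (None on CPU-only builds)
print(torch.cuda.is_available())  # True if a GPU is visible to PyTorch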
Training Issues¶
Common training problems and solutions:
- Out of Memory: Reduce the batch size, enable activation checkpointing, or use gradient accumulation (see the config sketch after this list)
- Slow Training: Enable mixed precision or model compilation (also shown in the sketch below)
- Poor Performance: Check the learning rate and optimizer settings
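As a starting point, the usual knobs are the settings already shown in the configuration example above. This is a sketch of which options to reach for, not a tuned recipe:
config = DreamTrainerConfig(
    # ... other settings ...
    training_parameters=TrainingParameters(
        train_batch_size=16,          # Smaller batches reduce memory pressure
        checkpoint_activations=True,  # Trade recompute for memory
        gradient_clip_val=1.0,        # Helps with unstable losses
    ),
    device_parameters=DeviceParameters(
        param_dtype=torch.bfloat16,   # Mixed precision: faster and lighter on memory
        compile_model=True,           # torch.compile for faster steps
    ),
)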
Getting Help¶
If you need help:
- Check the documentation
- Look at examples
- Open an issue on GitHub
- Join our community chat