# Configuration Guide
This guide explains all configuration options available in Dream Trainer.
## Table of Contents

- [Basic Configuration](#basic-configuration)
- [Device Parameters](#device-parameters)
- [Training Parameters](#training-parameters)
- [Checkpoint Parameters](#checkpoint-parameters)
- [Logging Parameters](#logging-parameters)
- [Advanced Configuration](#advanced-configuration)
- [Best Practices](#best-practices)
- [Environment Variables](#environment-variables)
- [Configuration Validation](#configuration-validation)
- [Next Steps](#next-steps)
## Basic Configuration

The main configuration class is `DreamTrainerConfig`. Here's a basic example:
```python
from dream_trainer import DreamTrainerConfig

config = DreamTrainerConfig(
    project="my-project",
    group="experiments",
    experiment="run-001",
)
```
### Project Settings

| Parameter | Type | Description |
|-----------|------|-------------|
| `project` | `str` | Project name for organization |
| `group` | `str` | Group name for related experiments |
| `experiment` | `str` | Unique experiment identifier |
## Device Parameters

Configure hardware and distributed training settings:
```python
import torch

from dream_trainer.configs import DeviceParameters

device_params = DeviceParameters(
    # Distributed training
    data_parallel_size=1,
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
    context_parallel_size=1,
    # Performance
    compile_model=True,
    param_dtype=torch.bfloat16,
    activation_dtype=torch.bfloat16,
    # Memory optimization
    checkpoint_activations=False,
    offload_optimizer=False,
    offload_parameters=False,
)
```
### Distributed Training

| Parameter | Type | Description |
|-----------|------|-------------|
| `data_parallel_size` | `int` | Number of GPUs for data parallelism |
| `tensor_parallel_size` | `int` | Tensor parallelism degree |
| `pipeline_parallel_size` | `int` | Pipeline parallelism degree |
| `context_parallel_size` | `int` | Context parallelism degree |
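The four degrees compose multiplicatively: as the runtime validation shown at the end of this guide illustrates, their product must match the number of available devices. A minimal sanity-check sketch, assuming the fields above can be read back off the config object:

```python
import torch

from dream_trainer.configs import DeviceParameters

params = DeviceParameters(
    data_parallel_size=2,
    tensor_parallel_size=2,
    pipeline_parallel_size=1,
    context_parallel_size=1,
)

# Assumed rule: the product of all four parallelism degrees
# must equal the number of visible GPUs.
world_size = (
    params.data_parallel_size
    * params.tensor_parallel_size
    * params.pipeline_parallel_size
    * params.context_parallel_size
)
assert world_size == torch.cuda.device_count(), (
    f"requested {world_size} devices, found {torch.cuda.device_count()}"
)
```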
### Performance

| Parameter | Type | Description |
|-----------|------|-------------|
| `compile_model` | `bool` | Use `torch.compile` for optimization |
| `param_dtype` | `torch.dtype` | Parameter data type (e.g., `bfloat16`) |
| `activation_dtype` | `torch.dtype` | Activation data type |
### Memory Optimization

| Parameter | Type | Description |
|-----------|------|-------------|
| `checkpoint_activations` | `bool` | Enable activation checkpointing |
| `offload_optimizer` | `bool` | Offload optimizer states to CPU |
| `offload_parameters` | `bool` | Offload parameters to CPU |
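These switches trade extra compute and host-device traffic for GPU memory. A memory-constrained run might enable them like this (a sketch reusing the same fields as the example above):

```python
from dream_trainer.configs import DeviceParameters

# Recompute activations during the backward pass and keep optimizer
# states in host memory, freeing GPU memory at some throughput cost.
device_params = DeviceParameters(
    checkpoint_activations=True,
    offload_optimizer=True,
    offload_parameters=False,
)
```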
## Training Parameters

Configure training loop settings:
```python
from dream_trainer.configs import TrainingParameters

training_params = TrainingParameters(
    # Basic training
    n_epochs=10,
    train_batch_size=32,
    val_batch_size=32,
    # Optimization
    gradient_clip_val=1.0,
    gradient_accumulation_steps=1,
    max_grad_norm=1.0,
    # Validation
    val_frequency=0.5,
    num_sanity_val_steps=2,
    # Learning rate
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=1000,
)
```
### Basic Training

| Parameter | Type | Description |
|-----------|------|-------------|
| `n_epochs` | `int` | Number of training epochs |
| `train_batch_size` | `int` | Training batch size |
| `val_batch_size` | `int` | Validation batch size |
### Optimization

| Parameter | Type | Description |
|-----------|------|-------------|
| `gradient_clip_val` | `float` | Gradient clipping value |
| `gradient_accumulation_steps` | `int` | Steps for gradient accumulation |
| `max_grad_norm` | `float` | Maximum gradient norm |
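Note that `train_batch_size` must be divisible by the data-parallel degree (see [Configuration Validation](#configuration-validation) below), which suggests it is a global batch size split across data-parallel ranks. Under that assumption, gradient accumulation multiplies the samples seen per optimizer step:

```python
train_batch_size = 32            # global batch, split across ranks
data_parallel_size = 8
gradient_accumulation_steps = 4

# Each rank sees 4 samples per micro-batch; gradients from 4
# micro-batches are summed before each optimizer step.
per_rank_micro_batch = train_batch_size // data_parallel_size
samples_per_optimizer_step = train_batch_size * gradient_accumulation_steps
print(per_rank_micro_batch, samples_per_optimizer_step)  # 4 128
```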
### Validation

| Parameter | Type | Description |
|-----------|------|-------------|
| `val_frequency` | `float` | Validation frequency (epochs) |
| `num_sanity_val_steps` | `int` | Validation steps run before training as a sanity check |
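A `val_frequency` below 1.0 validates within an epoch; for example, 0.5 runs validation twice per epoch. The arithmetic, assuming the frequency is measured in fractions of an epoch as the table indicates:

```python
steps_per_epoch = 1000
val_frequency = 0.5  # validate every half epoch

validate_every_n_steps = int(steps_per_epoch * val_frequency)
print(validate_every_n_steps)  # 500
```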
### Learning Rate

| Parameter | Type | Description |
|-----------|------|-------------|
| `learning_rate` | `float` | Initial learning rate |
| `weight_decay` | `float` | Weight decay coefficient |
| `warmup_steps` | `int` | Learning rate warmup steps |
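Warmup typically ramps the learning rate linearly from zero to `learning_rate` over the first `warmup_steps` steps. Dream Trainer's exact schedule isn't shown in this guide, so the following is only an illustrative sketch:

```python
def warmup_lr(step: int, base_lr: float = 1e-4, warmup_steps: int = 1000) -> float:
    """Linear warmup: ramp from 0 to base_lr, then hold."""
    return base_lr * min(1.0, step / warmup_steps)

print(warmup_lr(100))   # 1e-05
print(warmup_lr(2000))  # 0.0001
```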
## Checkpoint Parameters

Configure model checkpointing:
```python
from dream_trainer.configs import CheckpointParameters

checkpoint_params = CheckpointParameters(
    # Basic settings
    root_dir="./checkpoints",
    monitor="val_loss",
    mode="min",
    # Checkpoint frequency
    checkpoint_every_n_epochs=1,
    checkpoint_every_n_steps=None,
    # Checkpoint management
    keep_top_k=3,
    save_last=True,
    # Resume settings
    resume_mode="latest",  # or "best"
    resume_path=None,
)
```
### Basic Settings

| Parameter | Type | Description |
|-----------|------|-------------|
| `root_dir` | `str` | Checkpoint directory |
| `monitor` | `str` | Metric to monitor |
| `mode` | `str` | `"min"` or `"max"` for the monitored metric |
### Checkpoint Frequency

| Parameter | Type | Description |
|-----------|------|-------------|
| `checkpoint_every_n_epochs` | `Optional[int]` | Save every N epochs |
| `checkpoint_every_n_steps` | `Optional[int]` | Save every N steps |
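The two intervals appear to be alternatives: the example above leaves `checkpoint_every_n_steps` as `None`. To checkpoint by step count instead of by epoch, a reasonable sketch (assuming `None` disables a trigger) is:

```python
from dream_trainer.configs import CheckpointParameters

# Save every 500 optimizer steps instead of once per epoch
# (assumes None disables the epoch-based trigger).
checkpoint_params = CheckpointParameters(
    root_dir="./checkpoints",
    monitor="val_loss",
    mode="min",
    checkpoint_every_n_epochs=None,
    checkpoint_every_n_steps=500,
)
```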
### Checkpoint Management

| Parameter | Type | Description |
|-----------|------|-------------|
| `keep_top_k` | `int` | Keep the best K checkpoints |
| `save_last` | `bool` | Always save the latest checkpoint |
### Resume Settings

| Parameter | Type | Description |
|-----------|------|-------------|
| `resume_mode` | `str` | `"latest"` or `"best"` |
| `resume_path` | `Optional[str]` | Explicit path to a checkpoint to resume from |
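Putting the resume options together: to pick up a run from its best checkpoint rather than the most recent one, combine the monitor settings with `resume_mode`. A sketch using only the fields shown above (the comment about path discovery is an assumption):

```python
from dream_trainer.configs import CheckpointParameters

checkpoint_params = CheckpointParameters(
    root_dir="./checkpoints",
    monitor="val_loss",
    mode="min",
    resume_mode="best",
    resume_path=None,  # assumed: checkpoint located under root_dir when None
)
```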
## Logging Parameters

Configure experiment tracking:
```python
from dream_trainer.configs import WandBParameters

wandb_params = WandBParameters(
    # Basic settings
    project="my-project",
    entity="my-team",
    # Experiment info
    tags=["experiment", "classification"],
    notes="Initial baseline run",
    # Logging settings
    log_model=True,
    log_artifacts=True,
    log_code=True,
)
```
### Basic Settings

| Parameter | Type | Description |
|-----------|------|-------------|
| `project` | `str` | WandB project name |
| `entity` | `str` | WandB entity/team |
### Experiment Info

| Parameter | Type | Description |
|-----------|------|-------------|
| `tags` | `List[str]` | Experiment tags |
| `notes` | `str` | Experiment notes |
### Logging Settings

| Parameter | Type | Description |
|-----------|------|-------------|
| `log_model` | `bool` | Log model checkpoints |
| `log_artifacts` | `bool` | Log artifacts |
| `log_code` | `bool` | Log code changes |
## Advanced Configuration

### Custom Configuration

You can create custom configuration classes:
```python
from dream_trainer.configs import BaseConfig


class CustomConfig(BaseConfig):
    def __init__(self, custom_param: str, **kwargs):
        super().__init__(**kwargs)
        self.custom_param = custom_param
```
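Instantiation then works like any other config; keyword arguments not consumed by `CustomConfig` are forwarded to `BaseConfig`:

```python
config = CustomConfig(custom_param="my-value")
print(config.custom_param)  # "my-value"
```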
### Configuration Inheritance

Configurations can be inherited and extended:
```python
from dream_trainer import DreamTrainerConfig


class ExtendedConfig(DreamTrainerConfig):
    def __init__(self, new_param: int, **kwargs):
        super().__init__(**kwargs)
        self.new_param = new_param
```
### Configuration Validation

Add validation to your configurations:
```python
from typing import Optional

from dream_trainer.configs import BaseConfig


class ValidatedConfig(BaseConfig):
    def __init__(
        self,
        required_param: str,
        optional_param: Optional[int] = None,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.required_param = required_param
        self.optional_param = optional_param

    def validate(self):
        if not self.required_param:
            raise ValueError("required_param cannot be empty")
        if self.optional_param is not None and self.optional_param < 0:
            raise ValueError("optional_param must be non-negative")
```
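Calling `validate()` right after construction surfaces bad values before training starts:

```python
config = ValidatedConfig(required_param="baseline", optional_param=-1)
try:
    config.validate()
except ValueError as err:
    print(err)  # optional_param must be non-negative
```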
## Best Practices

1. **Use Type Hints**: Always use type hints for better IDE support
2. **Validate Inputs**: Add validation for critical parameters
3. **Document Parameters**: Add docstrings for custom parameters
4. **Use Sensible Defaults**: Provide reasonable default values
5. **Group Related Parameters**: Use nested configs for related settings (see the sketch below)
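As an illustration of the last point, the top-level config can be assembled from the nested parameter groups covered above. This is only a sketch: the keyword names `device_parameters` and `training_parameters` are assumptions, so check the `DreamTrainerConfig` signature in your installed version:

```python
from dream_trainer import DreamTrainerConfig
from dream_trainer.configs import DeviceParameters, TrainingParameters

# Hypothetical field names (`device_parameters`, `training_parameters`).
config = DreamTrainerConfig(
    project="my-project",
    group="experiments",
    experiment="run-001",
    device_parameters=DeviceParameters(data_parallel_size=8),
    training_parameters=TrainingParameters(n_epochs=10, train_batch_size=64),
)
```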
## Environment Variables

Dream Trainer respects several environment variables:
```bash
# PyTorch distributed settings
export MASTER_ADDR=localhost
export MASTER_PORT=29500
export WORLD_SIZE=8
export RANK=0

# NCCL settings for better performance
export NCCL_DEBUG=INFO
export NCCL_TREE_THRESHOLD=0

# GPU memory settings
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
## Configuration Validation

Dream Trainer validates configurations at runtime:
```python
from dream_trainer.configs import DeviceParameters, TrainingParameters

# These will raise errors:
DeviceParameters(
    data_parallel_size=3,
    tensor_parallel_size=2,
    # Error: total devices (6) must match available GPUs
)

TrainingParameters(
    train_batch_size=7,
    # Error: batch size must be divisible by data parallel size
)
```
## Next Steps