Installation

Environment Preferences


Standard Installation

  1. Create the zk0 environment:
    conda create -n zk0 python=3.10 -y
    conda activate zk0
    
  2. Install CUDA-enabled PyTorch (for GPU support):
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130 --no-cache-dir
    
  3. Install LeRobot, pinned to version 0.3.3 (manually, before the project install):
    pip install "lerobot[smolvla]==0.3.3"
    
  4. Install project dependencies from pyproject.toml:
    pip install -e .
    
  5. Verify GPU:
    python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
    
    • Expected output: CUDA available: True
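
As a complement to the one-line GPU check, the following minimal sketch reports the Python version and whether the key packages are importable in the active environment. The function name check_environment is hypothetical and not part of zk0; it only uses the standard library plus torch.cuda.is_available() when torch is present.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "lerobot")):
    """Report the Python version and whether each package is importable."""
    report = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    if report.get("torch"):
        import torch  # imported lazily so the check also works without torch

        report["cuda"] = torch.cuda.is_available()
    return report
```

Calling check_environment() inside the activated zk0 environment and printing the result shows at a glance which installation step, if any, still needs attention.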

For running instructions, see docs/RUNNING.

Result Output

Per-client training results and server logs are written to the outputs/ directory. Each run gets its own subdirectory named for the date and time of the run. For example:

outputs/date_time/
├── simulation.log       # Unified logging output (all clients, server, Flower, Ray)
├── server/              # Server-side outputs
│   ├── server.log       # Server-specific logs
│   ├── eval_policy_loss_chart.png      # 📊 AUTOMATIC: Line chart of per-client and server avg policy loss over rounds
│   ├── eval_policy_loss_history.json   # 📊 AUTOMATIC: Historical policy loss data for reproducibility
│   ├── round_N_server_eval.json        # Server evaluation results
│   ├── federated_metrics.json          # Aggregated FL metrics
│   └── federated_metrics.png           # Metrics visualization
├── clients/             # Client-side outputs
│   └── client_N/        # Per-client directories
│       ├── client.log   # Client-specific logs
│       └── round_N.json # Client evaluation metrics (policy_loss, etc.)
└── models/              # Saved model checkpoints
    └── checkpoint_round_N/  # Complete HF-compatible directory
        ├── model.safetensors          # Model weights in safetensors format
        ├── config.json               # Model configuration
        ├── README.md                 # Auto-generated model card with training details
        ├── metrics.json              # Training metrics and insights
        ├── tokenizer.json            # Tokenizer configuration
        ├── tokenizer_config.json     # Tokenizer settings
        ├── special_tokens_map.json   # Special token mappings
        ├── vocab.json                # Vocabulary
        ├── merges.txt                # BPE merges (if applicable)
        ├── generation_config.json    # Text generation settings
        ├── preprocessor_config.json  # Input preprocessing config
        ├── policy_preprocessor.json  # SmolVLA policy preprocessor
        └── policy_postprocessor.json # SmolVLA policy postprocessor
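
The layout above can be navigated programmatically. The sketch below finds the newest run directory and collects each client's round_N.json files. It assumes run directory names sort chronologically as strings (as a name like 2025-01-01_12-00-00 does) and makes no assumption about the keys inside the JSON files. The helper names latest_run_dir and load_client_rounds are hypothetical.

```python
import json
from pathlib import Path


def latest_run_dir(outputs_root="outputs"):
    """Return the run directory whose name sorts last, or None if there is none."""
    runs = sorted(p for p in Path(outputs_root).iterdir() if p.is_dir())
    return runs[-1] if runs else None


def load_client_rounds(run_dir):
    """Map client directory name -> {round file stem: parsed JSON content}."""
    results = {}
    for client_dir in sorted(Path(run_dir, "clients").glob("client_*")):
        results[client_dir.name] = {
            f.stem: json.loads(f.read_text())
            for f in sorted(client_dir.glob("round_*.json"))
        }
    return results
```

Note that latest_run_dir raises FileNotFoundError if the outputs root does not exist yet, which is usually the right signal that no run has been started.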

📊 Automatic Evaluation Chart Generation

The system automatically generates evaluation charts at the end of each training session.

No manual steps are required: the charts appear automatically once training completes. They label clients with short IDs (0-3) instead of long Ray/Flower identifiers for better readability.

💾 Automatic Model Checkpoint Saving

The system automatically saves model checkpoints during federated learning to preserve trained models for deployment and analysis. Each checkpoint is a complete Hugging Face Hub-compatible directory.

Checkpoint Saving Configuration

Example Checkpoint Files

outputs/2025-01-01_12-00-00/models/
├── checkpoint_round_5/     # After round 5
├── checkpoint_round_10/    # After round 10
└── checkpoint_round_20/    # Final model (end of training)
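
When picking a checkpoint from a script, note that a plain string sort would place checkpoint_round_10 before checkpoint_round_5. The following sketch sorts by the numeric round instead; it relies only on the checkpoint_round_N naming shown above, and the helper names are hypothetical.

```python
import re
from pathlib import Path

_ROUND_RE = re.compile(r"^checkpoint_round_(\d+)$")


def list_checkpoints(models_dir):
    """Return (round_number, path) pairs sorted by round number."""
    found = []
    for path in Path(models_dir).iterdir():
        match = _ROUND_RE.match(path.name)
        if path.is_dir() and match:
            found.append((int(match.group(1)), path))
    return sorted(found, key=lambda item: item[0])


def latest_checkpoint(models_dir):
    """Return the path of the highest-numbered checkpoint, or None."""
    checkpoints = list_checkpoints(models_dir)
    return checkpoints[-1][1] if checkpoints else None
```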

Checkpoint Directory Structure

Each checkpoint is saved as a complete directory containing all Hugging Face Hub-compatible files:

checkpoint_round_N/
├── model.safetensors          # Model weights in safetensors format
├── config.json               # Model configuration
├── README.md                 # Auto-generated model card with training details
├── metrics.json              # Training metrics and insights
├── tokenizer.json            # Tokenizer configuration
├── tokenizer_config.json     # Tokenizer settings
├── special_tokens_map.json   # Special token mappings
├── vocab.json                # Vocabulary
├── merges.txt                # BPE merges (if applicable)
├── generation_config.json    # Text generation settings
├── preprocessor_config.json  # Input preprocessing config
├── policy_preprocessor.json  # SmolVLA policy preprocessor
└── policy_postprocessor.json # SmolVLA policy postprocessor

Configuration Options

[tool.flwr.app.config]
checkpoint_interval = 5  # Save checkpoint every 5 rounds (0 = disabled)
hf_repo_id = "username/zk0-smolvla-federated"  # Optional: Push final model to Hugging Face Hub
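
Before launching a long run, it can help to sanity-check these two settings. The sketch below is illustrative only: validate_checkpoint_config is a hypothetical helper, not part of zk0 or Flower. It assumes the values have already been read into a plain dict, treats a missing checkpoint_interval as 0 purely for illustration, and assumes hf_repo_id follows the "username/repo-name" shape of the example above.

```python
def validate_checkpoint_config(config):
    """Return a list of problems found in the checkpoint-related settings."""
    problems = []
    interval = config.get("checkpoint_interval", 0)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(interval, bool) or not isinstance(interval, int) or interval < 0:
        problems.append("checkpoint_interval must be a non-negative integer")
    repo_id = config.get("hf_repo_id")
    if repo_id is not None:
        parts = repo_id.split("/") if isinstance(repo_id, str) else []
        if len(parts) != 2 or not all(parts):
            problems.append("hf_repo_id should look like 'username/repo-name'")
    return problems
```

An empty list means both settings look well-formed; it does not confirm that the repository exists or that you have permission to push to it.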

Hugging Face Hub Integration

Using Saved Models

# Load a saved checkpoint for inference
import torch
from safetensors.torch import load_file
from src.task import get_model  # Assuming get_model is available

# Load the saved weights from the checkpoint
checkpoint_path = "outputs/2025-01-01_12-00-00/models/checkpoint_round_20/model.safetensors"
state_dict = load_file(checkpoint_path)

# Create the model and load the weights
model = get_model(dataset_meta)  # dataset_meta from your config
model.load_state_dict(state_dict)
model.eval()

# Run inference without tracking gradients
with torch.no_grad():
    predictions = model(input_data)  # input_data: your prepared input

No manual intervention required - model checkpoints are saved automatically during training and can be used for deployment, analysis, or continued training.
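
Before loading a checkpoint, a quick completeness check can catch interrupted saves. This is a minimal sketch: the helper name is hypothetical, and treating only model.safetensors and config.json as required is an assumption made here, since the source does not say which of the listed files are mandatory.

```python
from pathlib import Path

# Assumed minimal set; the other files in the checkpoint listing may be optional.
REQUIRED_FILES = ("model.safetensors", "config.json")


def missing_checkpoint_files(checkpoint_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from checkpoint_dir."""
    checkpoint_dir = Path(checkpoint_dir)
    return [name for name in required if not (checkpoint_dir / name).is_file()]
```

An empty result means every file in the required tuple is present; pass a longer tuple if your deployment also depends on the tokenizer or policy processor files.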

Troubleshooting

For advanced troubleshooting, check simulation.log in the run's subdirectory under outputs/ or consult TECHNICAL-OVERVIEW.
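
Since simulation.log combines output from all clients, the server, Flower, and Ray, it can be long. The sketch below pulls out lines that contain common failure markers. The marker strings are assumptions made here (the exact log format is not documented in this section), so adjust them to what your logs actually contain; find_problem_lines is a hypothetical helper.

```python
from pathlib import Path


def find_problem_lines(log_path, markers=("ERROR", "Traceback", "CUDA out of memory")):
    """Return (line_number, text) for log lines containing any marker."""
    hits = []
    with Path(log_path).open(encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if any(marker in line for marker in markers):
                hits.append((number, line.rstrip("\n")))
    return hits
```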

If issues persist, ensure you’re following the constraints in INSTALLATION and the development guidelines in DEVELOPMENT.

If PyTorch cannot see the GPU in another environment, rerun the pip install command from step 2 with the index URL suffix that matches your CUDA version (e.g., cu121 for CUDA 12.1).