This guide covers executing the zk0 federated learning simulation, including default and alternative methods, output details, and troubleshooting.
| Installation Guide | Architecture Overview | Node Operators |
By default, the training script uses the conda zk0 environment for fast and flexible execution. This provides direct access to host resources while maintaining reproducibility.
# Activate environment (if not already)
conda activate zk0
# Run federated learning (uses pyproject.toml defaults: 1 round, 2 steps/epochs, serialized GPU)
./train.sh
# Or direct Flower run with overrides
conda run -n zk0 flwr run . local-simulation-serialized-gpu --run-config "num-server-rounds=5 local-epochs=10"
# Activate first, then run
conda activate zk0
flwr run . local-simulation-serialized-gpu --run-config "num-server-rounds=5 local-epochs=10"
✅ Validated Alternative: Conda execution has been tested and works reliably for federated learning runs, offering a simpler setup than Docker for development environments.
For reproducible and isolated execution, use the --docker flag or run directly with Docker. This ensures consistent environments and eliminates SafeTensors multiprocessing issues.
The train.sh script runs with configuration from pyproject.toml (defaults: 1 round, 2 steps/epochs for quick tests). It uses conda by default; pass the --docker flag for Docker execution.
# Basic usage with conda (default)
./train.sh
# Detached mode (anti-hang rule - prevents VSCode client crashes from stopping training)
./train.sh --detached
# Use Docker instead of conda
./train.sh --docker
# Detached mode with Docker
./train.sh --docker --detached
# For custom config, use direct Flower run with overrides
flwr run . local-simulation-serialized-gpu --run-config "num-server-rounds=5 local-epochs=10"
# Or with Docker directly (example with overrides)
docker run --gpus all --shm-size=10.24gb \
-v $(pwd):/workspace \
-v $(pwd)/outputs:/workspace/outputs \
-v /tmp:/tmp \
-v $HOME/.cache/huggingface:/home/user_lerobot/.cache/huggingface \
-w /workspace \
zk0 flwr run . local-simulation-serialized-gpu --run-config "num-server-rounds=5"
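The same override pattern applies to the parallel federation described in the configuration notes below; it can be faster but may hit the SafeTensors multiprocessing issues that the serialized profile avoids:

```bash
# Parallel federation - may encounter SafeTensors multiprocessing issues
flwr run . local-simulation-gpu --run-config "num-server-rounds=5 local-epochs=10"
```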
Configuration summary:

- [tool.flwr.app.config] in pyproject.toml sets defaults (e.g., num-server-rounds=1, local-epochs=2).
- local-simulation-serialized-gpu: reliable execution (max-parallelism=1; prevents SafeTensors multiprocessing conflicts).
- local-simulation-gpu: parallel execution (may encounter SafeTensors issues).
- eval-frequency in pyproject.toml controls evaluation cadence (0 = every round).
- --detached flag prevents VSCode client crashes from stopping training (anti-hang rule).
- --docker flag for reproducible, isolated execution when needed.
- --gpus all flag required for GPU access (Docker).

Results of training steps for each client, along with server logs, are written under the outputs/ directory. Each run gets a subdirectory named after the date and time of the run. For example:
outputs/date_time/
├── simulation.log # Unified logging output (all clients, server, Flower, Ray)
├── server/ # Server-side outputs
│ ├── server.log # Server-specific logs
│ ├── eval_policy_loss_chart.png # 📊 AUTOMATIC: Line chart of per-client and server avg policy loss over rounds
│ ├── eval_policy_loss_history.json # 📊 AUTOMATIC: Historical policy loss data for reproducibility
│ ├── round_N_server_eval.json # Server evaluation results
│ ├── federated_metrics.json # Aggregated FL metrics
│ └── federated_metrics.png # Metrics visualization
├── clients/ # Client-side outputs
│ └── client_N/ # Per-client directories
│ ├── client.log # Client-specific logs
│ └── round_N.json # Client evaluation metrics (policy_loss, etc.)
└── models/ # Saved model checkpoints
└── checkpoint_round_N.safetensors
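As a convenience, a small sketch for following the most recent run's unified log (the timestamp-sorted glob is illustrative; it relies on the date-stamped directory layout shown above):

```bash
# Follow the unified log of the most recent run
LATEST_RUN=$(ls -td outputs/*/ | head -n 1)
tail -f "${LATEST_RUN}simulation.log"
```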
The system automatically generates comprehensive evaluation charts at the end of each training session:
- eval_policy_loss_chart.png: line chart of per-client and server average policy loss over rounds.
- eval_policy_loss_history.json: raw policy-loss history for reproducibility and analysis.
No manual steps required - charts appear automatically after training completion. The charts use intuitive client IDs (0-3) instead of long Ray/Flower identifiers for better readability.
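If you want to re-plot or post-process the results, the history file can be loaded directly. A minimal sketch, assuming the JSON maps series names (client IDs and a server average) to per-round loss lists; the actual schema may differ, so inspect the file first:

```python
import json

import matplotlib.pyplot as plt

# Load the recorded policy-loss history (path from a real run)
with open("outputs/2025-01-01_12-00-00/server/eval_policy_loss_history.json") as f:
    history = json.load(f)

# Assumed schema: {"0": [loss_round_1, ...], ..., "server_avg": [...]}
for series_name, losses in history.items():
    plt.plot(range(1, len(losses) + 1), losses, label=series_name)

plt.xlabel("Round")
plt.ylabel("Eval policy loss")
plt.legend()
plt.savefig("eval_policy_loss_replot.png")
```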
The system automatically saves model checkpoints during federated learning to preserve trained models for deployment and analysis. To optimize disk usage, local checkpoints are gated by intervals, and HF Hub pushes are restricted to substantial runs.
- Local checkpoints are saved every checkpoint_interval rounds, configured in pyproject.toml (default: 10); the final round is always saved.
- HF Hub pushes happen only when num_server_rounds >= checkpoint_interval (avoids cluttering repos with tiny/debug runs).
- Each checkpoint includes safetensors weights, config, README, and metrics for HF Hub compatibility.
- Checkpoints are written to the outputs/YYYY-MM-DD_HH-MM-SS/models/ directory.

For example, a 250-round run with the default interval of 10:

outputs/2025-01-01_12-00-00/models/
├── checkpoint_round_10/ # After round 10 (interval hit)
├── checkpoint_round_20/ # After round 20 (interval hit)
├── checkpoint_round_30/ # After round 30 (interval hit)
...
├── checkpoint_round_250/ # Final model (always saved)
By contrast, a short run (2 rounds) saves only the final model:

outputs/2025-01-01_12-00-00/models/
├── checkpoint_round_2/ # Final model only (always saved)
# No intermediate checkpoints; no HF push (2 < 10)
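The gating amounts to two simple checks. A minimal illustrative sketch (not the project's actual code; the helper names are hypothetical):

```python
from typing import Optional

def should_save_checkpoint(round_num: int, num_server_rounds: int,
                           checkpoint_interval: int = 10) -> bool:
    # Save every checkpoint_interval rounds, plus always on the final round
    return round_num % checkpoint_interval == 0 or round_num == num_server_rounds

def should_push_to_hub(hf_repo_id: Optional[str], num_server_rounds: int,
                       checkpoint_interval: int = 10) -> bool:
    # Push only when a repo is configured and the run is substantial
    return hf_repo_id is not None and num_server_rounds >= checkpoint_interval
```

Both thresholds come from pyproject.toml: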
[tool.flwr.app.config]
checkpoint_interval = 10 # Save local checkpoint every N rounds + final (default: 10)
hf_repo_id = "username/zk0-smolvla-fl" # Optional: Push final model to Hugging Face Hub (only if num_server_rounds >= checkpoint_interval)
HF Hub push requirements:

- hf_repo_id configured AND num_server_rounds >= checkpoint_interval
- HF_TOKEN environment variable for Hub access
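A push also needs authentication; the token value below is a placeholder:

```bash
# Placeholder token - substitute your own Hugging Face access token
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```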
To load a saved checkpoint for inference:

import torch
from safetensors.torch import load_file

from src.task import get_model  # Assuming get_model is available

# Load saved weights
checkpoint_path = "outputs/2025-01-01_12-00-00/models/checkpoint_round_20.safetensors"
state_dict = load_file(checkpoint_path)

# Create model architecture and load weights (dataset_meta comes from your config)
model = get_model(dataset_meta)
model.load_state_dict(state_dict)
model.eval()

# Use for inference
with torch.no_grad():
    predictions = model(input_data)
No manual intervention required - model checkpoints are saved automatically during training and can be used for deployment, analysis, or continued training.
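To locate saved checkpoints across runs, a one-liner that relies on the directory layout shown above:

```bash
# List every saved checkpoint across all runs
ls -d outputs/*/models/checkpoint_round_*
```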
To troubleshoot restarts (e.g., PSU overload), use sys_monitor_logs.sh:
- Run ./sys_monitor_logs.sh before training.
- Post-restart: tail -n 100 gpu_monitor.log | grep power to check for power spikes.
- ./train.sh --detached isolates training in tmux sessions (anti-hang rule); VSCode client crashes won't stop training processes (see the command sketch after this list).
- tmux ls to list sessions, tmux attach -t <session-name> to monitor, tmux kill-session -t <session-name> to stop.
- Mount the outputs directory when running in Docker (-v $(pwd)/outputs:/workspace/outputs).
- Use local-simulation-serialized-gpu for reliable execution.
- Use pytest -n auto for parallel testing (see DEVELOPMENT.md).
- Use local-simulation-serialized-gpu or Docker for isolation.
- Check nvidia-smi output when diagnosing GPU problems.

For advanced troubleshooting, check simulation.log in outputs or consult TECHNICAL-OVERVIEW.md.
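The detached-session workflow referenced in the list above, as commands:

```bash
# Start training detached in a tmux session (anti-hang rule)
./train.sh --detached

# List active tmux sessions
tmux ls

# Attach to a session to monitor progress (detach again with Ctrl-b d)
tmux attach -t <session-name>

# Stop training by killing the session
tmux kill-session -t <session-name>
```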
If issues persist, ensure you’re following the constraints in INSTALLATION.md and the memory bank in .kilocode/rules/memory-bank/.