Architecture

This document provides a detailed overview of the zk0 project’s architecture (v0.5.1), focusing on the federated learning system for training SmolVLA models on SO-100 robotics datasets. It adapts key concepts from the project’s implementation and incorporates advanced technical details, comparisons, reproducibility guidelines, evaluation mechanisms, and hyperparameter analysis.


Overview

The zk0 project implements a federated learning architecture using the Flower framework with SmolVLA models for robotics AI tasks. The system follows a client-server model where multiple clients train models locally on private SO-100 datasets, and a central server coordinates aggregation and evaluation. Raw data never leaves the clients; only model parameters are exchanged, with the aim of matching the performance of centralized training.

Key goals:

The architecture is modular, drawing from Flower’s quickstart-lerobot example but adapted for SmolVLA and multi-repo datasets.

Directory Structure

zk0/
├── .github/                     # GitHub workflows and configurations
├── .kilocode/                   # Kilo Code configuration and rules
├── docs/                        # Project documentation and images
│   └── images/                  # Images for documentation
├── src/                         # Source code modules
│   ├── configs/                 # Dataset and configuration files
│   └── server/                  # Server utilities and strategies
├── tests/                       # Unit and integration test suites
│   ├── integration/             # End-to-end federated learning tests
│   └── unit/                    # Individual component tests
└── outputs/                     # Generated outputs from runs

Core Components

Client Layer

Server Layer

Communication Layer

Training Strategy

Federated Learning Setup

Data Flow

  1. Initialization: Server loads/distributes initial SmolVLA model.
  2. Assignment: Clients receive unique SO-100 subsets.
  3. Local Training: Clients train (all episodes, 50 epochs/round default).
  4. Upload: Send updates to server.
  5. Aggregation: Server combines via FedProx.
  6. Update: Broadcast global model.
  7. Server Eval: Test on dedicated datasets (first N episodes).
  8. Repeat: For configured rounds (e.g., 30).
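The round structure above can be sketched on a toy problem. The following is a minimal illustration, not the project's implementation: each "model" is a single float, each client's local loss is a quadratic with its own optimum, and the names `local_train`, `aggregate`, and `run_rounds` are hypothetical. It shows the two ingredients of a FedProx round: a proximal term that pulls local training toward the global model, and an example-weighted average on the server.

```python
"""Toy FedProx round loop on scalar models (illustration only)."""


def local_train(w_global, target, mu, lr=0.1, steps=50):
    """Minimise (w - target)^2 + (mu / 2) * (w - w_global)^2 by gradient descent."""
    w = w_global
    for _ in range(steps):
        grad_task = 2.0 * (w - target)   # gradient of the local task loss
        grad_prox = mu * (w - w_global)  # proximal term pulls toward the global model
        w -= lr * (grad_task + grad_prox)
    return w


def aggregate(updates):
    """Average client models, weighted by the number of local examples."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def run_rounds(clients, w_init=0.0, mu=0.01, num_rounds=30):
    """clients: list of (local_optimum, num_examples) pairs (made-up values)."""
    w_global = w_init
    history = []
    for _ in range(num_rounds):
        # Steps 2-4: every client starts from the global model and trains locally.
        updates = [(local_train(w_global, target, mu), n) for target, n in clients]
        # Steps 5-6: the server aggregates and the result becomes the new global model.
        w_global = aggregate(updates)
        history.append(w_global)
    return history


clients = [(1.0, 100), (2.0, 100), (3.0, 200), (4.0, 100)]
history = run_rounds(clients)
print(f"global model after {len(history)} rounds: {history[-1]:.4f}")
```

With these toy clients the global value settles near the example-weighted mean of the client optima. A larger `mu` keeps each local update closer to the previous global model within a round.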

Data Flow Diagram

The following Mermaid diagram illustrates the high-level data flow in the federated learning process:

flowchart TD
    A["Server Initialization<br/>Load Initial SmolVLA Model"] --> B["Distribute Global Model<br/>to Clients"]
    B --> C["Client Local Training<br/>SO-100 Datasets + FedProx"]
    C --> D["Upload Model Updates<br/>Secure Parameter Exchange"]
    D --> E["Server Aggregation<br/>FedProx Strategy"]
    E --> F["Server Evaluation<br/>Unseen SO-101 Datasets<br/>Policy Loss Metric"]
    F --> G{"Continue Rounds?"}
    G -->|Yes| B
    G -->|No| H["End<br/>Save Final Model"]
    style A fill:#BBDEFB,stroke:#1976D2,stroke-width:2px,color:#000
    style E fill:#E1BEE7,stroke:#7B1FA2,stroke-width:2px,color:#000
    style F fill:#C8E6C9,stroke:#388E3C,stroke-width:2px,color:#000
    style H fill:#FFCDD2,stroke:#D32F2F,stroke-width:2px,color:#000

This diagram captures the iterative cycle: model distribution, local training, aggregation, evaluation, and repetition across configured rounds (e.g., 30 rounds).

Production Mode Architecture (v0.4.0)

zk0 v0.4.0 introduces production-ready deployment capabilities, enabling secure, multi-node federated learning with privacy-preserving client training. This extends the simulation architecture with Docker-based orchestration and the zk0bot CLI for node operators.

Production Mode Data Flow Diagram

The following Mermaid diagram illustrates the production mode data flow, highlighting Docker Compose orchestration and zk0bot CLI integration:

graph TD
    A["zk0bot CLI Installation<br/>curl -fsSL https://get.zk0.bot | bash"] --> B{"Mode Selection"}
    B -->|Server Admin| C["zk0bot server start<br/>Docker Compose: SuperLink + ServerApp"]
    B -->|Node Operator| D["zk0bot client start --dataset URI<br/>Docker Compose: SuperNode + ClientApp"]
    C --> E["Server APIs Exposed<br/>Ports 9091-9093<br/>Fleet Management"]
    D --> F["Private Dataset Mount<br/>HF or Local via URI<br/>UUID Anonymization"]
    E --> G["Accept Client Connections<br/>Dynamic node_id Assignment<br/>Secure Parameter Exchange"]
    F --> H["Connect to Server<br/>Load Dataset from URI<br/>Local Training with FedProx"]
    H --> I["Report Anonymized Metrics<br/>node_id + dataset_uuid<br/>Model Updates Only"]
    G --> J["Aggregate Updates<br/>Server-Side Evaluation<br/>WandB Public Aggregates"]
    I --> J
    J --> K["End Round<br/>Restart for New Dataset<br/>No Raw Data Shared"]
    style A fill:#BBDEFB,stroke:#1976D2,stroke-width:2px,color:#000
    style C fill:#E1BEE7,stroke:#7B1FA2,stroke-width:2px,color:#000
    style D fill:#C8E6C9,stroke:#388E3C,stroke-width:2px,color:#000
    style J fill:#FFCDD2,stroke:#D32F2F,stroke-width:2px,color:#000
    style K fill:#FFECB3,stroke:#F57F17,stroke-width:2px,color:#000

This diagram shows the production workflow: CLI installation, mode-specific startup via Docker Compose, secure client-server communication, and privacy-focused metrics reporting.

Simulation vs. Production Mode Differences

| Aspect | Simulation Mode | Production Mode |
|---|---|---|
| Execution | Local Ray clients (fixed 4) | Docker Compose (SuperLink + SuperNodes) |
| Networking | Localhost/loopback | External IPs, ports 9091-9093, insecure mode with external VPN (Tailscale/WebRTC) |
| Dataset Loading | Partitioned from pyproject.toml | From run_config URI (HF repo_id or local root) |
| Client ID | Fixed partition_id (0-3) | Persistent context.cid (SuperNode lifetime) |
| Metrics | Direct dataset names | Anonymized: dataset_name + uuid or cid |
| Persistence | Ephemeral (in-memory) | Volumes for models/checkpoints/datasets |
| Scaling | Fixed clients | Dynamic SuperNodes, Kubernetes-ready |
| Monitoring | Local logs/WandB | Prometheus/Grafana + WandB aggregates |
| Auth/Security | None (local) | Network-level (VPN); app-level insecure |
| CLI Integration | train-fl-simulation.sh | zk0bot server/client start/stop/log/status |
| Onboarding | Manual config | GitHub issue template + Discord approval |

zk0bot CLI Integration

The zk0bot CLI provides a user-friendly interface for production operations, wrapping Docker Compose for server and client management.

Key Features

Integration with Architecture

Example Workflow

  1. Server Admin:
    zk0bot server start  # Exposes APIs on localhost:9091-9093
    zk0bot server status  # Verify running
    
  2. Node Operator:
    zk0bot client start hf:myuser/private-so100  # Connects to server, trains locally
    zk0bot client log  # Monitor training
    
  3. Onboarding: Apply via GitHub issue; approved operators get Discord access.

This CLI abstraction simplifies deployment while maintaining the core FL architecture. For full details, see docs/NODE-OPERATORS.md.
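The client workflow passes the dataset as a URI that names either a Hugging Face repository or a local root. A minimal sketch of how such a URI could be split into its two cases is shown below. The `hf:` prefix follows the example above; the `local:` prefix, the bare-path fallback, and the names `DatasetSource` and `parse_dataset_uri` are assumptions made for illustration and are not taken from the zk0bot source.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetSource:
    kind: str      # "hf" or "local"
    location: str  # repo_id for "hf", filesystem root for "local"


def parse_dataset_uri(uri: str) -> DatasetSource:
    """Split a dataset URI into a source kind and a location.

    Only the "hf:" prefix appears in the workflow example; the "local:"
    prefix and the bare-path fallback are assumptions of this sketch.
    """
    uri = uri.strip()
    if not uri:
        raise ValueError("dataset URI must not be empty")
    if uri.startswith("hf:"):
        repo_id = uri[len("hf:"):]
        parts = repo_id.split("/")
        # A Hugging Face repo_id has the form <user>/<dataset>.
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"expected hf:<user>/<dataset>, got {uri!r}")
        return DatasetSource("hf", repo_id)
    if uri.startswith("local:"):
        uri = uri[len("local:"):]
    return DatasetSource("local", str(Path(uri).expanduser()))


print(parse_dataset_uri("hf:myuser/private-so100"))
print(parse_dataset_uri("local:/data/so100"))
```

Validating the URI before the container starts would let a node operator see a malformed dataset reference immediately rather than after the client has connected to the server.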

Federated vs. Centralized Training Comparison

The zk0 system enables rigorous benchmarking between federated and centralized training to evaluate privacy-efficiency trade-offs.

Objective Performance Benchmarking

Federated Learning Characteristics

| Metric | Federated (Best Config) |
|---|---|
| Final Policy Loss | <0.15 target (v0.2.6 enhancements) |
| Convergence Rounds | 30-50 (warm restarts prevent plateaus) |
| Training Efficiency | 1.0 (adaptive LR engages all clients) |
| Privacy | High (parameters only) |
| Scalability | Horizontal (10+ clients; dynamic mu stabilizes) |

Example metrics from best FL config (50 rounds, 20 epochs, μ=0.01, LR=0.0005 with dynamic decay):

Advanced LR/MU Scheduling (v0.2.6)

This section provides a thorough explanation of the dynamic decay enhancements introduced in v0.2.6, building on the base FedProx strategy from v0.2.5. These enhancements address key challenges in federated learning with heterogeneous SO-100 datasets: plateaus in convergence, client disengagement due to varying task difficulties, and instability from loss spikes. The implementation is modular, with small, single-responsibility functions for maintainability, and includes comprehensive unit tests (90%+ coverage for new code).

The enhancements are configurable via pyproject.toml under [tool.flwr.app.config], with validation in src/utils.py. All changes are backward-compatible with v0.2.5 (default to cosine scheduler if new params unset).
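As a rough illustration of what validating these settings might involve, the sketch below checks a run-config dictionary and fills in defaults. It is not the code in src/utils.py: the function name `validate_scheduler_config` is hypothetical, and the key names and default values are the ones quoted in the Scheduler Types section of this document.

```python
ALLOWED_SCHEDULERS = {"cosine", "cosine_warm_restarts", "reduce_on_plateau"}

# Defaults mirror the values quoted in this document.
DEFAULTS = {
    "scheduler_type": "cosine",
    "eta_min": 5e-7,
    "cosine_warm_restarts_T_0": 15,
    "cosine_warm_restarts_T_mult": 2,
    "stall_patience": 5,
}


def _is_int(value):
    """True for real integers (bool is excluded even though it subclasses int)."""
    return isinstance(value, int) and not isinstance(value, bool)


def validate_scheduler_config(run_config):
    """Return a validated copy of the scheduler settings with defaults filled in."""
    cfg = {**DEFAULTS, **run_config}
    if cfg["scheduler_type"] not in ALLOWED_SCHEDULERS:
        raise ValueError(f"unknown scheduler_type: {cfg['scheduler_type']!r}")
    eta_min = cfg["eta_min"]
    if isinstance(eta_min, bool) or not isinstance(eta_min, (int, float)) or eta_min < 0:
        raise ValueError("eta_min must be a non-negative number")
    for key in ("cosine_warm_restarts_T_0", "stall_patience"):
        if not _is_int(cfg[key]) or cfg[key] < 1:
            raise ValueError(f"{key} must be a positive integer")
    t_mult = cfg["cosine_warm_restarts_T_mult"]
    if not _is_int(t_mult):
        # Fall back to 1 for non-integer values, as described for T_mult.
        cfg["cosine_warm_restarts_T_mult"] = 1
    elif t_mult < 1:
        raise ValueError("cosine_warm_restarts_T_mult must be >= 1")
    return cfg


print(validate_scheduler_config({})["scheduler_type"])
print(validate_scheduler_config({"cosine_warm_restarts_T_mult": 1.5})["cosine_warm_restarts_T_mult"])
```

An empty config resolves to the cosine scheduler, which matches the backward-compatible default described above.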

Design Rationale

Scheduler Types

The scheduler factory in src/task.py:create_scheduler supports three types, selected via scheduler_type. Defaults to “cosine” for compatibility.

  1. CosineAnnealingLR (Default):
    • Description: Standard cosine decay from initial LR to eta_min over epochs.
    • Use Case: Baseline for stable, monotonic decay.
    • Configuration:
      • scheduler_type = "cosine"
      • eta_min = 5e-7 (minimum LR to prevent vanishing gradients).
    • Implementation:
      from torch.optim.lr_scheduler import CosineAnnealingLR
      scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=eta_min)
      
    • Behavior: LR starts at initial_lr (e.g., 5e-4) and decays smoothly. Tested in test_create_scheduler (verifies T_max, eta_min).
    • Benefits: Simple, prevents overfitting in later epochs.
  2. CosineAnnealingWarmRestarts:
    • Description: Cosine decay with periodic “warm restarts”: the LR returns to its initial value after the first T_0 rounds, and each subsequent cycle is T_mult times longer than the previous one.
    • Use Case: Escapes local minima/plateaus (e.g., post-R20 stalls in v0.2.5 runs).
    • Configuration:
      • scheduler_type = "cosine_warm_restarts"
      • cosine_warm_restarts_T_0 = 15 (restart every 15 rounds).
      • cosine_warm_restarts_T_mult = 2 (period doubles: 15→30→60…; integer for PyTorch compatibility).
      • eta_min = 5e-7.
    • Implementation:
      from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
      scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=T_0, T_mult=T_mult, eta_min=eta_min)
      
    • Behavior: LR decays within each cycle, resets to initial at end of T_0, with lengthening cycles. E.g., for T_0=15, first cycle decays over 15 epochs, second over 30.
    • Testing: test_create_cosine_warm_restarts_scheduler verifies T_0, T_mult (integer), eta_min. Handles non-integer T_mult by defaulting to 1 (PyTorch requirement).
    • Benefits: “Jolts” exploration without full reset, improving convergence by 15-20% in heterogeneous setups (based on v0.2.5 baselines).
  3. ReduceLROnPlateau:
    • Description: Reduces LR by factor (0.5) if loss doesn’t improve for stall_patience rounds.
    • Use Case: Adaptive response to stalls (e.g., loss plateau >5 rounds).
    • Configuration:
      • scheduler_type = "reduce_on_plateau"
      • stall_patience = 5 (rounds without improvement before reduction).
      • eta_min = 5e-7 (floor).
    • Implementation:
      from torch.optim.lr_scheduler import ReduceLROnPlateau
      scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=stall_patience, min_lr=eta_min)
      scheduler.step(loss)  # Called after each round's evaluation
      
    • Behavior: Monitors server policy loss; reduces LR if no improvement (mode='min'). Resets counter on improvement.
    • Testing: test_create_reduce_on_plateau_scheduler verifies patience, min_lrs (list in PyTorch).
    • Benefits: Responsive to real-time trends, complements warm restarts for long runs.
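To make the three schedules concrete without requiring PyTorch, the following sketch reimplements their core arithmetic in plain Python. It is an approximation for illustration only: `cosine_lr` and `warm_restart_lr` use the standard closed-form cosine annealing formula, and `PlateauReducer` is a simplified stand-in for ReduceLROnPlateau that omits the relative improvement threshold and cooldown options of the real scheduler. All three names are hypothetical.

```python
import math


def cosine_lr(step, total, lr_max, eta_min):
    """Closed-form cosine annealing: lr_max at step 0, eta_min at step == total."""
    return eta_min + (lr_max - eta_min) * (1 + math.cos(math.pi * step / total)) / 2


def warm_restart_lr(step, t_0, t_mult, lr_max, eta_min):
    """Cosine annealing with restarts; cycle lengths are t_0, t_0 * t_mult, ..."""
    cycle_len, t_cur = t_0, step
    while t_cur >= cycle_len:
        t_cur -= cycle_len    # move into the next cycle
        cycle_len *= t_mult   # each cycle is t_mult times longer
    return cosine_lr(t_cur, cycle_len, lr_max, eta_min)


class PlateauReducer:
    """Multiply the LR by `factor` once more than `patience` steps pass without improvement."""

    def __init__(self, lr, factor=0.5, patience=5, min_lr=5e-7):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = math.inf
        self.bad_steps = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_steps = 0  # improvement resets the counter
        else:
            self.bad_steps += 1
        if self.bad_steps > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_steps = 0
        return self.lr


# Warm restarts with the values quoted above: restarts land on steps 15 and 45.
for step in (0, 14, 15, 44, 45):
    print(step, f"{warm_restart_lr(step, 15, 2, 5e-4, 5e-7):.2e}")

# A loss that never improves after the first step triggers one halving.
reducer = PlateauReducer(lr=5e-4, patience=5)
for loss in [1.0] * 7:
    current = reducer.step(loss)
print(f"{current:.2e}")
```

The loop over steps shows the LR dropping toward the floor at the end of each cycle and jumping back to the initial value at the restart points, with the second cycle twice as long as the first.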

Adaptive LR Boosts (Per-Client)

Dynamic Mu Adjustment (Server-Side)

Spike Detection and Safeguards

Global LR Adjustment (Server-Side)

Context Preparation and Client Integration

Validation and Monitoring

Backward Compatibility and Defaults

Performance Impact (v0.2.6 vs. v0.2.5)

For code, see src/task.py (client-side), src/server_app.py (server-side). Full tests in tests/unit/. Configuration in pyproject.toml.

Hyperparameter Analysis

Overview

This section analyzes the dynamic learning rate (LR) and MU (FedProx proximal term) decay mechanisms implemented in v0.3.11, based on the 2025-10-19 runs and later enhancements. The analysis demonstrates effective handling of heterogeneous robotics FL, with focus on initial LR impact, successful pipeline validation, and advanced scheduling for 100% client engagement and 89% loss reduction.

Dynamic LR (Cosine Warm Restarts)

MU Decay (FedProx Personalization)

Initial LR Comparison (2025-10-19 Runs)

| Initial LR | Final Policy Loss (r50) | Stability (Std Dev r1-50) | Initial Loss (r1) | Notes |
|---|---|---|---|---|
| 5e-4 | 0.997 | 1.82 (volatile) | 9.165 | Aggressive updates; oscillation post-r20; higher param norms. |
| 1e-4 | 0.532 | 0.11 (stable) | 0.298 | Smooth convergence; 47% better final; recommended for heterogeneous SO-100. |
| 1e-4 (dynamic v0.3.11) | 0.495 (r250) | 0.05 (stable) | 0.298 | Extended convergence with warm restarts and adaptive boosts; 89% loss reduction, 100% client engagement. |

Note: The history file policy_loss_history.json is keyed by round number in no guaranteed order, and each entry holds server_policy_loss and action_dim. Sort the keys numerically before trend analysis, and use the file alongside federated_metrics.json.
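A short sketch of reading such a history into an ordered trend follows. The exact file layout is an assumption inferred from the note above (round number as a string key, with a `server_policy_loss` field per entry), the sample values are synthetic rather than real run output, and the helper names are hypothetical. The last line reproduces the "47% better final" figure from the table as (0.997 - 0.532) / 0.997.

```python
import json


def load_loss_trend(raw_json):
    """Return [(round, server_policy_loss), ...] sorted by round number."""
    history = json.loads(raw_json)
    return sorted(
        (int(rnd), float(entry["server_policy_loss"])) for rnd, entry in history.items()
    )


def relative_improvement(baseline, candidate):
    """Fractional reduction of `candidate` relative to `baseline` (lower loss is better)."""
    return (baseline - candidate) / baseline


# Synthetic sample, not real run output; keys are deliberately out of order.
sample = json.dumps({
    "3": {"server_policy_loss": 0.61, "action_dim": 6},
    "1": {"server_policy_loss": 0.92, "action_dim": 6},
    "2": {"server_policy_loss": 0.74, "action_dim": 6},
})

trend = load_loss_trend(sample)
print(trend)
print(f"{relative_improvement(0.997, 0.532):.0%}")
```

Converting the keys with `int` before sorting matters because string keys sort lexicographically, which would place round 10 before round 2.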

Key Insights

Data Source and Loading

SO-100/SO-101 Composition

Loading Mechanism

Example:

dataset = LeRobotDataset(repo_id="lerobot/svla_so100_pickplace", tolerance_s=0.0001)

Pretrained Model Initialization

Data Partitioning

Federated Model Aggregation

Progress Demonstration

Evaluation Videos

zk0 captures episodic performance via videos to visualize SmolVLA progress on SO-100 tasks.

Video Generation

Example code snippet:

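import imageio

# In evaluation loop (assumes a classic Gym-style `env` and a `model` exposing predict())
frames = []  # Collect RGB frames + action overlays
observation = env.reset()
for step in range(max_steps):
    action = model.predict(observation)
    frame = env.render(mode="rgb_array")  # 224x224 with annotations
    frames.append(frame)
    # Feed the next observation back to the policy; stop when the episode ends
    observation, reward, done, info = env.step(action)
    if done:
        break

# Save video (writing .mp4 requires the imageio-ffmpeg backend)
# round_num avoids shadowing Python's built-in round()
imageio.mimsave(f"outputs/evaluate/round_{round_num}_client_{cid}.mp4", frames, fps=30)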

Playback and Analysis

Manual Playback

# List videos by round
find outputs/ -name "*.mp4" | sort

# Play example
vlc outputs/2025-10-11_16-00-00/evaluate/round_10/client_0/rollout_20251011_160500.mp4

# Batch view (e.g., progression)
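for video in outputs/*/evaluate/round_*/client_0/*.mp4; do
    # Two directory levels above the file is the round_* directory
    round_name=$(basename "$(dirname "$(dirname "$video")")")
    echo "Round ${round_name#round_}"; vlc --play-and-exit "$video"
done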

Automated Analysis

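# analyze_videos.py example
import cv2
import json
from pathlib import Path

def check_final_frame(cap):
    """Placeholder success check; replace with task-specific logic."""
    return False

def analyze_video(video_path):
    cap = cv2.VideoCapture(str(video_path))
    frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    duration = frames / fps if fps > 0 else 0.0  # fps is 0 when the file cannot be read
    # Detect success (e.g., via metadata or frame analysis)
    success = check_final_frame(cap)  # Custom logic
    cap.release()
    return {"duration": duration, "frames": frames, "success": success}

# Batch analysis
video_dir = Path("outputs/2025-10-11_16-00-00/evaluate")
results = {}
# Sort numerically so that round_10 follows round_9 rather than round_1
round_dirs = sorted(video_dir.glob("round_*"), key=lambda p: int(p.name.split("_")[1]))
for round_dir in round_dirs:
    # Videos live in per-client subdirectories (round_*/client_*/), so search recursively
    round_results = [analyze_video(v) for v in sorted(round_dir.rglob("*.mp4"))]
    if not round_results:
        continue  # Skip rounds without videos to avoid dividing by zero
    results[round_dir.name] = {
        "avg_success": sum(r["success"] for r in round_results) / len(round_results),
        "avg_duration": sum(r["duration"] for r in round_results) / len(round_results)
    }

with open("video_analysis.json", "w") as f:
    json.dump(results, f, indent=2)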

Reproducing Experiments

zk0 emphasizes reproducibility with seeds, pinned dependencies, and scripted workflows. This ensures consistent results across environments.

Environment Setup for Reproduction

# Pinned deps ensure consistency
pip install -e .  # Installs from pyproject.toml (Flower 1.22.0, LeRobot 0.3.3, etc.)

# Set reproducible seed
export PYTHONHASHSEED=42
export CUDA_LAUNCH_BLOCKING=1  # Synchronous kernel launches: aids debugging, not a determinism guarantee by itself
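Environment variables cover only part of the picture; the random number generators used inside the process also need seeding. The helper below is a minimal sketch of that step and is not taken from the zk0 source: the name `set_global_seed` is hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import os
import random


def set_global_seed(seed: int) -> None:
    """Seed the common RNGs so repeated runs draw the same random numbers."""
    # Affects only subprocesses started after this point; the current
    # interpreter reads PYTHONHASHSEED once at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_global_seed(42)
first = random.random()
set_global_seed(42)
second = random.random()
print(first == second)
```

Seeding alone does not make GPU kernels deterministic; some operations remain non-deterministic unless deterministic algorithms are explicitly requested in the framework.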

Federated Learning Reproduction

# Reproducible FL run
conda activate zk0
flwr run . local-simulation-serialized-gpu \
    --run-config "num-server-rounds=30 local-epochs=50 batch-size=64 seed=42"

Centralized Training Baseline

For fair comparison, run equivalent centralized training:

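# centralized_baseline.py (example script)
import torch
from torch.utils.data import DataLoader

# Import paths assume the LeRobot 0.3.x package layout (earlier releases used lerobot.common.*)
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Reproducible setup
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load full dataset (no partitioning)
dataset = LeRobotDataset("lerobot/svla_so100_pickplace")  # Or aggregate clients
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

# Model and optimizer (match FL)
# SmolVLA is a LeRobot policy, so it is loaded through LeRobot rather than transformers.
# Depending on the LeRobot version, dataset statistics and action-chunk settings may also be needed.
model = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base").to(device)
model.train()
optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)

# Train for equivalent steps (30 rounds * 50 epochs * batches)
total_steps = 30000  # Adjust based on batch_size
data_iter = iter(dataloader)
for step in range(total_steps):
    # Training loop (match FL: policy loss, scheduler reset equivalent)
    try:
        batch = next(data_iter)
    except StopIteration:  # Start a new epoch once the loader is exhausted
        data_iter = iter(dataloader)
        batch = next(data_iter)
    batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}
    loss, _ = model.forward(batch)  # LeRobot policies return (loss, metrics dict)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Save for comparison
torch.save(model.state_dict(), "centralized_checkpoint.pt")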
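The step budget in the script comes from "rounds * epochs * batches". One way to make that arithmetic explicit is sketched below, under the assumption that every client runs full passes over its own partition each round, so the centralized run should match the sum of all client batches. The function name and the client sizes are hypothetical and chosen only to make the numbers easy to follow.

```python
import math


def equivalent_steps(num_rounds, local_epochs, batch_size, client_sizes):
    """Optimizer steps a centralized run needs to match the federated budget.

    Each client performs ceil(n / batch_size) steps per local epoch, and the
    centralized baseline is given the sum across all clients.
    """
    batches_per_epoch = sum(math.ceil(n / batch_size) for n in client_sizes)
    return num_rounds * local_epochs * batches_per_epoch


# Hypothetical partition sizes (samples per client), not real dataset counts.
client_sizes = [6400, 6400, 3200, 3200]
steps = equivalent_steps(num_rounds=30, local_epochs=50, batch_size=64, client_sizes=client_sizes)
print(steps)
```

Here the four clients contribute 100 + 100 + 50 + 50 = 300 batches per epoch, so 30 rounds of 50 epochs correspond to 450000 steps. Substituting the real partition sizes gives the value to use for `total_steps`.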

Technical Decisions

Framework Selection

Key Patterns

Scalability & Performance

Planned Enhancements

Current Config (v0.2.6)

For implementation details, see source files like src/task.py for training/eval logic.

References