`npx claudepluginhub plurigrid/asi --plugin asi`

This skill uses the workspace's default tool permissions.
Synthesizes Patrick Kenny's discrete active inference framework with K-Scale's JAX/MuJoCo robotics stack for predictive coding in robot locomotion.
**Trit**: -1 (MINUS - analysis/verification)
**Color**: #E85B8E (Rose Pink)
**URI**: skill://entropy-sim2real#E85B8E
Entropy bridges the sim-to-real gap by collapsing a broad distribution over simulated worlds onto the one actual world:
```
SIMULATION                                      REALITY
High Entropy ─────────────────────────────────▶ Low Entropy

H(params) = max     ══════════▶  H(params) ≈ 0
H(π|s)    = high    ══════════▶  H(π|s)    = focused
p(sim)    = broad   ══════════▶  p(real)   = delta

┌─────────────────┐              ┌─────────────────┐
│  MANY POSSIBLE  │    BRIDGE    │   ONE ACTUAL    │
│     WORLDS      │──────────────│     WORLD       │
│   (superpos.)   │              │   (collapsed)   │
└─────────────────┘              └─────────────────┘
```
Maximize entropy over simulation parameters:
```python
import jax
import jax.numpy as jnp
from typing import Dict


class EntropyMaximizingRandomizer:
    """Domain randomization that maximizes parameter entropy."""

    def __init__(self, param_ranges: Dict[str, tuple]):
        self.param_ranges = param_ranges

    def entropy(self, distribution: str = "uniform") -> float:
        """Compute the total entropy of the parameter distributions."""
        H = 0.0
        for name, (low, high) in self.param_ranges.items():
            if distribution == "uniform":
                # H(Uniform[a, b]) = log(b - a)
                H += jnp.log(high - low)
            elif distribution == "gaussian":
                # H(Gaussian) = 0.5 * log(2πeσ²)
                sigma = (high - low) / 4  # ~95% of mass within range
                H += 0.5 * jnp.log(2 * jnp.pi * jnp.e * sigma**2)
        return H

    def sample(self, key: jax.Array) -> Dict[str, float]:
        """Sample parameters to maximize coverage."""
        params = {}
        for i, (name, (low, high)) in enumerate(self.param_ranges.items()):
            k = jax.random.fold_in(key, i)
            # Uniform maximizes entropy for bounded support
            params[name] = jax.random.uniform(k, minval=low, maxval=high)
        return params

    def adaptive_entropy(
        self,
        key: jax.Array,
        real_samples: jnp.ndarray,
        temperature: float = 1.0,
    ) -> Dict[str, float]:
        """
        Adapt randomization to maximize coverage of the real distribution.

        Uses the maximum entropy principle: find the distribution with the
        highest entropy subject to matching the observed moments.
        """
        # Estimate moments of the real distribution
        real_mean = jnp.mean(real_samples, axis=0)
        real_var = jnp.var(real_samples, axis=0)
        # The max-entropy distribution matching mean and variance is Gaussian
        params = {}
        for i, (name, _) in enumerate(self.param_ranges.items()):
            k = jax.random.fold_in(key, i)
            # Sample from a Gaussian matching the real moments (max entropy)
            params[name] = jax.random.normal(k) * jnp.sqrt(real_var[i]) + real_mean[i]
        return params
```
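For concreteness, a minimal usage sketch of the randomizer above. The friction and mass ranges mirror the workflow diagram below; they are illustrative, not calibrated values:

```python
# Illustrative parameter ranges (same as in the workflow diagram)
randomizer = EntropyMaximizingRandomizer({
    "friction": (0.3, 1.5),
    "mass_scale": (0.8, 1.2),
})

key = jax.random.PRNGKey(0)
params = randomizer.sample(key)      # e.g. {"friction": 0.71, "mass_scale": 1.05}
H = randomizer.entropy("uniform")    # log(1.2) + log(0.4) ≈ -0.73 nats
```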
Policy optimization with entropy regularization:
```python
class MaxEntropyPPO:
    """
    PPO with an entropy bonus for robust sim2real.

    Objective: max E[Σ γᵗ(rₜ + α·H(π(·|sₜ)))]

    High entropy → diverse actions → robust to perturbations.
    """

    def __init__(
        self,
        entropy_coef: float = 0.01,
        target_entropy: float = -1.0,
        auto_tune: bool = True,
    ):
        self.alpha = entropy_coef
        self.target_entropy = target_entropy
        self.auto_tune = auto_tune
        if auto_tune:
            # Learnable temperature (SAC-style)
            self.log_alpha = jnp.log(entropy_coef)

    def policy_entropy(self, logits: jnp.ndarray) -> float:
        """Compute policy entropy H(π) = -Σ π(a)·log π(a)."""
        probs = jax.nn.softmax(logits)
        log_probs = jax.nn.log_softmax(logits)
        return -jnp.sum(probs * log_probs, axis=-1).mean()

    def gaussian_entropy(self, std: jnp.ndarray) -> float:
        """Entropy of a Gaussian policy: H = 0.5 * Σᵢ log(2πeσᵢ²)."""
        return (0.5 * jnp.log(2 * jnp.pi * jnp.e * std**2)).sum(axis=-1).mean()

    def entropy_loss(
        self,
        policy_entropy: float,
        update_alpha: bool = True,
    ) -> tuple:
        """
        Compute the entropy bonus and optionally the temperature loss.

        We want H(π) ≥ H_target. The SAC-style dual loss
        -log α · (H(π) - H_target) raises the temperature when
        entropy falls below target and lowers it otherwise.
        """
        entropy_bonus = self.alpha * policy_entropy
        if self.auto_tune and update_alpha:
            # Dual gradient descent on the temperature
            alpha_loss = -self.log_alpha * (policy_entropy - self.target_entropy)
            return entropy_bonus, alpha_loss
        return entropy_bonus, 0.0

    def robust_policy_loss(
        self,
        advantages: jnp.ndarray,
        log_probs: jnp.ndarray,
        old_log_probs: jnp.ndarray,
        policy_entropy: float,
        clip_ratio: float = 0.2,
    ) -> float:
        """
        PPO loss with entropy regularization: L = L_clip - α·H(π).

        High entropy prevents overconfident policies that fail
        on real hardware.
        """
        # Standard PPO clipped objective
        ratio = jnp.exp(log_probs - old_log_probs)
        clipped = jnp.clip(ratio, 1 - clip_ratio, 1 + clip_ratio)
        policy_loss = -jnp.minimum(ratio * advantages, clipped * advantages).mean()
        # Entropy bonus (negative because we minimize the loss)
        entropy_bonus = -self.alpha * policy_entropy
        return policy_loss + entropy_bonus
```
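A quick smoke test of the losses on random dummy batches; all shapes and values are illustrative:

```python
algo = MaxEntropyPPO(entropy_coef=0.02, target_entropy=-4.0)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
advantages = jax.random.normal(k1, (256,))
log_probs = jax.random.normal(k2, (256,)) * 0.1
old_log_probs = log_probs + jax.random.normal(k3, (256,)) * 0.01
logits = jax.random.normal(key, (256, 8))   # 8 discrete actions

H = algo.policy_entropy(logits)
loss = algo.robust_policy_loss(advantages, log_probs, old_log_probs, H)
bonus, alpha_loss = algo.entropy_loss(H)    # entropy bonus + temperature loss
```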
Minimize information gap between sim and real:
```python
class InformationTheoreticBridge:
    """
    Bridge sim and real via information-theoretic measures.

    Key insight: we can't match the physics exactly, but we can
    match the *information content* of observations.
    """

    def mutual_information(
        self,
        sim_obs: jnp.ndarray,
        real_obs: jnp.ndarray,
    ) -> float:
        """
        Estimate I(sim; real) - how much sim tells us about real.

        High MI = sim is predictive of real (good).
        Low MI  = sim and real are independent (bad).
        """
        # Gaussian estimate; a MINE estimator is the nonparametric alternative
        joint_cov = jnp.cov(sim_obs.T, real_obs.T)
        n = sim_obs.shape[1]
        cov_sim = joint_cov[:n, :n]
        cov_real = joint_cov[n:, n:]
        cov_joint = joint_cov
        # MI = 0.5 * log(|Σ_sim||Σ_real| / |Σ_joint|)
        mi = 0.5 * (
            jnp.linalg.slogdet(cov_sim)[1]
            + jnp.linalg.slogdet(cov_real)[1]
            - jnp.linalg.slogdet(cov_joint)[1]
        )
        return mi

    def domain_divergence(
        self,
        sim_obs: jnp.ndarray,
        real_obs: jnp.ndarray,
        method: str = "wasserstein",
    ) -> float:
        """
        Measure divergence between the sim and real distributions.

        Lower divergence = better sim2real transfer.
        """
        if method == "kl":
            # KL(real || sim) - how surprised is sim by real?
            # Requires density estimation; not implemented here.
            raise NotImplementedError("KL divergence requires density estimation")
        elif method == "wasserstein":
            # W₂ distance (optimal transport)
            mu_sim = jnp.mean(sim_obs, axis=0)
            mu_real = jnp.mean(real_obs, axis=0)
            cov_sim = jnp.cov(sim_obs.T)
            cov_real = jnp.cov(real_obs.T)
            # Exact: W₂² = ||μ_sim - μ_real||² + Tr(Σ_sim + Σ_real - 2(Σ_sim^½ Σ_real Σ_sim^½)^½)
            mean_diff = jnp.sum((mu_sim - mu_real) ** 2)
            # Simplified: use the squared Frobenius norm of the covariance difference
            cov_diff = jnp.sum((cov_sim - cov_real) ** 2)
            return jnp.sqrt(mean_diff + cov_diff)
        elif method == "mmd":
            # Maximum Mean Discrepancy with an RBF kernel
            def rbf_kernel(x, y, sigma=1.0):
                return jnp.exp(-jnp.sum((x - y) ** 2) / (2 * sigma**2))

            # MMD² = E[k(x,x')] + E[k(y,y')] - 2E[k(x,y)]
            xx = jnp.mean(jax.vmap(lambda x: jax.vmap(lambda x2: rbf_kernel(x, x2))(sim_obs))(sim_obs))
            yy = jnp.mean(jax.vmap(lambda y: jax.vmap(lambda y2: rbf_kernel(y, y2))(real_obs))(real_obs))
            xy = jnp.mean(jax.vmap(lambda x: jax.vmap(lambda y: rbf_kernel(x, y))(real_obs))(sim_obs))
            return xx + yy - 2 * xy

    def entropy_matching_loss(
        self,
        sim_obs: jnp.ndarray,
        real_obs: jnp.ndarray,
    ) -> float:
        """
        Match entropy profiles between sim and real.

        If H(sim) >> H(real): sim too noisy, reduce randomization.
        If H(sim) << H(real): sim too deterministic, increase randomization.
        """
        def estimate_entropy(obs):
            # Estimate via the log-determinant of the covariance (Gaussian assumption)
            cov = jnp.cov(obs.T)
            return 0.5 * jnp.linalg.slogdet(cov)[1]

        H_sim = estimate_entropy(sim_obs)
        H_real = estimate_entropy(real_obs)
        return (H_sim - H_real) ** 2
```
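A minimal check on synthetic data, where the "real" stream is a noisy affine copy of sim, so mutual information should be comparatively high and the divergences small. All data here is synthetic:

```python
bridge = InformationTheoreticBridge()

key = jax.random.PRNGKey(42)
k_sim, k_real = jax.random.split(key)
# Synthetic stand-ins: "real" is a shifted, scaled, noisy copy of sim
sim_obs = jax.random.normal(k_sim, (512, 4))
real_obs = sim_obs * 0.9 + 0.1 + jax.random.normal(k_real, (512, 4)) * 0.2

mi = bridge.mutual_information(sim_obs, real_obs)
w2 = bridge.domain_divergence(sim_obs, real_obs, "wasserstein")
gap = bridge.entropy_matching_loss(sim_obs, real_obs)
```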
```
┌────────────────────────────────────────────────────────────────────┐
│                      ENTROPY-DRIVEN SIM2REAL                       │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  PHASE 1: Maximum Entropy Simulation                               │
│  ───────────────────────────────────                               │
│                                                                    │
│  Domain Params         Policy               Observations           │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐         │
│  │ H(θ) = max  │ ───▶ │ H(π|s) = αT │ ───▶ │ H(o) = high │         │
│  │ friction ∈  │      │ explore all │      │ diverse     │         │
│  │ [0.3, 1.5]  │      │ actions     │      │ experiences │         │
│  │ mass ∈      │      └─────────────┘      └─────────────┘         │
│  │ [0.8, 1.2]  │                                                   │
│  └─────────────┘                                                   │
│                                                                    │
│  PHASE 2: Information Bridge                                       │
│  ───────────────────────────                                       │
│                                                                    │
│  Sim Distribution      Divergence           Real Distribution      │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐         │
│  │ p(o|sim)    │ ───▶ │ W(sim,real) │ ◀─── │ p(o|real)   │         │
│  │ (broad)     │      │ minimize    │      │ (narrow)    │         │
│  └─────────────┘      └─────────────┘      └─────────────┘         │
│                              │                                     │
│                     Adapt randomization                            │
│                     to match real entropy                          │
│                                                                    │
│  PHASE 3: Entropy Collapse at Deployment                           │
│  ───────────────────────────────────────                           │
│                                                                    │
│  Policy trained on     Deployed on          Result                 │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐         │
│  │ ALL possible│ ───▶ │ ONE actual  │ ───▶ │ ROBUST to   │         │
│  │ worlds      │      │ world       │      │ any world   │         │
│  │ (superpos.) │      │ (collapsed) │      │ in support  │         │
│  └─────────────┘      └─────────────┘      └─────────────┘         │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```
```python
from ksim import PPOTask, PhysicsRandomizer
from ksim.randomizers import (
    StaticFrictionRandomizer,
    MassMultiplicationRandomizer,
    JointDampingRandomizer,
)


class EntropyBridgedKBotTask(PPOTask):
    """K-Bot training with entropy-driven sim2real."""

    # High-entropy domain randomization
    physics_randomizers = [
        StaticFrictionRandomizer(scale=0.5),  # Wide friction range
        MassMultiplicationRandomizer(         # Body mass variation
            body_name="torso",
            scale=0.2,
        ),
        JointDampingRandomizer(scale=0.3),    # Damping variation
        # ... more randomizers for max entropy
    ]

    # Max-entropy RL config
    entropy_coef = 0.02     # High entropy bonus
    target_entropy = -4.0   # Automatic temperature tuning

    def compute_entropy_metrics(self, trajectory):
        """Track entropy throughout training."""
        policy_entropy = self.policy.entropy(trajectory.obs)
        obs_entropy = self.estimate_obs_entropy(trajectory.obs)
        return {
            "policy_entropy": policy_entropy,
            "observation_entropy": obs_entropy,
            "entropy_ratio": policy_entropy / obs_entropy,
        }

    def adapt_randomization(self, real_data):
        """
        Adapt domain randomization to match the real robot's entropy.

        This is the key insight: we don't try to match exact
        parameters, we match the *entropy profile*.
        """
        sim_obs = self.collect_sim_observations()
        real_obs = real_data.observations

        # Compute the entropy gap
        H_sim = self.estimate_entropy(sim_obs)
        H_real = self.estimate_entropy(real_obs)

        if H_sim > H_real * 1.5:
            # Sim too noisy, reduce randomization
            self.reduce_randomization_scale(0.9)
        elif H_sim < H_real * 0.7:
            # Sim too deterministic, increase randomization
            self.increase_randomization_scale(1.1)

        # Match distributions via the Wasserstein distance
        W = self.wasserstein_distance(sim_obs, real_obs)
        self.log("wasserstein_distance", W)
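```

The same entropy-gap rule in isolation, as a self-contained sketch that does not assume the ksim API. The thresholds 1.5/0.7 and scale factors 0.9/1.1 mirror the task above; the observation batches are synthetic:

```python
import jax
import jax.numpy as jnp

def entropy_gap_update(sim_obs, real_obs, scale):
    """Adjust the randomization scale from the sim/real entropy gap."""
    def gaussian_entropy(obs):
        # Log-det of the covariance (Gaussian assumption), as above
        return 0.5 * jnp.linalg.slogdet(jnp.cov(obs.T))[1]

    H_sim, H_real = gaussian_entropy(sim_obs), gaussian_entropy(real_obs)
    if H_sim > H_real * 1.5:   # sim too noisy
        return scale * 0.9
    if H_sim < H_real * 0.7:   # sim too deterministic
        return scale * 1.1
    return scale               # entropy profiles already match

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
sim_obs = jax.random.normal(k1, (256, 6)) * 2.0   # over-randomized sim
real_obs = jax.random.normal(k2, (256, 6)) * 0.5
scale = entropy_gap_update(sim_obs, real_obs, scale=1.0)  # → 0.9
```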
Why does this work? If the policy π is optimal for all simulations in the support of p(sim), and the real world lies in that support, then π works in the real world. Entropy maximization yields the widest possible support, making that condition as easy to satisfy as possible.

High H(π|s) means the policy does not overfit to a single solution: it maintains multiple viable strategies and can adapt when reality differs from simulation.

Sim and real share mutual information I(sim; real). Maximizing it means the simulator captures what matters about reality; ignoring it means overfitting to sim-specific artifacts.
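The two maximum-entropy facts this argument leans on are standard results, stated here for reference:

```latex
% Bounded support: the uniform distribution maximizes differential entropy.
\max_{p \,:\, \operatorname{supp}(p) \subseteq [a,b]} H(p) = \log(b-a),
\qquad \text{attained by } p = \mathcal{U}[a,b].

% Fixed mean and variance: the Gaussian maximizes differential entropy.
\max_{p \,:\, \mathbb{E}[x]=\mu,\ \operatorname{Var}[x]=\sigma^2} H(p)
  = \tfrac{1}{2}\log\!\left(2\pi e \sigma^2\right),
\qquad \text{attained by } p = \mathcal{N}(\mu, \sigma^2).
```

These are exactly the two cases used by `EntropyMaximizingRandomizer`: uniform sampling over bounded parameter ranges, and Gaussian sampling in `adaptive_entropy` when matching observed real-world moments.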
entropy-sim2real (-1) ⊗ kos-firmware (+1) ⊗ mujoco-scenes (0) = 0 ✓
entropy-sim2real (-1) ⊗ jaxlife-open-ended (+1) ⊗ wobble-dynamics (0) = 0 ✓
ksim-rl (-1) ⊗ kos-firmware (+1) ⊗ entropy-sim2real (-1) = -1 (needs a +1 to balance)
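A minimal sketch of the composition check, assuming ⊗ is addition in GF(3) reported in balanced form {-1, 0, +1}, which matches the triads above:

```python
def compose(*trits: int) -> int:
    """Compose trits in GF(3), reported in balanced form {-1, 0, +1}."""
    s = sum(trits) % 3
    return s - 3 if s == 2 else s   # map 2 ↦ -1

assert compose(-1, +1, 0) == 0        # entropy-sim2real ⊗ kos-firmware ⊗ mujoco-scenes
assert compose(-1, +1, -1) == -1      # unbalanced: needs a +1 to close the triad
assert compose(-1, +1, -1, +1) == 0
```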
Related skills:

- ksim-rl (-1): Base RL training
- kos-firmware (+1): Deployment target
- ergodicity (0): Ergodic theory foundations
- birkhoff-average (-1): Time averages
- fokker-planck-analyzer (-1): Distribution dynamics

```bibtex
@inproceedings{haarnoja2018sac,
  title     = {Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor},
  author    = {Haarnoja, Tuomas and others},
  booktitle = {ICML},
  year      = {2018}
}

@inproceedings{tobin2017domain,
  title     = {Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World},
  author    = {Tobin, Josh and others},
  booktitle = {IROS},
  year      = {2017}
}

@article{zhao2020sim,
  title   = {Sim-to-Real Transfer in Deep Reinforcement Learning},
  author  = {Zhao, Wenshuai and others},
  journal = {IEEE TNNLS},
  year    = {2020}
}
```