npx claudepluginhub plurigrid/asi --plugin asi

This skill uses the workspace's default tool permissions.
**Trit**: -1 (MINUS - analysis/verification)
**Color**: #DBA51D (Golden Yellow)
**URI**: skill://evla-vla#DBA51D
EdgeVLA is an open-source edge vision-language-action model for robotics. It standardizes diverse robotics datasets from the Open-X Embodiment (OXE) collection for consistent training and deployment.
┌────────────────────────────────────────────────────────────────┐
│                      EdgeVLA ARCHITECTURE                      │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                Open-X Embodiment Datasets                │  │
│  │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐          │  │
│  │ │  DROID  │ │ Bridge  │ │ LIBERO  │ │  RT-X   │ + 60...  │  │
│  │ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘          │  │
│  └──────┼───────────┼───────────┼───────────┼───────────────┘  │
│         │           │           │           │                  │
│         ▼           ▼           ▼           ▼                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │           OXE_DATASET_CONFIGS Standardization            │  │
│  │  • image_obs_keys: primary, secondary, wrist cameras     │  │
│  │  • state_encoding: POS_EULER, POS_QUAT, JOINT            │  │
│  │  • action_encoding: EEF_POS, JOINT_POS                   │  │
│  └──────────────────────────────────────────────────────────┘  │
│                               │                                │
│                               ▼                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                   Unified Data Format                     │  │
│  │  ┌────────────────────────────────────────────────────┐  │  │
│  │  │ Images: resized, normalized, multi-view            │  │  │
│  │  │ States: 8-dim standardized proprioception          │  │  │
│  │  │ Actions: 7-dim EEF or joint actions                │  │  │
│  │  └────────────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────┘  │
│                               │                                │
│                               ▼                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                        VLA Model                         │  │
│  │     Vision Encoder → Language Model → Action Decoder     │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘
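The unified format fixes the tensor shapes that every dataset is mapped into. A minimal sketch of one standardized sample (plain NumPy, not the EdgeVLA API; the exact 7-dim action layout of position delta, rotation delta, and gripper is an assumption based on common OXE conventions):

import numpy as np

# One standardized sample, matching the shapes in the diagram above.
image = np.zeros((224, 224, 3), dtype=np.uint8)   # one resized camera view (multi-view stacks several)
state = np.zeros(8, dtype=np.float32)             # 8-dim standardized proprioception
action = np.array(
    [0.01, 0.00, -0.02,   # assumed: end-effector position delta (x, y, z)
     0.00, 0.00, 0.10,    # assumed: end-effector rotation delta
     1.00],               # assumed: gripper command
    dtype=np.float32,
)                                                 # 7-dim EEF action
assert state.shape == (8,) and action.shape == (7,)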
from evla.config import OXE_DATASET_CONFIGS, StateEncoding, ActionEncoding

# DROID dataset configuration
droid_config = OXE_DATASET_CONFIGS["droid"]
# {
#     "image_obs_keys": {
#         "primary": "exterior_image_1_left",
#         "secondary": "exterior_image_2_left",
#         "wrist": "wrist_image_left",
#     },
#     "state_encoding": StateEncoding.POS_QUAT,
#     "action_encoding": ActionEncoding.EEF_POS,
# }

# Bridge dataset configuration
bridge_config = OXE_DATASET_CONFIGS["bridge"]
# {
#     "image_obs_keys": {
#         "primary": "image_0",
#         "wrist": "image_1",
#     },
#     "state_encoding": StateEncoding.POS_EULER,
#     "action_encoding": ActionEncoding.EEF_POS,
# }
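A new dataset can be described with the same schema. A minimal sketch, assuming OXE_DATASET_CONFIGS is a plain dict that accepts extra entries; the dataset name and raw observation keys below are hypothetical:

from evla.config import OXE_DATASET_CONFIGS, StateEncoding, ActionEncoding

# Hypothetical dataset entry, shown only to illustrate the config schema.
OXE_DATASET_CONFIGS["my_lab_arm"] = {
    "image_obs_keys": {
        "primary": "front_camera",   # hypothetical raw keys in the source dataset
        "wrist": "wrist_camera",
    },
    "state_encoding": StateEncoding.POS_EULER,
    "action_encoding": ActionEncoding.EEF_POS,
}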
from evla.config import OXE_NAMED_MIXTURES

# Comprehensive multi-dataset training
oxe_magic_soup = OXE_NAMED_MIXTURES["oxe_magic_soup"]

# RT-X reproduction
rtx_mixture = OXE_NAMED_MIXTURES["rtx"]

# Custom mixture with weights
custom_mixture = {
    "droid": 1.0,
    "bridge": 0.5,
    "libero": 0.3,
}
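The weights in a custom mixture are relative rather than absolute. A minimal sketch (plain Python) of how such weights would translate into per-dataset sampling probabilities, assuming the loader samples datasets in proportion to their weights:

custom_mixture = {"droid": 1.0, "bridge": 0.5, "libero": 0.3}

# Normalize relative weights into sampling probabilities.
total = sum(custom_mixture.values())
sampling_probs = {name: weight / total for name, weight in custom_mixture.items()}
# {'droid': 0.556, 'bridge': 0.278, 'libero': 0.167} (rounded)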
import torch

from evla import EdgeVLA, DataLoader

# Load model
model = EdgeVLA.from_pretrained("kscale/evla-base")

# Create dataloader with mixture
loader = DataLoader(
    mixture="oxe_magic_soup",
    batch_size=32,
    image_size=(224, 224),
)

# Training loop
for batch in loader:
    images = batch["images"]    # (B, V, H, W, C)
    states = batch["states"]    # (B, 8)
    actions = batch["actions"]  # (B, 7)
    loss = model.train_step(images, states, actions)

# Inference (camera and robot are placeholder hardware interfaces)
with torch.no_grad():
    image = camera.capture()
    state = robot.get_state()
    action = model.predict(image, state, "pick up the red block")
    robot.execute(action)
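For deployment, inference typically runs in a closed loop at a fixed control rate. A minimal sketch building on the snippet above (camera, robot, and the 10 Hz rate are placeholders; model.predict is used with the same signature as above):

import time

import torch

CONTROL_HZ = 10  # assumed control frequency; tune for the target robot

while True:
    tick = time.monotonic()
    image = camera.capture()     # placeholder camera interface
    state = robot.get_state()    # placeholder robot interface
    with torch.no_grad():
        action = model.predict(image, state, "pick up the red block")
    robot.execute(action)
    # Sleep out the remainder of the control period, if any.
    time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.monotonic() - tick)))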
This skill participates in balanced triads:
evla-vla (-1) ⊗ kos-firmware (+1) ⊗ mujoco-scenes (0) = 0 ✓
ksim-rl (-1) ⊗ topos-generate (+1) ⊗ evla-vla (-1) = needs balancing
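A triad balances when its trits sum to 0 in GF(3). A minimal sketch of that check in plain Python (skill names and trit values taken from the lines above):

def triad_sum(*trits):
    """Sum trits modulo 3, mapped back to {-1, 0, +1}; 0 means the triad is balanced."""
    s = sum(trits) % 3
    return s - 3 if s == 2 else s

triad_sum(-1, +1, 0)    # evla-vla, kos-firmware, mujoco-scenes -> 0 (balanced)
triad_sum(-1, +1, -1)   # ksim-rl, topos-generate, evla-vla -> -1 (needs balancing)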
Related skills:

- kos-firmware (+1): Robot firmware for deployment
- ksim-rl (-1): RL training for locomotion
- kbot-humanoid (-1): K-Bot configuration
- mujoco-scenes (0): Scene composition

@misc{evla2024,
title={EdgeVLA: Open-Source Edge Vision-Language-Action Model},
author={K-Scale Labs},
year={2024},
url={https://github.com/kscalelabs/evla}
}
@article{openvla2024,
title={OpenVLA: An Open-Source Vision-Language-Action Model},
author={Kim, Moo Jin and others},
journal={arXiv:2406.09246},
year={2024}
}