CodeAct (Code-as-Action) agent pattern library: an implementation of the CodeAct paradigm from Wang et al. 2024 ("Executable Code Actions Elicit Better LLM Agents", ICML 2024), in which an LLM agent uses Python code execution as its universal action space instead of structured JSON tool calls. Covers:

- The core CodeAct insight: Python is Turing-complete and composable, while JSON tool calls limit each action to one function call.
- The architecture: the LLM generates Python code, the code is executed in a sandbox, the output is fed back as an observation, and the loop continues until the task is done.
- Key advantages over JSON function calling: composability (chain operations in one action), control flow (if/for/while in one action), error recovery (try/except in one action), native math and data manipulation, and an unlimited action space without redefining tools.
- Benchmark gains reported in the paper: CodeAct outperforms JSON tool use by 20% on average across multiple benchmarks such as MINT and ToolBench.
- The sandbox requirement: E2B (cloud), Daytona (local Docker), Modal (serverless), Pyodide (browser), or a Jupyter kernel (notebook environments).
- The security model: sandbox isolation, network restrictions, filesystem restrictions, resource limits, and a package whitelist.
- Comparisons: with ReAct (string actions parsed by regex vs. Python directly) and with function calling (structured outputs and validation vs. flexibility and composability).
- Production frameworks: OpenDevin uses CodeAct, Hugging Face Smolagents has a native CodeAgent class, AutoGen has Code Executor agents, LangChain has PythonREPLTool but not full CodeAct.
- Use cases where CodeAct shines: data analysis, math problems, multi-step transformations, web scraping, file operations, scientific computing.
- Limitations: security risk if the sandbox is misconfigured, harder to enforce structured outputs, requires an execution environment.
Use when an agent needs to perform complex multi-step computations, data manipulations, or operations that don't fit into discrete tool calls. Differentiates from generic ReAct by using executable code as the action layer.
```
npx claudepluginhub arnwaldn/atum-plugins-collection --plugin atum-ai-ml
```

This skill uses the workspace's default tool permissions.
Pattern published by **Wang et al. 2024** (UIUC + Anthropic, ICML 2024). "Executable Code Actions Elicit Better LLM Agents" proposes an alternative to JSON function calling: let the LLM write and run **Python code** as its universal action.
JSON function calling vs CodeAct:

**JSON function calling**

```
Action = call_tool({
  "name": "search",
  "args": {"q": "Q1"}
})
```

→ 1 action = 1 call
→ Rigid, predefined tools

**CodeAct**

```python
results = search("Q1")
filtered = [r for r in results if r.score > 0.8]
summary = summarize(filtered[:3])
print(summary)
```

→ 1 action = N calls + control flow
→ Unlimited composability
Python is Turing-complete and composable; JSON function calling is not. CodeAct exploits this difference to give the agent far greater expressivity.
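The advantages listed above (composability, control flow, error recovery) can all appear in one action. A hedged sketch, with `search` and `summarize` stubbed out since they stand in for sandbox-exposed tools:

```python
# Hypothetical single CodeAct action combining a loop, filtering,
# and error recovery. In a real agent, `search` and `summarize`
# would be tools pre-loaded into the sandbox; here they are stubs.

def search(query):
    return [{"title": f"result for {query}", "score": 0.9},
            {"title": "low-quality hit", "score": 0.3}]

def summarize(results):
    return "; ".join(r["title"] for r in results)

queries = ["Q1", "Q2"]
summaries = []
for q in queries:                                        # control flow
    try:
        results = search(q)                              # tool call
        good = [r for r in results if r["score"] > 0.8]  # composition
        summaries.append(summarize(good))
    except Exception:
        summaries.append(f"search failed for {q}")       # error recovery

print(summaries)  # → ['result for Q1', 'result for Q2']
```

With JSON function calling, the same work would require at least four separate tool calls and four LLM round-trips.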
```
┌─────────────────────────────────────────────────────────┐
│                     CodeAct LOOP                        │
└─────────────────────────────────────────────────────────┘

   [TASK]
     │
     ▼
 ┌─────────┐
 │   LLM   │ ─────► generates Python code
 └────┬────┘
      │
      ▼
 ┌──────────────┐
 │   SANDBOX    │  (E2B / Daytona / Modal / Jupyter)
 │              │
 │   run code   │
 │   isolated   │
 │              │
 └─────┬────────┘
       │  stdout / stderr / return value
       ▼
 ┌──────────────┐
 │ OBSERVATION  │
 └──────┬───────┘
        │
        ├─── fed back to the LLM as the next message
        ▼
  (loop continues until the task is done)
```
ReAct + JSON function calling (5 actions):

```
Action 1: search("most populated French cities")
Observation 1: "Paris, Marseille, Lyon..."
Action 2: get_population("Paris")
Observation 2: 2161000
Action 3: get_population("Marseille")
Observation 3: 870000
Action 4: get_population("Lyon")
Observation 4: 522000
Action 5: calculator("(2161000+870000+522000)/1000000")
Observation 5: 3.553
Final: 3.55 million
```

→ 5 LLM round-trips, 5× the latency, 5× the cost.
CodeAct (1 action):

Action 1:
```python
cities = ["Paris", "Marseille", "Lyon"]
populations = [get_population(c) for c in cities]
total_millions = sum(populations) / 1_000_000
print(f"Total: {total_millions:.2f} million")
```

Observation 1: "Total: 3.55 million"
Final: 3.55 million

→ **1 round-trip**, 1× latency, 1× cost. **5x more efficient.**
## Components

### 1. LLM

Any modern LLM capable of generating Python (Claude, GPT-4/5, Llama 3, Qwen, DeepSeek, Gemini). The best performers are models trained on code.

### 2. Sandbox

**Critical**: NEVER run LLM-generated code in the app's own process. Always use an isolated sandbox.
| Sandbox | Type | Pros | Cons |
|---|---|---|---|
| **E2B** | Cloud (managed) | Easy, scalable, persistence | Cost, network latency |
| **Daytona** | Local Docker | Full control, OSS | Setup, resource management |
| **Modal** | Serverless functions | Auto-scaling, GPUs available | Vendor-specific |
| **Pyodide** | Browser WASM | No backend needed | Limited (no filesystem) |
| **Jupyter Kernel** | Notebook | Simple to integrate | Not truly isolated |
| **Plain Docker** | Container | Standard, portable | Startup overhead, manual management |
| **gVisor / Firecracker** | Kernel-isolated VM | Maximum security | Ops complexity |
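All of these options share the same execution contract: run a piece of code, capture its output, and enforce a timeout. A minimal local sketch of that plumbing using only a subprocess; this is NOT a security boundary and is shown only to illustrate the contract (use Docker, gVisor, or E2B in production):

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: int = 10) -> str:
    """Run a code string in a separate Python process with a timeout.
    Illustrative only: a plain subprocess does NOT isolate the
    filesystem or network the way a real sandbox does."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, no user site
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout or proc.stderr
    finally:
        os.unlink(path)  # always clean up the temp file

print(run_untrusted("print(1 + 1)"))  # → 2
```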
### 3. Tool exposure

"Tools" simply become Python functions exposed in the sandbox environment:

```python
# Pre-loaded in the sandbox
def search(query: str) -> list[dict]: ...
def get_population(city: str) -> int: ...
def send_email(to: str, body: str) -> bool: ...

# The LLM can now write:
results = search("Paris attractions")
top_3 = sorted(results, key=lambda x: x["rating"], reverse=True)[:3]
for r in top_3:
    print(r["name"], r["rating"])
```
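One way to ship these tools into the sandbox is to execute their source once before the agent's first action, which is what the `TOOLS_PRELUDE` string in the agent loop stands for. A sketch with hypothetical stub tools (a real deployment might instead install a package inside the sandbox):

```python
# The tools travel as plain Python source, executed once in the sandbox
# before the first agent action. These stubs are illustrative.
TOOLS_PRELUDE = '''
def search(query):
    return [{"title": query, "score": 1.0}]

def get_population(city):
    return {"Paris": 2_161_000}.get(city, 0)
'''

ns = {}
exec(TOOLS_PRELUDE, ns)  # stands in for sandbox.run_code(TOOLS_PRELUDE)
print(ns["get_population"]("Paris"))  # → 2161000
```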
### 4. Agent loop

```python
from e2b_code_interpreter import Sandbox

def codeact_agent(task: str, max_iterations: int = 10):
    sandbox = Sandbox()
    # Pre-load tools as Python functions
    sandbox.run_code(TOOLS_PRELUDE)
    messages = [
        {"role": "system", "content": CODEACT_SYSTEM_PROMPT},
        {"role": "user", "content": task},
    ]
    try:
        for _ in range(max_iterations):
            response = llm_call(messages)
            # Extract the Python code block; no code means a final answer
            code = extract_code_block(response)
            if code is None:
                return response
            # Run in the isolated sandbox
            execution = sandbox.run_code(code)
            observation = execution.text or execution.error
            messages.append({"role": "assistant", "content": response})
            messages.append({"role": "tool", "content": f"```\n{observation}\n```"})
        return "Max iterations reached"
    finally:
        sandbox.kill()  # always release the sandbox, even on early return
```
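The `extract_code_block` helper used in the loop is not shown; a minimal regex-based sketch (the fence string is built programmatically only so it does not break this document's own code fences):

```python
import re

_FENCE = "`" * 3  # literal triple backtick

def extract_code_block(text: str):
    """Return the first fenced python code block in an LLM response,
    or None if there is no code (treated as the final answer)."""
    pattern = _FENCE + r"(?:python)?\n(.*?)" + _FENCE
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1).strip() if m else None

reply = "Let me compute that.\n" + _FENCE + "python\nprint(2 + 2)\n" + _FENCE
print(extract_code_block(reply))  # → print(2 + 2)
```

Production agents often harden this further (e.g. taking the last block, or handling unterminated fences), but the contract stays the same: code string in, or None for a final answer.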
| Benchmark | JSON Tool Use | CodeAct | Gain |
|---|---|---|---|
| MINT (multi-turn) | 24.4% | 33.7% | +9 pts |
| Mini-InterCode SQL | 32.0% | 48.0% | +16 pts |
| ToolBench (real APIs) | 65.6% | 76.9% | +11 pts |
| Average | — | — | +20% vs JSON |
Models trained on code (DeepSeek-Coder, Qwen-Coder, Codestral, Claude Sonnet) benefit the most.
| Framework | CodeAct support | Notes |
|---|---|---|
| Smolagents (Hugging Face) | Native via `CodeAgent` | Modern, simple, recommended for 2026 |
| OpenDevin | Native (it is the core architecture) | Autonomous coding/dev agent |
| AutoGen (Microsoft) | Via `CodeExecutor` agents | Multi-agent friendly |
| LangChain | Partial via `PythonREPLTool` | Not true CodeAct, just a tool |
| Custom build (no framework) | When full control is needed | ~200 lines for an MVP |
Absolute rule: LLM-generated code is untrusted code. Always sandbox it.
```python
sandbox = Sandbox(
    template="python",  # official template
    timeout=30,         # 30 s max
    metadata={"user_id": user.id, "session": session.id},
)
# E2B isolates automatically (no access to the host)
```
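The package whitelist from the security model can be enforced as a pre-execution lint. A sketch using the standard `ast` module; the `ALLOWED_IMPORTS` set is a hypothetical example, and this is defense in depth, not a substitute for sandbox isolation (imports can be hidden behind `__import__` or `importlib`):

```python
import ast

ALLOWED_IMPORTS = {"math", "json", "statistics"}  # example whitelist

def check_imports(code: str) -> list[str]:
    """Return the top-level module names imported by `code` that are
    not on the whitelist. Run before sending code to the sandbox."""
    tree = ast.parse(code)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found += [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.append(node.module.split(".")[0])
    return [m for m in found if m not in ALLOWED_IMPORTS]

print(check_imports("import math\nimport socket"))  # → ['socket']
```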
| Pattern | When to use |
|---|---|
| JSON function calling | Simple tool use, structured outputs are critical |
| ReAct | Complex tool use with explicit reasoning, no sandbox |
| CodeAct | Math, data, composable multi-step work, scientific computing |
| Pure CoT | No tools, reasoning only |
| Tree-of-Thoughts | Multi-option decisions |
| Scenario | Recommendation |
|---|---|
| Math, data analysis, transformations | CodeAct |
| Multi-page web scraping with parsing | CodeAct |
| Deterministic workflow, structured output | JSON function calling |
| Simple API calls (1-2 per turn) | JSON function calling |
| Code generation for the user (not run) | Direct LLM, no agent |
| Latency-critical (<1 s) | JSON function calling (sandbox adds overhead) |
| No sandbox available | JSON function calling |
| Strict compliance (no code execution) | JSON function calling |
Related:

- react-pattern (this plugin)
- reflexion-pattern (this plugin)
- prompt-engineer (this plugin)
- security-expert (atum-compliance)