From greennode-agentbase
Monitor deployed AI agents: view runtime logs, endpoint logs, resource metrics (CPU/RAM), and a unified platform dashboard. Activate for debugging, status checks, or performance inspection.
Slash command: /greennode-agentbase:agentbase-monitor
Monitor, debug, and view status of agents running on GreenNode AgentBase Runtime.
Read the shared auth setup reference at /agentbase skill's references/auth-setup.md for full IAM credential configuration. In brief: run bash .claude/skills/agentbase/scripts/check_credentials.sh iam to verify credentials are configured. NEVER read .greennode.json or .env directly — always use the helper scripts. If check_credentials.sh iam returns MISSING, STOP — you MUST read the "If Credentials Are Not Found" section in /agentbase skill's references/auth-setup.md and follow it exactly. Do NOT skip this or provide your own credential setup instructions.
Fetch logs from an agent runtime container.
Parameters:
- --from N (int, max 5000) -- starting offset (0-based)
- --limit N (int, max 500) -- number of log lines to return
- --from-time ISO (string, optional) -- start of time range filter (ISO 8601)
- --to-time ISO (string, optional) -- end of time range filter (ISO 8601)
- --query TEXT (string, optional) -- keyword search filter
- --order asc|desc (string, optional) -- log ordering
Response (LogSearchResult): totalCount (int), logs (array of LogRecord with timestamp (string) and content (string)).
Command:
# Basic log fetch (most recent 100 entries)
bash .claude/skills/agentbase/scripts/runtime.sh logs $RUNTIME_ID --from 0 --limit 100
# With time range and keyword search
bash .claude/skills/agentbase/scripts/runtime.sh logs $RUNTIME_ID \
--from 0 --limit 100 \
--from-time "2026-03-13T00:00:00Z" \
--to-time "2026-03-13T12:00:00Z" \
--query "error"
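Paging through a larger log set means stepping --from by the page size while staying inside the documented caps (--limit at most 500, --from at most 5000). The following is a minimal sketch of that arithmetic only; the function name log_page_offsets is hypothetical, and it prints offsets rather than calling runtime.sh.

```shell
# Hypothetical helper (not part of the skill's scripts): print the --from
# offsets needed to read `total` entries in pages of `limit`, honoring the
# documented caps (--limit <= 500, --from <= 5000).
log_page_offsets() {
  local total="$1"
  local limit="$2"
  local offset=0
  # Clamp the page size to the documented maximum.
  if [ "$limit" -gt 500 ]; then
    limit=500
  fi
  while [ "$offset" -lt "$total" ] && [ "$offset" -le 5000 ]; do
    echo "$offset"
    offset=$((offset + limit))
  done
}

# Example: 1200 entries read 500 at a time
log_page_offsets 1200 500
```

Each printed offset would then be passed as --from to the logs command, with totalCount from the first response serving as the total.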
Tips:
- Use --from to paginate through large log sets (e.g. --from 100 to skip the first 100 entries)
- Max --limit is 500; max --from is 5000
- Use --query to filter logs by keyword server-side (e.g. --query "error")
- Use --from-time/--to-time to narrow logs to a specific time window
- Each log entry has timestamp and content fields
Fetch logs from a specific endpoint within a runtime.
Parameters: Same as runtime-logs (--from, --limit, --from-time, --to-time, --query).
Command:
bash .claude/skills/agentbase/scripts/runtime.sh endpoints logs $RUNTIME_ID $ENDPOINT_ID \
--from 0 --limit 100
Get CPU and RAM usage metrics for a specific endpoint. Supports historical time range queries.
Query parameters:
- --from-time ISO (string, optional) -- start of time range (ISO 8601)
- --to-time ISO (string, optional) -- end of time range (ISO 8601)
Command:
# Current metrics
bash .claude/skills/agentbase/scripts/runtime.sh endpoints metrics $RUNTIME_ID $ENDPOINT_ID
# Historical metrics with time range
bash .claude/skills/agentbase/scripts/runtime.sh endpoints metrics $RUNTIME_ID $ENDPOINT_ID \
--from-time "2026-03-13T00:00:00Z" --to-time "2026-03-13T12:00:00Z"
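The ISO 8601 bounds for --from-time and --to-time do not have to be typed by hand. As a rough sketch, a relative window such as "the last 6 hours" could be generated like this; last_hours_window is a hypothetical helper, and it assumes a date that accepts -d @EPOCH (GNU coreutils and BusyBox do; BSD/macOS date uses -r instead).

```shell
# Hypothetical helper: print "--from-time <start> --to-time <end>" covering
# the last N hours, as UTC ISO 8601 timestamps.
# Assumes `date -d @EPOCH` is supported (GNU coreutils / BusyBox).
last_hours_window() {
  local hours="$1"
  local fmt='+%Y-%m-%dT%H:%M:%SZ'
  local now start end
  now=$(date -u +%s)                                   # current time, epoch seconds
  start=$(date -u -d "@$((now - hours * 3600))" "$fmt")
  end=$(date -u -d "@$now" "$fmt")
  echo "--from-time $start --to-time $end"
}

last_hours_window 6
```

The output can be appended to a logs or metrics command through an unquoted substitution, e.g. `... endpoints metrics $RUNTIME_ID $ENDPOINT_ID $(last_hours_window 6)`, so that it splits into the four expected arguments.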
Response (AgentRuntimeEndpointMetrics):
- cpuCoresUsage -- array of {timestamp (date-time), value (double)} data points
- memoryBytesUsage -- array of {timestamp (date-time), value (int64)} data points
Fetch the infrastructure-level events emitted while deploying and running an endpoint. This is the first place to look when an endpoint is not ACTIVE but the logs are empty: startup failures (image pull errors, out-of-memory kills, scheduling/capacity failures, health-probe failures) surface here as events before any application log is produced.
Command:
bash .claude/skills/agentbase/scripts/runtime.sh endpoints events $RUNTIME_ID $ENDPOINT_ID
Response: array of event objects, each with:
- message (string) -- the event message (e.g. Back-off pulling image ..., out of memory, insufficient capacity)
- lastTimestamp (date-time) -- when the event last occurred
Common event signatures (match against the message text):
| Event message indicates | Meaning | Next step |
|---|---|---|
| Image pull failure (e.g. pulling image / ErrImagePull) | Image cannot be pulled | Verify imageUrl and registry credentials (imageAuth) on the runtime version |
| Out of memory (e.g. OOM / out of memory) | Instance exceeded its memory limit | Scale up the flavor or fix the memory leak (cross-check metrics) |
| Scheduling / capacity failure (e.g. insufficient / no capacity) | No capacity to place the instance | Usually a flavor/capacity issue; try a smaller flavor or retry later |
| Health probe failure (e.g. probe failed) | /health not returning 200 | Verify the health endpoint (see Log Analysis Guide) |
| Crash / restart loop | Instance keeps crashing on startup | Check endpoint logs for the startup traceback |
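Matching event messages against the signatures above can be sketched as a simple case statement. The helper name suggest_next_step is hypothetical, and the match strings are only the examples listed in the table; real event wording may differ, and a short pattern like "oom" can match unrelated words.

```shell
# Hypothetical helper: map an event message to a next step from the
# "Common event signatures" table. Match strings are the table's examples.
suggest_next_step() {
  local msg
  # Lowercase the message so matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"pulling image"*|*errimagepull*)
      echo "Verify imageUrl and registry credentials (imageAuth)" ;;
    *oom*|*"out of memory"*)
      echo "Scale up the flavor or fix the memory leak" ;;
    *insufficient*|*"no capacity"*)
      echo "Try a smaller flavor or retry later" ;;
    *"probe failed"*)
      echo "Verify the health endpoint" ;;
    *)
      echo "No known signature matched; check endpoint logs" ;;
  esac
}

suggest_next_step "Back-off pulling image ..."
```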
Query distributed traces for agent runtimes. These commands are a thin passthrough to the platform's tracing backend: the accepted query parameters (other than traceId / tagKey) and the response body shape are defined by that backend, not by the runtime API spec. Pass backend query params verbatim via repeated --param key=value; the response is the backend's raw JSON string.
Param semantics not documented here. Do NOT invent trace query param names. If the user needs specific filters (time range, service, tags, min duration, etc.), source the exact param keys from the tracing backend's own documentation or the console's network calls before using them. Without that, the commands still work as a raw passthrough.
Commands:
# Search traces (params forwarded to the tracing backend)
bash .claude/skills/agentbase/scripts/runtime.sh traces search --param key=value [--param key=value ...]
# Get a single trace by ID
bash .claude/skills/agentbase/scripts/runtime.sh traces get $TRACE_ID [--param key=value ...]
# List available values for a trace tag key (for building filters)
bash .claude/skills/agentbase/scripts/runtime.sh traces tag-values $TAG_KEY [--param key=value ...]
--param values are URL-encoded automatically. The response is returned as-is (a JSON string from the backend) — parse it according to the backend's schema.
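Because every backend filter is passed as a repeated --param key=value, assembling the command from a list of pairs is mostly string building. A minimal sketch follows; trace_search_cmd is a hypothetical wrapper that prints the command instead of running it, and key1/key2 are placeholders, not real backend parameter names.

```shell
# Hypothetical wrapper: turn key=value pairs into repeated --param flags.
# It only prints the resulting command, so it is safe to try offline.
trace_search_cmd() {
  local cmd="bash .claude/skills/agentbase/scripts/runtime.sh traces search"
  local pair
  for pair in "$@"; do
    case "$pair" in
      *=*) cmd="$cmd --param $pair" ;;
      *)   echo "skipping '$pair': expected key=value" >&2 ;;
    esac
  done
  echo "$cmd"
}

# key1/key2 are placeholders; source real keys from the tracing backend's docs.
trace_search_cmd key1=value1 key2=value2
```

The printed form is not quote-safe for values containing spaces; it is meant for reviewing the command before running it.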
| Feature | Status |
|---|---|
| Log filtering by level (INFO/WARN/ERROR) | Not supported — all log levels are returned together |
| Log time range filter | Supported — use --from-time/--to-time |
| Log keyword search | Supported — use --query |
| Historical metrics | Supported — use --from-time/--to-time |
| Log streaming/tailing | Not supported — use polling as a workaround (see below) |
| Alerting/thresholds | Not supported |
Pseudo-tailing pattern: To approximate log tailing, poll the logs command every 5-10 seconds with an increasing --from offset. Inform the user of the polling limit at the start (e.g., "Tailing logs for up to 5 minutes..."). After reaching the limit, inform the user and offer to restart tailing. Be mindful of rate limits:
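OFFSET=0        # next --from offset to request
LIMIT=100       # entries per poll (max 500)
MAX_POLLS=60    # 60 polls x 5 s sleep = about 5 minutes
for i in $(seq 1 "$MAX_POLLS"); do
  RESULT=$(bash .claude/skills/agentbase/scripts/runtime.sh logs "$RUNTIME_ID" --from "$OFFSET" --limit "$LIMIT")
  # Count the entries in this batch; jq errors (e.g. non-JSON output) are ignored
  BATCH_SIZE=$(echo "$RESULT" | jq '.logs | length' 2>/dev/null)
  if [ "$BATCH_SIZE" -gt 0 ] 2>/dev/null; then
    echo "$RESULT" | jq -r '.logs[].content'
    # Advance past the entries just printed
    OFFSET=$((OFFSET + BATCH_SIZE))
  fi
  sleep 5
done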
When reviewing logs, look for these patterns:
| Pattern | Meaning | Next Step |
|---|---|---|
| Traceback (most recent call last) | Python exception; read the last line for the actual error | Check the exception type and message at the bottom of the traceback |
| ModuleNotFoundError: No module named '...' | Missing dependency | Add the module to requirements.txt and rebuild |
| ImportError: cannot import name '...' | Wrong package version or API change | Check package version compatibility |
| ConnectionRefusedError / ConnectionError | Cannot reach external service | Verify the service URL, check if auth credentials are injected correctly |
| 401 Unauthorized / 403 Forbidden | Authentication/authorization failure | Check IAM token, service account permissions, or external API key |
| OSError: [Errno 98] Address already in use | Port conflict (usually 8080) | Ensure only one process binds to port 8080 |
| MemoryError / Killed | Out of memory | Scale up flavor or optimize memory usage |
| TimeoutError / ReadTimeout | External API or LLM call timed out | Increase timeout, check LLM endpoint health |
| KeyError: '...' | Missing expected field in payload/response | Check payload format matches what handler expects |
| Health check failed | /health endpoint not returning 200 | Verify @app.ping is defined and returns PingStatus.HEALTHY |
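A first pass over fetched log lines can apply a few of the patterns above mechanically. The sketch below is one way to do that; diagnose_logs is a hypothetical helper that reads plain log lines on stdin (for example the output of jq -r '.logs[].content') and labels the ones it recognizes. It covers only a subset of the table.

```shell
# Hypothetical helper: read log lines on stdin and print a short diagnosis
# for each line matching a pattern from the table above.
diagnose_logs() {
  local line
  while IFS= read -r line; do
    case "$line" in
      *ModuleNotFoundError*)
        echo "missing dependency: $line" ;;
      *ConnectionRefusedError*|*ConnectionError*)
        echo "cannot reach external service: $line" ;;
      *"401 Unauthorized"*|*"403 Forbidden"*)
        echo "auth failure: $line" ;;
      *MemoryError*|*Killed*)
        echo "out of memory: $line" ;;
      *TimeoutError*|*ReadTimeout*)
        echo "external call timed out: $line" ;;
      *"Health check failed"*)
        echo "health endpoint failing: $line" ;;
    esac
  done
}

# Sample input; lines that match no pattern are dropped.
printf '%s\n' \
  "INFO server started" \
  "ModuleNotFoundError: No module named 'requests'" \
  "Health check failed" | diagnose_logs
```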
Use this flow to diagnose common issues:
Agent not responding?
├─ Check runtime status (/agentbase-deploy runtime get)
│ ├─ Status = FAILED → Check runtime logs (see runtime-logs above)
│ ├─ Status = CREATING → Wait, then re-check
│ └─ Status = ACTIVE → Check endpoint logs (see endpoint-logs above)
│ ├─ Logs show Python traceback → Fix the code error
│ ├─ Logs show "Health check failed" → Fix health endpoint
│ ├─ No recent logs → Check endpoint events (see events above) for infrastructure-level
│ │ failures (image pull, out-of-memory, capacity), then metrics
│ └─ Logs look normal → Issue may be in request routing, check endpoint URL
Agent returns errors (4xx/5xx)?
├─ 500 Internal Server Error → Check endpoint logs for traceback (see endpoint-logs above)
├─ 502 Bad Gateway → Container crashed or not ready, check runtime logs (see runtime-logs above)
├─ 503 Service Unavailable → Container starting up or overloaded, check metrics (see metrics above)
└─ 401/403 → Check if agent's outbound auth is configured (/agentbase-identity)
Agent is slow?
├─ Check metrics for CPU/RAM (see metrics above)
│ ├─ CPU near limit → CPU-bound (e.g., stuck loop, heavy computation)
│ │ └─ Scale up flavor or optimize code
│ ├─ RAM near limit → Memory-bound (e.g., large model in memory, data leak)
│ │ └─ Scale up flavor or fix memory leak
│ └─ Both low → Bottleneck is external (LLM API, database, network)
│ └─ Check logs for slow external calls, add request timing
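The "Agent is slow?" branch reduces to comparing CPU and RAM usage against their limits. A small sketch of that comparison is below; classify_bottleneck is a hypothetical helper taking whole-number percentages, and the 90% "near limit" threshold is an assumption rather than a platform-defined value.

```shell
# Hypothetical helper: classify a slow agent from CPU and RAM usage, each
# given as a whole-number percentage of the flavor limit.
# The 90% threshold for "near limit" is an assumption.
classify_bottleneck() {
  local cpu_pct="$1"
  local ram_pct="$2"
  local threshold=90
  if [ "$cpu_pct" -ge "$threshold" ]; then
    echo "CPU-bound: scale up flavor or optimize code"
  elif [ "$ram_pct" -ge "$threshold" ]; then
    echo "Memory-bound: scale up flavor or fix memory leak"
  else
    echo "External bottleneck: check logs for slow external calls"
  fi
}

classify_bottleneck 95 40
classify_bottleneck 20 30
```

When both values are near the limit, this sketch reports CPU first; checking both is still worthwhile in that case.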
Use server-side filtering when possible, and client-side techniques for finer control:
# Server-side: keyword search via --query
bash .claude/skills/agentbase/scripts/runtime.sh logs $RUNTIME_ID --from 0 --limit 100 --query "error"
# Server-side: time range filter
bash .claude/skills/agentbase/scripts/runtime.sh logs $RUNTIME_ID \
--from 0 --limit 100 \
--from-time "2026-03-13T00:00:00Z" --to-time "2026-03-13T12:00:00Z"
# Client-side: filter fetched results locally
LOGS=$(bash .claude/skills/agentbase/scripts/runtime.sh logs $RUNTIME_ID --from 0 --limit 100)
echo "$LOGS" | jq -r '.logs[].content' | grep -i "error\|traceback\|exception\|failed"
# Client-side: show only the last N lines
echo "$LOGS" | jq -r '.logs[].content' | tail -n 20
# Client-side: count errors
echo "$LOGS" | jq -r '.logs[].content' | grep -ci "error"
| Error | Cause | Fix |
|---|---|---|
| Agent not responding | Runtime crashed or not started | Check runtime status (/agentbase-deploy runtime get), then check runtime logs for crash messages |
| 502/503 errors on endpoint | Container startup failure | Check endpoint logs for startup failures, verify health endpoint returns 200 |
| High latency | Resource saturation | Check metrics for CPU/RAM saturation, consider scaling up flavor or replicas |
| OOM kills | Memory spikes exceeding limit | Check metrics for memory spikes, increase flavor size |
| Image pull errors | Wrong URL or missing credentials | Verify imageUrl and registry credentials in runtime config |
| Container crash loop | Code error or missing dependencies | Check runtime logs for Python tracebacks or missing dependencies |
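Checking metrics for saturation, as several rows above suggest, is easier once raw values are converted: memoryBytesUsage is reported in bytes, and a percentage needs the flavor's limit. The helpers below are a hypothetical sketch using awk; they use binary units (1 MiB = 1048576 bytes), and the limit passed to usage_pct is assumed to be in the same unit as the usage value.

```shell
# Hypothetical helpers for reading metrics values.

# Convert a memoryBytesUsage value to MiB with one decimal place.
bytes_to_mib() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1048576 }'
}

# Percentage of `used` against `limit` (same unit), rounded to an integer.
usage_pct() {
  awk -v u="$1" -v l="$2" 'BEGIN { printf "%.0f\n", u * 100 / l }'
}

bytes_to_mib 536870912          # 512 MiB of memory in use
usage_pct 536870912 1073741824  # usage against a 1 GiB limit
```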
Show a unified dashboard of all AgentBase resources across services.
Run bash .claude/skills/agentbase/scripts/discovery.sh. A failed call is reported as Could not fetch (error details) instead of crashing.
Format the results into a readable dashboard. Example:
AgentBase Status Dashboard
==========================
IAM: Configured (client_id: abc...xyz)
Agent Identities (2):
  my-agent - "My first agent"
  test-bot - "Testing bot"
Auth Providers:
  API Keys (1): openai-prod (ACTIVE)
  Delegated (0): none
  OAuth2 (0): none
Runtimes (1):
  my-agent-rt (ACTIVE, v3, 1x1-general)
    DEFAULT: https://...
Memory (1):
  my-memory (2 strategies, 30d expiry)
AI Platform:
  API Keys (1): my-key (ACTIVE)
Container Registry:
  Repos (1): my-repo (private)
Endpoint details for each runtime come from bash .claude/skills/agentbase/scripts/runtime.sh endpoints list $RUNTIME_ID. (Note: the runtime list DTO contains id, name, description, status, statusReason, createdAt, updatedAt; for flavor/version details, use bash .claude/skills/agentbase/scripts/runtime.sh versions $RUNTIME_ID, which returns version, imageUrl, flavorId, autoscaling per version.)
If any individual API call fails, display that section with the error instead of failing the whole dashboard:
Runtimes:
Could not fetch (401 Unauthorized - token may be expired)
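Per-section failure handling of this kind can be sketched as a wrapper that runs one fetch and prints either its output or the error. The helper render_section is hypothetical, and the commands passed to it below are stand-ins rather than real AgentBase calls.

```shell
# Hypothetical helper: run one dashboard fetch and print either its output
# or a "Could not fetch" line, so one failure does not abort the rest.
render_section() {
  local title="$1"
  shift
  local out
  echo "$title:"
  # Capture stdout and stderr; the exit status decides which branch runs.
  if out=$("$@" 2>&1); then
    echo "$out"
  else
    echo "Could not fetch ($out)"
  fi
}

# Stand-in commands; a real dashboard would call the helper scripts instead.
render_section "Runtimes" sh -c 'echo "401 Unauthorized" >&2; exit 1'
render_section "Memory" echo "my-memory (2 strategies, 30d expiry)"
```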
If the user passes --json, output the raw JSON responses from all APIs as a single JSON object instead of the formatted dashboard:
{
  "agentIdentities": { ... },
  "apiKeyProviders": { ... },
  "delegatedApiKeyProviders": { ... },
  "oauth2Providers": { ... },
  "runtimes": { ... },
  "memories": { ... },
  "aipApiKeys": { ... },
  "crRepository": { ... }
}
Different services use different pagination:
For the status dashboard, fetch the first page of each service with a reasonable size (e.g., size=10). If a service has more items than displayed, show the total count and offer to show more (e.g., "Showing 10 of 25 runtimes. Want to see more?").
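The "showing N of M" prompt only needs the displayed count and the service's total. A minimal sketch, with truncation_notice as a hypothetical helper name:

```shell
# Hypothetical helper: print a truncation notice when a service has more
# items than the dashboard displays; print nothing otherwise.
truncation_notice() {
  local shown="$1"
  local total="$2"
  local label="$3"
  if [ "$total" -gt "$shown" ]; then
    echo "Showing $shown of $total $label. Want to see more?"
  fi
}

truncation_notice 10 25 runtimes
```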
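Determine the requested action (runtime-logs, endpoint-logs, metrics, events, traces, dashboard). For logs, metrics, events, and traces:
a. If a runtime ID is needed, list runtimes (bash .claude/skills/agentbase/scripts/runtime.sh list) and ask the user to pick one.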
b. If an endpoint ID is needed, list endpoints for the runtime (bash .claude/skills/agentbase/scripts/runtime.sh endpoints list $RUNTIME_ID) and ask the user to pick one.
c. For logs, default to --from 0 --limit 100 to fetch the most recent entries. Use --from to paginate if more logs are needed (max --from: 5000, max --limit: 500). Use --query for keyword filtering and --from-time/--to-time for time range filtering when the user specifies these.
d. Present log output in a readable format, highlighting errors and warnings. Each log entry has timestamp and content fields.
e. For metrics, display CPU (cpuCoresUsage) and RAM (memoryBytesUsage, convert values to MB/GB) as time-series data points. Use --from-time/--to-time for historical ranges. To show usage as percentages, fetch available flavors via bash .claude/skills/agentbase/scripts/runtime.sh flavors (returns id, name, cpu, ram for each flavor), then match the runtime's flavorId to get CPU/RAM limits.
f. For events, present each event as lastTimestamp - message, newest first. Match messages against the "Common event signatures" table to suggest a next step. Use this especially when an endpoint is stuck out of ACTIVE and logs are empty.
g. For traces, pass only --param keys the user supplies or that you sourced from the backend's docs/console. Return the backend's JSON response and parse per its schema.
For the dashboard:
a. Check whether the user passed the --json flag.
b. Run bash .claude/skills/agentbase/scripts/discovery.sh (or bash .claude/skills/agentbase/scripts/discovery.sh json for raw JSON).
c. Format the results as a dashboard (or raw JSON if --json).
d. Display the dashboard to the user.
e. After displaying, offer to drill into any section or show more items if pagination was truncated.