Configures TrueFoundry AI Gateway for unified OpenAI-compatible LLM access, covering PAT/VAT auth, model routing, rate limiting, and budget controls.
npx claudepluginhub truefoundry/tfy-gateway-skills --plugin truefoundry-gateway
> Routing note: For ambiguous user intents, use the shared clarification templates in [references/intent-clarification.md](references/intent-clarification.md).
Use TrueFoundry's AI Gateway to access 1000+ LLMs through a unified OpenAI-compatible API with rate limiting, budget controls, load balancing, routing, and observability.
Use this skill to access LLMs through TrueFoundry's unified OpenAI-compatible gateway, configure auth tokens (PAT/VAT), and set up rate limiting, budget controls, or load balancing across providers.
The AI Gateway sits between your application and LLM providers:
Your App → AI Gateway → OpenAI / Anthropic / Azure / Self-hosted vLLM / etc.
↑
Unified API + Auth + Rate Limiting + Routing + Logging
Key benefits:
- One OpenAI-compatible API across 1000+ models and 25+ providers
- Centralized authentication with PATs and VATs
- Rate limiting and budget controls per user, team, or model
- Load balancing, routing, and automatic fallback across providers
- Request logging and usage analytics
The gateway base URL is your TrueFoundry platform URL + /api/llm:
{TFY_BASE_URL}/api/llm
Example: https://your-org.truefoundry.cloud/api/llm
For development and individual use: a Personal Access Token (PAT), which inherits your user's permissions.
For production applications (recommended): a Virtual Access Token (VAT).
VATs are recommended for production because their access is limited to explicitly selected models rather than inheriting an individual user's full permissions.
from openai import OpenAI
client = OpenAI(
api_key="<your-PAT-or-VAT>",
base_url="https://<your-truefoundry-url>/api/llm",
)
# Chat completion
response = client.chat.completions.create(
model="openai/gpt-4o", # or any configured model name
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"},
],
max_tokens=200,
)
print(response.choices[0].message.content)
stream = client.chat.completions.create(
model="openai/gpt-4o",
messages=[{"role": "user", "content": "Write a haiku about AI"}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
curl "${TFY_BASE_URL}/api/llm/chat/completions" \
-H "Authorization: Bearer ${TFY_API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 200
}'
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "<your-PAT-or-VAT>",
baseURL: "https://<your-truefoundry-url>/api/llm",
});
const response = await client.chat.completions.create({
model: "openai/gpt-4o",
messages: [{ role: "user", content: "Hello!" }],
});
Set these environment variables to use the gateway with any OpenAI-compatible library:
export OPENAI_BASE_URL="${TFY_BASE_URL}/api/llm"
export OPENAI_API_KEY="<your-PAT-or-VAT>"
Then any code using openai.OpenAI() without explicit parameters will use the gateway automatically.
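As a quick check, here is a minimal sketch relying on those variables; the OpenAI Python SDK picks up OPENAI_API_KEY and OPENAI_BASE_URL automatically when the client is constructed without arguments (the model name is whichever one is configured in your gateway):

```python
from openai import OpenAI

# No arguments: the SDK reads OPENAI_API_KEY and OPENAI_BASE_URL
# from the environment, so requests go through the gateway.
client = OpenAI()

response = client.chat.completions.create(
    model="openai/gpt-4o",  # any model name configured in your gateway
    messages=[{"role": "user", "content": "Ping"}],
)
print(response.choices[0].message.content)
```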
| API | Endpoint | Description |
|---|---|---|
| Chat Completions | /chat/completions | Chat with any model (streaming + non-streaming) |
| Completions | /completions | Legacy text completions |
| Embeddings | /embeddings | Text embeddings (text + list inputs) |
| Image Generation | /images/generations | Generate images |
| Image Editing | /images/edits | Edit images |
| Audio Transcription | /audio/transcriptions | Speech-to-text |
| Audio Translation | /audio/translations | Translate audio |
| Text-to-Speech | /audio/speech | Generate speech |
| Reranking | /rerank | Rerank documents |
| Batch Processing | /batches | Batch predictions |
| Moderations | /moderations | Content safety |
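All of these share the same base URL and auth. For example, a text-embedding request through the same client looks like this; the model name openai/text-embedding-3-small is only an illustration and must match an embedding model actually configured in your gateway:

```python
from openai import OpenAI

client = OpenAI(
    api_key="<your-PAT-or-VAT>",
    base_url="https://<your-truefoundry-url>/api/llm",
)

# /embeddings accepts a single string or a list of strings.
# "openai/text-embedding-3-small" is a placeholder model name;
# check your gateway's model catalog for the exact name.
emb = client.embeddings.create(
    model="openai/text-embedding-3-small",
    input=["TrueFoundry AI Gateway", "unified LLM access"],
)
print(len(emb.data), len(emb.data[0].embedding))
```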
The gateway supports 25+ providers including:
| Provider | Example Model Names |
|---|---|
| OpenAI | openai/gpt-4o, openai/gpt-4o-mini |
| Anthropic | anthropic/claude-sonnet-4-5-20250929 |
| Google Vertex | google/gemini-2.0-flash |
| AWS Bedrock | bedrock/anthropic.claude-3-5-sonnet |
| Azure OpenAI | azure/gpt-4o |
| Mistral | mistral/mistral-large-latest |
| Groq | groq/llama-3.1-70b-versatile |
| Cohere | cohere/command-r-plus |
| Together AI | together/meta-llama/Meta-Llama-3.1-70B |
| Self-hosted (vLLM/TGI) | my-custom-model-name |
Model names depend on how they're configured in your gateway. Check the TrueFoundry dashboard → AI Gateway → Models for exact names.
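If your gateway also exposes the standard OpenAI-compatible model listing endpoint (it is not in the endpoint table above, so treat this as an assumption and verify it for your deployment), the configured names can be listed programmatically:

```python
from openai import OpenAI

client = OpenAI(
    api_key="<your-PAT-or-VAT>",
    base_url="https://<your-truefoundry-url>/api/llm",
)

# Assumes the gateway implements GET /models; if it does not,
# use the dashboard view described above instead.
for model in client.models.list():
    print(model.id)
```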
Provider accounts and models are currently configured through the TrueFoundry dashboard UI.
After deploying a self-hosted model, register its internal cluster endpoint:
http://{model-name}.{namespace}.svc.cluster.local:8000
Security: Only register model endpoints that you control. External or untrusted model endpoints can return manipulated responses. Use internal cluster DNS (svc.cluster.local) for self-hosted models. Verify provider API credentials are stored securely in TrueFoundry secrets, not hardcoded.
For externally hosted APIs that are OpenAI-compatible (e.g. NVIDIA Cloud APIs, custom inference endpoints), use type: provider-account/self-hosted-model with auth_data:
# gateway.yaml — External hosted API (e.g. NVIDIA Cloud)
- name: nvidia-external
type: provider-account/self-hosted-model
integrations:
- name: nemotron-nano
type: integration/model/self-hosted-model
hosted_model_name: nvidia/nemotron-3-nano-30b-a3b
url: "https://integrate.api.nvidia.com/v1"
model_server: "openai-compatible"
model_types: ["chat"]
auth_data:
type: bearer-auth
bearer_token: "tfy-secret://<tenant>:<group>:<key>"
And in a virtual model routing target, reference it as "<provider-account-name>/<integration-name>":
targets:
- model: "nvidia-external/nemotron-nano" # "<provider-account-name>/<integration-name>"
Apply with:
tfy apply -f gateway.yaml
WARNING: provider-account/nvidia-nim does not exist in the schema; do not use it. Use provider-account/self-hosted-model with auth_data for all external OpenAI-compatible APIs (as shown above).
Schema source of truth: For authoritative field names and types, read
servicefoundry-server/src/autogen/models.ts in the platform repo. Do not guess field names from documentation alone.
Gateway YAML is applied directly with tfy apply — no service build or Docker image involved:
# Preview changes
tfy apply -f gateway.yaml --dry-run --show-diff
# Apply
tfy apply -f gateway.yaml
Do NOT delegate gateway applies to a deployment skill. Gateway configs (type: gateway-*, type: provider-account/*) are applied inline with tfy apply.
Test after apply:
# Quick smoke test via curl
curl "${TFY_BASE_URL}/api/llm/chat/completions" \
-H "Authorization: Bearer ${TFY_API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia-external/nemotron-nano",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'
Or via Python:
import os
from openai import OpenAI

# TFY_BASE_URL is read from the environment, matching the curl example above.
client = OpenAI(api_key="<PAT-or-VAT>", base_url=f"{os.environ['TFY_BASE_URL']}/api/llm")
resp = client.chat.completions.create(
model="nvidia-external/nemotron-nano",
messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
Note: One-off gateway config applies should use
tfy apply directly. For CI/CD pipelines, integrate tfy apply into your existing automation.
Virtual models route requests across multiple model instances using a gateway-load-balancing-config manifest. Targets reference real catalog models as "<provider-account-name>/<integration-name>".
name: chat-routing
type: gateway-load-balancing-config
rules:
- id: weighted-chat
type: weight-based-routing
when:
subjects: ["*"]
models: ["openai/gpt-4o"]
load_balance_targets:
- target: "openai-main/gpt-4o"
weight: 70
fallback_candidate: true
retry_config:
delay: 100
attempts: 1
on_status_codes: ["429", "500", "502", "503"]
- target: "azure-backup/gpt-4o"
weight: 30
fallback_candidate: true
retry_config:
delay: 100
attempts: 1
on_status_codes: ["429", "500", "502", "503"]
Automatically routes to the lowest-latency model (measured as time per output token over the last 20 minutes):
rules:
- id: latency-chat
type: latency-based-routing
when:
subjects: ["*"]
models: ["openai/gpt-4o"]
load_balance_targets:
- target: "openai-main/gpt-4o"
fallback_candidate: true
- target: "azure-backup/gpt-4o"
fallback_candidate: true
Routes to the highest-priority healthy model, with an SLA cutoff that automatically marks a target unhealthy when its time per output token (TPOT) exceeds the threshold:
rules:
- id: priority-chat
type: priority-based-routing
when:
subjects: ["team:premium"]
models: ["*"]
load_balance_targets:
- target: "openai-main/gpt-4o"
priority: 0
sla_cutoff:
time_per_output_token_ms: 50
fallback_candidate: true
- target: "azure-backup/gpt-4o"
priority: 1
fallback_candidate: true
Pin users to the same target for a duration:
rules:
- id: sticky-chat
type: weight-based-routing
sticky_routing:
ttl_seconds: 3600
session_identifiers:
- key: x-user-id
source: headers
load_balance_targets:
- target: "openai-main/gpt-4o"
weight: 50
- target: "azure-backup/gpt-4o"
weight: 50
load_balance_targets:
- target: "openai-main/gpt-4o"
weight: 80
headers_override:
set:
x-region: us-east-1
remove:
- x-internal-debug
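On the client side, a request participates in the sticky-routing rule above by sending the configured session identifier header. A sketch assuming the x-user-id header from that rule and the same gateway URL and token placeholders used earlier:

```python
from openai import OpenAI

client = OpenAI(
    api_key="<your-PAT-or-VAT>",
    base_url="https://<your-truefoundry-url>/api/llm",
)

# Requests that carry the same x-user-id value are routed to the
# same target for ttl_seconds (3600 seconds in the rule above).
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={"x-user-id": "user-1234"},
)
print(response.choices[0].message.content)
```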
Fallback is configured per-target inside load_balance_targets:
- fallback_status_codes: defaults to ["401", "403", "404", "429", "500", "502", "503"]
- fallback_candidate: true marks a target as eligible for failover
- retry_config.on_status_codes controls which errors trigger retries

tfy apply -f gateway-load-balancing-config.yaml --dry-run --show-diff
tfy apply -f gateway-load-balancing-config.yaml
Note: Targets must be real catalog models, not nested virtual models.
Configure rate limits per user, team, model, or custom metadata using a gateway-rate-limiting-config manifest. Only the first matching rule applies — place specific rules before generic ones.
name: rate-limits
type: gateway-rate-limiting-config
rules:
- id: "team-rpm-limit"
when:
subjects: ["team:backend"]
models: ["openai-main/gpt-4o"]
limit_to: 20000
unit: tokens_per_minute
- id: "user-daily-limit"
when:
subjects: ["user:bob@example.com"]
models: ["openai-main/gpt-4o"]
limit_to: 1000
unit: requests_per_day
- id: "per-project-hourly"
when: {}
limit_to: 50000
unit: tokens_per_hour
rate_limit_applies_per: ["metadata.project_id"]
- id: "global-fallback"
when: {}
limit_to: 500
unit: requests_per_minute
rate_limit_applies_per: ["user"]
Units: requests_per_minute, requests_per_hour, requests_per_day, tokens_per_minute, tokens_per_hour, tokens_per_day
rate_limit_applies_per: Creates separate limits per entity (max 2 values). Options: user, model, virtualaccount, metadata.<key>.
tfy apply -f gateway-rate-limiting-config.yaml
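On the client side, requests blocked by these rules come back as HTTP 429. A minimal retry sketch with the OpenAI SDK, honoring the Retry-After header mentioned in the error-handling notes below (whether every 429 includes that header is worth verifying for your gateway):

```python
import time

import openai
from openai import OpenAI

client = OpenAI(
    api_key="<your-PAT-or-VAT>",
    base_url="https://<your-truefoundry-url>/api/llm",
)

def chat_with_retry(messages, attempts=3):
    """Retry chat completions that hit a gateway rate-limit rule."""
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(
                model="openai/gpt-4o", messages=messages
            )
        except openai.RateLimitError as err:
            # Prefer the gateway's Retry-After hint; fall back to a
            # simple exponential backoff if the header is absent.
            wait = float(err.response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
    raise RuntimeError("rate limit still exceeded after retries")

print(chat_with_retry([{"role": "user", "content": "Hello!"}]).choices[0].message.content)
```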
Enforce cost limits per user, team, or metadata using a gateway-budget-config manifest. Costs are tracked automatically based on model pricing.
name: budget-controls
type: gateway-budget-config
rules:
- id: "team-monthly-budget"
when:
subjects: ["team:engineering"]
limit_to: 5000
unit: cost_per_month
budget_applies_per: ["team"]
alerts:
thresholds: [75, 90, 100]
notification_target:
- type: email
notification_channel: "budget-alerts"
to_emails: ["lead@example.com"]
- id: "user-daily-budget"
when: {}
limit_to: 100
unit: cost_per_day
budget_applies_per: ["user"]
- id: "project-daily-budget"
when:
metadata:
environment: "production"
limit_to: 200
unit: cost_per_day
budget_applies_per: ["metadata.project_id"]
Units: cost_per_day (resets UTC midnight), cost_per_week (resets Monday), cost_per_month (resets 1st)
budget_applies_per: Same options as rate limiting — user, model, team, virtualaccount, metadata.<key>.
Alerts: Configure threshold percentages with email, Slack webhook, or Slack bot notifications.
tfy apply -f gateway-budget-config.yaml
All gateway requests are logged with:
Tag requests with custom metadata for tracking:
response = client.chat.completions.create(
model="openai/gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={
"X-TFY-LOGGING-CONFIG": '{"project": "my-app", "environment": "production"}'
},
)
View usage analytics in the TrueFoundry dashboard:
Export traces to your observability stack:
For content filtering, PII detection, prompt injection prevention, and custom safety rules, use the guardrails skill. It configures guardrail providers and rules that apply to this gateway's traffic.
If a user has already deployed a tool server and wants to attach it to the MCP gateway, use the mcp-servers skill.
The gateway works with popular AI frameworks:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="openai/gpt-4o",
api_key="<your-PAT-or-VAT>",
base_url="https://<your-truefoundry-url>/api/llm",
)
from llama_index.llms.openai import OpenAI
llm = OpenAI(
model="openai/gpt-4o",
api_key="<your-PAT-or-VAT>",
api_base="https://<your-truefoundry-url>/api/llm",
)
Configure the gateway as a custom API endpoint in your coding assistant settings:
{TFY_BASE_URL}/api/llm
When the user asks about gateway configuration:
AI Gateway:
Endpoint: https://your-org.truefoundry.cloud/api/llm
Auth: Personal Access Token (PAT) or Virtual Access Token (VAT)
Available Models (check dashboard for current list):
| Model Name | Provider | Type |
|-------------------|-------------|-------------|
| openai/gpt-4o | OpenAI | Cloud |
| my-gemma-2b | Self-hosted | vLLM (T4) |
| anthropic/claude | Anthropic | Cloud |
Usage:
export OPENAI_BASE_URL="https://your-org.truefoundry.cloud/api/llm"
export OPENAI_API_KEY="your-token"
# Then use any OpenAI-compatible SDK
Gateway authentication failed. Check:
- API key (PAT or VAT) is valid and not expired
- Using correct header: Authorization: Bearer <token>
Model access denied. Your token may not have access to this model.
- PATs inherit user permissions
- VATs only have access to explicitly selected models
- Check with your admin to grant model access
Rate limit exceeded. Options:
- Wait and retry (check Retry-After header)
- Request higher limits from admin
- Use load balancing to distribute across providers
Upstream provider error. The gateway will automatically:
- Retry on configured status codes
- Fallback to alternate models if routing is configured
If persistent, check provider status page or self-hosted model health.
Model name not found in gateway. Check:
- Exact model name in TrueFoundry dashboard → AI Gateway → Models
- Provider account is active and model is enabled
- Your token has access to this model
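These cases map onto the OpenAI SDK's exception types, so a client can tell them apart programmatically. A sketch, assuming the gateway surfaces standard HTTP status codes for these failures:

```python
import openai
from openai import OpenAI

client = OpenAI(
    api_key="<your-PAT-or-VAT>",
    base_url="https://<your-truefoundry-url>/api/llm",
)

try:
    client.chat.completions.create(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
except openai.AuthenticationError:
    print("401: check that the PAT/VAT is valid and sent as a Bearer token")
except openai.PermissionDeniedError:
    print("403: this token does not have access to the requested model")
except openai.NotFoundError:
    print("404: model name not found; verify it in the dashboard")
except openai.RateLimitError:
    print("429: rate limit exceeded; wait, retry, or request higher limits")
except openai.InternalServerError:
    print("5xx: upstream provider error; the gateway retries/falls back if configured")
```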