Skill

freellmapi-proxy

Deploys self-hosted OpenAI-compatible proxy aggregating 14 free-tier LLM providers (Groq, Gemini, Cerebras) with automatic failover, per-key rate tracking, sticky sessions, and React dashboard for key management.

Node

React

Install

npx claudepluginhub joshuarweaver/cascade-ai-ml-agents-misc-1 --plugin aradotso-trending-skills-37

Tool Access

This skill uses the workspace's default tool permissions.

Preview

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

SKILL.md

Similar Skills

cache-components

Guides Next.js Cache Components and Partial Prerendering (PPR) with cacheComponents enabled. Implements 'use cache', cacheLife(), cacheTag(), revalidateTag(), static/dynamic optimization, and cache debugging.

cache-components

139.2k

mcp-builder

9 files

Guides building MCP servers enabling LLMs to interact with external services via tools. Covers best practices, TypeScript/Node (MCP SDK), Python (FastMCP).

anthropics-skills-13

124.2k

canvas-design

20 files

Generates original PNG/PDF visual art via design philosophy manifestos for posters, graphics, and static designs on user request.

anthropics-skills-13

124.2k

Stats

Stars36

Forks8

Last CommitApr 25, 2026

Actions

View Source View Plugin View on GitHub View README

FreeLLMAPI Proxy

Skill by ara.so — Daily 2026 Skills collection.

FreeLLMAPI is a self-hosted OpenAI-compatible proxy that aggregates free-tier API keys from ~14 AI providers (Google, Groq, Cerebras, SambaNova, NVIDIA, Mistral, OpenRouter, GitHub Models, Hugging Face, Cohere, Cloudflare, Zhipu, Moonshot, MiniMax) behind a single /v1/chat/completions endpoint. It handles automatic failover on 429/5xx, per-key rate tracking, sticky sessions for multi-turn conversations, and AES-256-GCM encrypted key storage.

Installation

Prerequisites: Node.js 20+, npm.

git clone https://github.com/tashfeenahmed/freellmapi.git
cd freellmapi
npm install

# Generate encryption key and set up environment
cp .env.example .env
echo "ENCRYPTION_KEY=$(node -e "console.log(require('crypto').randomBytes(32).toString('hex'))")" >> .env

# Development (server + Vite dashboard on :5173)
npm run dev

# Production build
npm run build
node server/dist/index.js   # serves API + dashboard on :3001

Environment Variables

# .env
ENCRYPTION_KEY=<64-char hex string>   # Required — AES-256 key for provider key storage
PORT=3001                              # Optional — defaults to 3001
NODE_ENV=production                    # Optional

Never commit .env. The ENCRYPTION_KEY protects all stored provider API keys.

Key Commands

npm run dev        # Start Express server + Vite dashboard in watch mode
npm run build      # Compile TypeScript server + build React dashboard
npm run lint       # ESLint across server/ and client/
npm run test       # Run test suite

Provider Setup

Open the dashboard at http://localhost:5173 (dev) or http://localhost:3001 (prod).
Navigate to Keys page.
Add raw API keys for each provider you have. Keys are encrypted before SQLite storage.
Navigate to Fallback Chain to reorder provider priority.
Copy your unified freellmapi-… bearer token from the Keys page header.

Supported providers and what to put in:

Provider	Where to get a free key
Google Gemini	https://ai.google.dev
Groq	https://groq.com
Cerebras	https://cerebras.ai
SambaNova	https://cloud.sambanova.ai
NVIDIA NIM	https://build.nvidia.com
Mistral	https://mistral.ai
OpenRouter	https://openrouter.ai
GitHub Models	https://github.com/marketplace/models
Hugging Face	https://huggingface.co
Cohere	https://cohere.com
Cloudflare Workers AI	https://developers.cloudflare.com/workers-ai
Zhipu	https://bigmodel.cn
Moonshot	https://platform.moonshot.cn
MiniMax	https://platform.minimax.io

Using the API

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",  # from dashboard Keys page
)

# Let the router pick the best available provider
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain async/await in Python in two sentences."}],
)

print(response.choices[0].message.content)
# Which provider actually served this request:
print("Routed via:", response.headers.get("x-routed-via"))

Request a specific model

# Request a specific model — router finds a provider that has it
response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Write a haiku about SQLite."}],
)

Streaming

stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "List 5 TypeScript best practices."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

curl

# Non-streaming
curl http://localhost:3001/v1/chat/completions \
  -H "Authorization: Bearer $FREELLMAPI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Streaming
curl http://localhost:3001/v1/chat/completions \
  -H "Authorization: Bearer $FREELLMAPI_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Count to 5 slowly"}],
    "stream": true
  }'

# List available models
curl http://localhost:3001/v1/models \
  -H "Authorization: Bearer $FREELLMAPI_KEY"

TypeScript / Node.js

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3001/v1",
  apiKey: process.env.FREELLMAPI_KEY,
});

async function chat(userMessage: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "auto",
    messages: [{ role: "user", content: userMessage }],
  });
  return response.choices[0].message.content ?? "";
}

// Streaming version
async function streamChat(userMessage: string): Promise<void> {
  const stream = await client.chat.completions.create({
    model: "auto",
    messages: [{ role: "user", content: userMessage }],
    stream: true,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) process.stdout.write(delta);
  }
  console.log();
}

Tool Calling

Tool calling works across all supported providers. OpenAI-compatible providers receive requests verbatim; Gemini requests are automatically translated to functionDeclarations/functionResponse format and back.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# Step 1: Model requests a tool call
first = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the weather in Karachi?"}],
    tools=tools,
    tool_choice="required",
)

call = first.choices[0].message.tool_calls[0]
print(f"Tool requested: {call.function.name}({call.function.arguments})")

# Step 2: Execute the tool locally, feed result back
final = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "user", "content": "What's the weather in Karachi?"},
        first.choices[0].message,  # assistant message with tool_calls
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": '{"temp_c": 32, "condition": "sunny"}',
        },
    ],
    tools=tools,
)

print(final.choices[0].message.content)

Streaming tool calls

stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the weather in Karachi?"}],
    tools=tools,
    tool_choice="required",
    stream=True,
)

tool_call_chunks = []
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        tool_call_chunks.extend(delta.tool_calls)
    if chunk.choices[0].finish_reason == "tool_calls":
        print("Tool call complete — assemble chunks and execute")

Multi-turn Conversations (Sticky Sessions)

The proxy keeps multi-turn conversations on the same model for 30 minutes to avoid hallucination spikes from mid-conversation model switches. Pass a consistent session_id in requests if the provider supports it, or rely on the proxy's automatic session tracking.

messages = [{"role": "system", "content": "You are a helpful coding assistant."}]

# Turn 1
messages.append({"role": "user", "content": "Write a Python function to flatten a nested list."})
resp1 = client.chat.completions.create(model="auto", messages=messages)
assistant_msg = resp1.choices[0].message
messages.append({"role": "assistant", "content": assistant_msg.content})
print(assistant_msg.content)

# Turn 2 — sticky session keeps same provider
messages.append({"role": "user", "content": "Now add type hints to that function."})
resp2 = client.chat.completions.create(model="auto", messages=messages)
print(resp2.choices[0].message.content)

LangChain Integration

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import os

llm = ChatOpenAI(
    model="auto",
    openai_api_base="http://localhost:3001/v1",
    openai_api_key=os.environ["FREELLMAPI_KEY"],
    streaming=True,
)

response = llm.invoke([HumanMessage(content="Summarise the CAP theorem in one paragraph.")])
print(response.content)

Response Headers

Every response includes diagnostic headers:

Header	Description
`X-Routed-Via`	`<platform>/<model>` — which provider served the request
`X-Fallback-Attempts`	Number of providers tried before success (only present if > 0)

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "hi"}],
)
# Headers are on the raw httpx response:
raw = response._response  # openai SDK exposes underlying httpx response
print(raw.headers.get("x-routed-via"))        # e.g. "groq/llama-4-scout"
print(raw.headers.get("x-fallback-attempts")) # e.g. "2"

How the Router Works

Request arrives
      │
      ▼
Router scans fallback chain (priority order)
      │
      ├─ For each model: is there a healthy key under all rate caps?
      │     RPM / RPD / TPM / TPD tracked per (platform, model, key)
      │
      ├─ Picks first viable (platform, model, key) tuple
      │
      ├─ Decrypts key in-memory, calls provider SDK
      │
      └─ On 429 / 5xx / timeout:
            Put key on cooldown → retry next model (up to 20 attempts)

Rate limit tracking: The router tracks RPM, RPD, TPM, and TPD counters per (platform, model, key) triple. When a key hits a cap it's cooled down automatically and the next viable key/model is tried.

Health checks: Background probes classify each key as healthy, rate_limited, invalid, or error. The router skips non-healthy keys without making a live request.

Dashboard Pages

Page	Purpose
Keys	Add/remove provider credentials, view health status, copy unified API key
Fallback Chain	Drag to reorder provider priority
Playground	Interactive chat showing which provider served each message + latency
Analytics	Request volume, success rate, token counts, latency, per-provider breakdown (24h/7d/30d)

Production Deployment (Raspberry Pi / Linux)

# Build
npm run build

# Install PM2
npm install -g pm2

# Start
pm2 start server/dist/index.js --name freellmapi
pm2 save
pm2 startup

# nginx reverse proxy (optional)
# /etc/nginx/sites-available/freellmapi
server {
    listen 80;
    server_name your.domain.com;
    location / {
        proxy_pass http://localhost:3001;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_buffering off;          # Required for SSE streaming
        proxy_cache_control no-cache; # Required for SSE streaming
    }
}

Memory footprint: ~40 MB RSS at idle on a Pi 4.

Adding a New Provider

Create a new adapter in server/src/providers/:

// server/src/providers/myprovider.ts
import type { ProviderAdapter, ChatRequest, ChatResponse } from "../types";

export const myProviderAdapter: ProviderAdapter = {
  name: "myprovider",
  models: ["my-model-v1", "my-model-v2"],

  async chat(request: ChatRequest, apiKey: string): Promise<ChatResponse> {
    // Call provider API, return OpenAI-shaped response
    const res = await fetch("https://api.myprovider.com/v1/chat", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: request.model,
        messages: request.messages,
      }),
    });
    const data = await res.json();
    return {
      id: data.id,
      object: "chat.completion",
      choices: [{ message: data.choices[0].message, finish_reason: "stop", index: 0 }],
      usage: data.usage,
    };
  },

  async *stream(request: ChatRequest, apiKey: string): AsyncGenerator<string> {
    // Yield SSE chunks
  },
};

Troubleshooting

"No healthy keys available"

Check the Keys dashboard — all keys may be rate-limited or invalid.
Wait for cooldown (usually a few minutes for RPM limits) or add more keys.
Verify the key is valid by testing it directly against the provider's API.

Requests always fall back to the same provider

Check the Fallback Chain order in the dashboard.
Ensure keys for higher-priority providers are marked healthy.

Streaming stops mid-response

If behind nginx, ensure proxy_buffering off is set.
Check provider-side token/minute caps — the stream may be cut by a mid-stream rate limit.

ENCRYPTION_KEY error on startup

Ensure ENCRYPTION_KEY in .env is exactly 64 hex characters (32 bytes).
Regenerate: node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"

Tool calls not working with a specific provider

Not all free-tier models support function calling. Check the provider's docs.
Try model="auto" — the router will pick a tool-capable model.
Gemini tool calls are auto-translated; others pass through as-is.

High latency on first request

Health checks run periodically in the background. The first request after startup may probe a few keys. Subsequent requests are faster.

Limitations

Text-only — no vision/multimodal inputs
No embeddings (/v1/embeddings)
No image generation (/v1/images/*)
No audio/speech (/v1/audio/*)
No legacy completions (/v1/completions)
No moderation (/v1/moderations)
n > 1 not supported (single completion per request)
Single-user by design — no per-user billing or multi-tenant auth
Personal/experimental use only — review each provider's ToS before production use