Embedding model configurations and cost calculators
Selects and configures embedding models for RAG pipelines, calculating costs for OpenAI, Cohere, and HuggingFace models. Provides setup scripts and optimization strategies for balancing performance, cost, and privacy requirements.
Install via the plugin marketplace:

```
/plugin marketplace add vanman2024/ai-dev-marketplace
/plugin install rag-pipeline@ai-dev-marketplace
```

This skill is limited to using the following tools:

Bundled files:

- README.md
- examples/batch-embedding-generation.py
- examples/embedding-cache.py
- scripts/calculate-embedding-costs.py
- scripts/setup-cohere-embeddings.sh
- scripts/setup-huggingface-embeddings.sh
- scripts/setup-openai-embeddings.sh
- templates/custom-embedding-model.py
- templates/huggingface-embedding-config.py
- templates/openai-embedding-config.py

Embedding model selection, configuration, and cost optimization for RAG pipelines.
OpenAI Embeddings:

- text-embedding-3-small - 1536 dims, $0.02/1M tokens, balanced performance
- text-embedding-3-large - 3072 dims, $0.13/1M tokens, highest quality
- text-embedding-ada-002 - 1536 dims, $0.10/1M tokens, legacy model

Cohere Embeddings:

- embed-english-v3.0 - 1024 dims, English-optimized retrieval
- embed-english-light-v3.0 - 384 dims, faster/cheaper
- embed-multilingual-v3.0 - 1024 dims, 100+ languages

Sentence Transformers:

- all-MiniLM-L6-v2 - 384 dims, 80MB, fast and efficient
- all-mpnet-base-v2 - 768 dims, 420MB, high quality
- multi-qa-mpnet-base-dot-v1 - 768 dims, optimized for Q&A
- paraphrase-multilingual-mpnet-base-v2 - 768 dims, 50+ languages

Specialized Models:

- BAAI/bge-small-en-v1.5 - 384 dims, SOTA small model
- BAAI/bge-base-en-v1.5 - 768 dims, excellent retrieval
- BAAI/bge-large-en-v1.5 - 1024 dims, top performance
- intfloat/e5-base-v2 - 768 dims, strong general purpose

Use the cost calculator script to estimate embedding costs:
```bash
# Calculate costs for different models and volumes
python scripts/calculate-embedding-costs.py \
  --documents 100000 \
  --avg-tokens 500 \
  --model text-embedding-3-small

# Compare multiple models
python scripts/calculate-embedding-costs.py \
  --documents 100000 \
  --avg-tokens 500 \
  --compare
```
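The arithmetic behind the calculator is simple: total tokens = documents × average tokens per document, billed at the model's per-million-token rate. A minimal sketch of that math (rates taken from the model list above; the function name is illustrative, not the script's actual API):

```python
# Illustrative cost math; prices per 1M tokens from the model list above.
PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}

def embedding_cost(documents: int, avg_tokens: int, model: str) -> float:
    """Estimated embedding cost in USD for a document collection."""
    total_tokens = documents * avg_tokens
    return total_tokens / 1_000_000 * PRICE_PER_1M[model]

# 100k docs at 500 tokens each = 50M tokens
print(f"${embedding_cost(100_000, 500, 'text-embedding-3-small'):.2f}")  # $1.00
```

At that volume the large model would cost $6.50, a 6.5× premium for the quality gain.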
```bash
bash scripts/setup-openai-embeddings.sh
```

Configures the OpenAI embedding client with API key management and retry logic.

```bash
bash scripts/setup-huggingface-embeddings.sh
```

Downloads and configures sentence-transformers models locally.

```bash
bash scripts/setup-cohere-embeddings.sh
```

Sets up the Cohere embedding client with API credentials.
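The retry logic these setup scripts wire in is typically exponential backoff with jitter on transient errors such as rate limits. A generic sketch of that pattern (delays and the helper name are placeholders; the scripts' actual implementation may differ):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> T:
    """Retry `fn` with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            # back off 1s, 2s, 4s, ... scaled by base_delay, with jitter
            time.sleep(base_delay * (2 ** attempt + random.random()))
    raise RuntimeError("unreachable")
```

In production you would catch only the provider's rate-limit/timeout exceptions rather than bare `Exception`.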
```python
# templates/openai-embedding-config.py
from openai import OpenAI

client = OpenAI(api_key="your-key")
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Your text here"],
)
embeddings = [item.embedding for item in response.data]
```
```python
# templates/huggingface-embedding-config.py
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Your text here"])
```
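Whichever provider produces the vectors, retrieval ranks documents by similarity between the query vector and document vectors, most commonly cosine similarity. A provider-agnostic sketch in pure Python (a real pipeline would vectorize this with NumPy):

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Note that some models (e.g. multi-qa-mpnet-base-dot-v1) are trained for dot-product rather than cosine scoring; match the metric to the model.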
```python
# templates/custom-embedding-model.py
# Wrapper for any embedding model with consistent interface
```
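The point of the wrapper template is to put every provider behind one interface so the rest of the pipeline never cares which model is in use. A minimal sketch of that idea (class and method names are illustrative, not the template's actual contents):

```python
from typing import Callable, List, Protocol

class Embedder(Protocol):
    """Common interface: a batch of texts in, a batch of vectors out."""
    def embed(self, texts: List[str]) -> List[List[float]]: ...

class CallableEmbedder:
    """Adapts any batch-embedding function to the common interface."""

    def __init__(self, fn: Callable[[List[str]], List[List[float]]]):
        self._fn = fn

    def embed(self, texts: List[str]) -> List[List[float]]:
        return self._fn(texts)

# Example: wrap a toy model that "embeds" by character count.
toy = CallableEmbedder(lambda texts: [[float(len(t))] for t in texts])
print(toy.embed(["hi", "hello"]))  # [[2.0], [5.0]]
```

The same adapter would wrap `model.encode` from sentence-transformers or a lambda around the OpenAI client, letting you swap providers without touching downstream code.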
Cost Optimization:

- Batch documents per request to cut per-call API overhead (see examples/batch-embedding-generation.py)
- Cache embeddings so unchanged text is never re-embedded (see examples/embedding-cache.py)
- Move high-volume workloads to free local models (sentence-transformers, BGE)

Performance Optimization:

- Prefer smaller-dimension models (384 dims) where quality allows; they embed and search faster
- Keep local models loaded in memory between batches instead of reloading per call
- Match the model to the task (e.g. multi-qa-mpnet-base-dot-v1 for Q&A retrieval)
| Model | Dimensions | Size | Speed | Quality | Cost |
|---|---|---|---|---|---|
| text-embedding-3-small | 1536 | API | Fast | Good | $0.02/1M |
| text-embedding-3-large | 3072 | API | Medium | Excellent | $0.13/1M |
| all-MiniLM-L6-v2 | 384 | 80MB | Very Fast | Good | Free |
| all-mpnet-base-v2 | 768 | 420MB | Fast | Excellent | Free |
| bge-base-en-v1.5 | 768 | 420MB | Fast | Excellent | Free |
| embed-english-v3.0 | 1024 | API | Fast | Excellent | $0.10/1M |
Batch Embedding Generation:

```python
# examples/batch-embedding-generation.py
# Process large document collections efficiently
```
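The core pattern in batch generation is chunking the collection and embedding one chunk per call, rather than one request per document. A sketch of that loop (the chunk size and the `embed_fn` parameter are placeholders, not the example file's actual code):

```python
from typing import Callable, Iterator, List

def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 96,  # many APIs cap the number of inputs per request
) -> List[List[float]]:
    """Embed a collection with one provider call per chunk."""
    vectors: List[List[float]] = []
    for chunk in batched(texts, batch_size):
        vectors.extend(embed_fn(chunk))
    return vectors
```

For 100k documents this turns 100,000 round trips into roughly a thousand, which is where most of the wall-clock savings come from.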
Embedding Cache:

```python
# examples/embedding-cache.py
# Cache embeddings to avoid redundant API calls
```
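A cache keys on a hash of the text so repeated or unchanged documents never hit the API twice. A minimal in-memory sketch (a real cache would persist to disk or a database; class and method names are illustrative):

```python
import hashlib
from typing import Callable, Dict, List

class EmbeddingCache:
    """Memoizes embeddings keyed by SHA-256 of the input text."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        misses = [t for t in texts if self._key(t) not in self._store]
        if misses:  # only uncached texts reach the provider
            for text, vec in zip(misses, self._embed_fn(misses)):
                self._store[self._key(text)] = vec
        return [self._store[self._key(t)] for t in texts]
```

On a re-ingestion run where most documents are unchanged, this reduces API spend to only the delta.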
Use OpenAI when:

- You want high quality with no infrastructure to manage
- Volume is moderate and the $0.02/1M rate of text-embedding-3-small fits the budget

Use Cohere when:

- You need broad multilingual coverage (embed-multilingual-v3.0 supports 100+ languages)
- You want a lighter, cheaper hosted option (embed-english-light-v3.0)

Use HuggingFace/Local when:

- Privacy requires documents to stay on your own hardware
- Volume is high and free local models eliminate per-token costs
- You can run sentence-transformers or BGE models on your own infrastructure