Generates multiple query variants and fuses results using RRF to improve recall for ambiguous or under-specified queries. Use when searches need better coverage through query expansion.
/plugin marketplace add juanre/llmemory
/plugin install llmemory@juanre-ai-tools

This skill inherits all available tools. When active, it can use any tool Claude has access to.
uv add llmemory
# or
pip install llmemory
Multi-query expansion improves search recall by:
- Generating multiple variants of the original query
- Retrieving with each variant independently
- Fusing the result sets with Reciprocal Rank Fusion (RRF), so chunks that several variants agree on rank higher

Two expansion modes:
- Heuristic (default): rule-based variants (keyword, OR, quoted phrase); no LLM or API key required
- LLM-based (optional): semantic paraphrases from a configured model such as gpt-4o-mini

When to use multi-query expansion:
- Ambiguous or under-specified queries
- Short or exploratory queries that need broad coverage
- Question-style queries with several plausible phrasings

When NOT to use:
- Very specific queries (exact identifiers, quoted phrases)
- Latency-sensitive paths such as autocomplete
from llmemory import LLMemory, SearchType

async with LLMemory(connection_string="postgresql://localhost/mydb") as memory:
    # Enable query expansion
    results = await memory.search(
        owner_id="workspace-1",
        query_text="improve customer satisfaction",
        search_type=SearchType.HYBRID,
        query_expansion=True,   # Enable expansion
        max_query_variants=3,   # Generate 3 variants
        limit=10
    )

    # Results are from all 3 query variants, fused with RRF
    for result in results:
        print(f"[{result.rrf_score:.3f}] {result.content[:80]}...")
Signature:
async def search(
    owner_id: str,
    query_text: str,
    search_type: Union[SearchType, str] = SearchType.HYBRID,
    limit: int = 10,
    query_expansion: Optional[bool] = None,
    max_query_variants: Optional[int] = None,
    **kwargs
) -> List[SearchResult]
Query Expansion Parameters:
- query_expansion (bool, optional): Enable/disable query expansion
  - None (default): Follow global config (LLMEMORY_ENABLE_QUERY_EXPANSION)
  - True: Force enable for this search
  - False: Force disable for this search
- max_query_variants (int, optional): Maximum query variants to generate
  - None (default): Follow global config (search.max_query_variants)

Returns:
- List[SearchResult] with multi-query specific behavior:
  - rrf_score reflects consensus across variants

Example:
# Basic multi-query search
results = await memory.search(
owner_id="workspace-1",
query_text="reduce server latency",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=3,
limit=15
)
# With heuristic expansion (the default), the variants would be:
# 1. "reduce server latency" (original)
# 2. "reduce OR server OR latency" (OR variant)
# 3. "\"reduce server latency\"" (quoted phrase)
# All 3 variants are searched and the results are fused
llmemory generates query variants using heuristic rules (no LLM required by default):
Original Query: "customer retention strategies"
Generated Variants:
1. "customer retention strategies" (original, always included)
2. "customer OR retention OR strategies" (OR variant - widens lexical recall)
3. "\"customer retention strategies\"" (quoted phrase - exact match)
Each variant captures a different matching strategy:
- Original: Standard BM25 matching
- OR variant: Boolean OR to catch documents with any key term
- Quoted phrase: Exact phrase matching for precision
With stopwords:
Original Query: "how to improve the customer satisfaction"
Generated Variants:
1. "how to improve the customer satisfaction" (original)
2. "how improve customer satisfaction" (keyword variant - stopwords removed)
3. "how OR improve OR customer OR satisfaction" (OR variant)
4. "\"how to improve the customer satisfaction\"" (quoted phrase)
# Internally, multi-query does:
# 1. Generate variants
variants = [
"customer retention strategies", # original
"customer OR retention OR strategies", # OR variant
"\"customer retention strategies\"" # quoted phrase
]
# 2. Search with each variant (executed sequentially)
results_1 = await search(query=variants[0], ...)
results_2 = await search(query=variants[1], ...)
results_3 = await search(query=variants[2], ...)
# 3. Fuse results using RRF
final_results = rrf_fusion([results_1, results_2, results_3])
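llmemory executes the variant searches sequentially, as noted above. If you orchestrate variant searches yourself at the application level, you can fan them out concurrently; a minimal sketch using asyncio.gather, assuming you have already generated the variant strings:

```python
import asyncio

# Sketch of application-level concurrent retrieval. llmemory's built-in
# expansion searches variants one after another; here we fan out ourselves.
async def search_variants_concurrently(memory, variants: list[str]):
    tasks = [
        memory.search(owner_id="workspace-1", query_text=v, limit=10)
        for v in variants
    ]
    # One result list per variant, ready to be fused (e.g., with RRF)
    return await asyncio.gather(*tasks)
```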
Reciprocal Rank Fusion combines results from multiple query variants:
For each query variant:
    For each result in that variant:
        score = 1 / (k + rank + 1)

For each unique chunk (by chunk_id):
    total_score = sum of scores from all variants

Sort by total_score descending
Where k = 50, a constant that prevents top-ranked results from dominating. The + 1 ensures the first result (rank = 0) scores 1/(50+0+1) = 1/51 rather than 1/50.
Key insight: Chunks appearing in multiple variant result sets get higher RRF scores, indicating strong relevance consensus.
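As a concrete illustration, here is a minimal self-contained sketch of RRF fusion over ranked lists of chunk IDs (not llmemory's internal implementation; k = 50 matches the constant above):

```python
from collections import defaultdict

def rrf_fusion(ranked_lists: list[list[str]], k: int = 50) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    # Chunks found by several variants accumulate score from each list
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# A chunk ranked first (rank=0) by two variants scores 2/51 ~= 0.0392,
# beating a chunk ranked first by only one variant (1/51 ~= 0.0196).
fused = rrf_fusion([
    ["c1", "c2", "c3"],   # variant 1 results
    ["c2", "c1", "c4"],   # variant 2 results
])
```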
# Original query is vague
results = await memory.search(
owner_id="support-team",
query_text="login problems",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=3,
limit=20
)
# Variants use different matching strategies:
# - "login problems" (original)
# - "login OR problems" (OR variant - widens recall)
# - "\"login problems\"" (exact phrase)
#
# Results include:
# - Documents with both "login" AND "problems" (original)
# - Documents with either "login" OR "problems" (OR variant)
# - Documents with exact phrase "login problems" (quoted variant)
# Technical query benefits from multiple phrasings
results = await memory.search(
owner_id="docs-site",
query_text="async function error handling",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=4,
limit=15
)
# Variants use different matching:
# - "async function error handling" (original)
# - "async OR function OR error OR handling" (OR variant)
# - "\"async function error handling\"" (exact phrase)
# (no keyword variant here: the query contains no stopwords, so only 3 of the
#  requested 4 variants exist)
#
# Balances precision (exact phrase) with recall (OR variant)
# Exploratory queries need broad coverage
results = await memory.search(
owner_id="research-db",
query_text="climate change mitigation",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=5,
limit=25
)
# Variants use different matching:
# - "climate change mitigation" (original)
# - "climate OR change OR mitigation" (OR variant)
# - "\"climate change mitigation\"" (exact phrase)
# (max_query_variants=5 is a cap, not a target: only 3 heuristic variants
#  exist for this stopword-free query)
# Product searches benefit from synonyms and alternatives
results = await memory.search(
owner_id="store-1",
query_text="lightweight laptop for travel",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=3,
metadata_filter={"category": "electronics"},
limit=20
)
# With max_query_variants=3, only the first 3 heuristic variants are searched:
# - "lightweight laptop for travel" (original)
# - "lightweight laptop travel" (keyword variant - stopword "for" removed)
# - "lightweight OR laptop OR travel" (OR variant)
# (the quoted-phrase variant is cut by the 3-variant cap)
# Environment variables
LLMEMORY_ENABLE_QUERY_EXPANSION=1
LLMEMORY_MAX_QUERY_VARIANTS=3
from llmemory import LLMemoryConfig
config = LLMemoryConfig()
config.search.enable_query_expansion = True
config.search.max_query_variants = 3
memory = LLMemory(
connection_string="postgresql://localhost/mydb",
config=config
)
For semantic query diversity, enable LLM-based expansion:
Environment Variables:
LLMEMORY_ENABLE_QUERY_EXPANSION=1
LLMEMORY_QUERY_EXPANSION_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...
Programmatic Configuration:
from llmemory import LLMemoryConfig
config = LLMemoryConfig()
config.search.enable_query_expansion = True
config.search.query_expansion_model = "gpt-4o-mini" # Enable LLM expansion
config.search.max_query_variants = 3
memory = LLMemory(
connection_string="postgresql://localhost/mydb",
openai_api_key="sk-...",
config=config
)
LLM vs Heuristic Comparison:
| Mode | Latency | Quality | Cost | Use Case |
|---|---|---|---|---|
| Heuristic | <1ms | Good | Free | Default, high-QPS |
| LLM | 50-200ms | Excellent | ~$0.001/query | Quality-critical |
LLM Expansion Example:
# Original: "improve customer retention"
# LLM variants:
# 1. "strategies to reduce customer churn"
# 2. "methods for increasing customer loyalty"
# 3. "how to keep customers from leaving"
results = await memory.search(
owner_id="workspace-1",
query_text="improve customer retention",
query_expansion=True,
max_query_variants=3,
limit=10
)
# Override global config for specific searches
# Force enable (even if globally disabled)
results = await memory.search(
owner_id="workspace-1",
query_text="vague query here",
query_expansion=True, # Force enable
max_query_variants=4,
limit=10
)
# Force disable (even if globally enabled)
results = await memory.search(
owner_id="workspace-1",
query_text="very specific query",
query_expansion=False, # Force disable
limit=10
)
import time
# Without query expansion (fast)
start = time.time()
results = await memory.search(
owner_id="workspace-1",
query_text="test query",
query_expansion=False,
limit=10
)
elapsed_single = (time.time() - start) * 1000
print(f"Single query: {elapsed_single:.2f}ms")
# With query expansion (slower)
start = time.time()
results = await memory.search(
owner_id="workspace-1",
query_text="test query",
query_expansion=True,
max_query_variants=3,
limit=10
)
elapsed_multi = (time.time() - start) * 1000
print(f"Multi-query: {elapsed_multi:.2f}ms")
# Typical overhead:
# - Variant generation: <1ms (heuristic rules, no LLM)
# - Additional searches: 3x search time (executed in sequence)
# - RRF fusion: 5-10ms
# Total overhead: ~3x base search latency + minimal fusion overhead
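For example, if a single search takes ~45 ms, a 3-variant expansion lands around 3 × 45 + ~1 (generation) + ~7 (fusion) ≈ 143 ms; the numbers are illustrative but consistent with the overhead breakdown above.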
# Use fewer variants for speed
results = await memory.search(
owner_id="workspace-1",
query_text="query",
query_expansion=True,
max_query_variants=2, # Faster than 3-4 (fewer searches to execute)
limit=10
)
Good use cases:
- Ambiguous or under-specified queries ("login problems")
- Exploratory and research queries that need broad coverage
- Question-style queries with several natural phrasings

Avoid for:
- Highly specific queries (product names, IDs, exact phrases)
- Latency-sensitive endpoints such as autocomplete
- High-QPS paths where the ~3x search cost is unacceptable (see the dynamic expansion helper later in this doc for routing between the two)
# Combine query expansion with reranking for best quality
results = await memory.search(
owner_id="workspace-1",
query_text="machine learning deployment",
search_type=SearchType.HYBRID,
query_expansion=True, # Generate variants
max_query_variants=3,
rerank=True, # Rerank fused results
rerank_top_k=50, # Consider top 50 from RRF
rerank_return_k=15, # Return top 15 after reranking
limit=15
)
# Pipeline:
# 1. Generate 3 query variants
# 2. Search with each variant
# 3. Fuse with RRF (top 50 candidates)
# 4. Rerank top 50 candidates
# 5. Return top 15 after reranking
# Apply filters to all query variants
from datetime import datetime
results = await memory.search(
owner_id="workspace-1",
query_text="quarterly performance",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=3,
metadata_filter={
"department": "finance",
"year": 2024
},
date_from=datetime(2024, 1, 1),
limit=20
)
# All 3 variants search within filtered documents only
# Tune alpha for all query variants
results = await memory.search(
owner_id="workspace-1",
query_text="customer feedback analysis",
search_type=SearchType.HYBRID,
alpha=0.6, # Applied to all variants
query_expansion=True,
max_query_variants=3,
limit=15
)
# Each variant uses alpha=0.6 for hybrid search
# Results are then fused with RRF
Multi-query search logs include the generated variants in diagnostics:
# After search, variants are logged
# Check application logs or search history for:
# {
# "query_variants": [
# "original query",
# "variant 1",
# "variant 2"
# ],
# "variant_stats": [
# {"query": "original", "result_count": 15, "latency_ms": 45.3},
# {"query": "variant 1", "result_count": 18, "latency_ms": 42.1},
# {"query": "variant 2", "result_count": 12, "latency_ms": 48.7}
# ]
# }
results = await memory.search(
owner_id="workspace-1",
query_text="test",
query_expansion=True,
max_query_variants=3,
limit=10
)
for result in results:
    print(f"Chunk: {result.chunk_id}")
    print(f"  RRF Score: {result.rrf_score:.4f}")
    print(f"  Content: {result.content[:80]}...")
    print()
# Higher RRF scores indicate:
# - Chunk appeared in multiple variant results
# - Chunk ranked highly across variants
# - Strong consensus on relevance
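To see these diagnostics during development, raise the library's log level. This assumes llmemory logs through Python's standard logging module under a logger named "llmemory"; the logger name is an assumption, so adjust it if your version differs:

```python
import logging

# Assumption: llmemory emits search diagnostics via the stdlib logging module.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("llmemory").setLevel(logging.DEBUG)
```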
❌ Wrong: Using multi-query for all searches
# Don't enable globally if not needed
results = await memory.search(
owner_id="workspace-1",
query_text="iPhone 14 Pro", # Very specific, doesn't need expansion
query_expansion=True, # Adds latency without benefit
limit=10
)
✅ Right: Use selectively for complex queries
# Enable only when query benefits from expansion
if is_complex_query(query_text):
    query_expansion = True
else:
    query_expansion = False

results = await memory.search(
    owner_id="workspace-1",
    query_text=query_text,
    query_expansion=query_expansion,
    limit=10
)
❌ Wrong: Too many query variants
results = await memory.search(
owner_id="workspace-1",
query_text="test",
query_expansion=True,
max_query_variants=10, # Too many, diminishing returns
limit=10
)
# Latency increases linearly but quality plateaus
✅ Right: Use 2-4 variants for balance
results = await memory.search(
owner_id="workspace-1",
query_text="test",
query_expansion=True,
max_query_variants=3, # Good balance of quality vs speed
limit=10
)
❌ Wrong: Ignoring latency requirements
# Real-time autocomplete endpoint
@app.get("/autocomplete")
async def autocomplete(q: str):
results = await memory.search(
owner_id="workspace-1",
query_text=q,
query_expansion=True, # Too slow for autocomplete!
limit=5
)
return results
✅ Right: Consider latency constraints
# Use multi-query for main search, not autocomplete
@app.get("/autocomplete")
async def autocomplete(q: str):
results = await memory.search(
owner_id="workspace-1",
query_text=q,
query_expansion=False, # Fast single query
limit=5
)
return results
@app.get("/search")
async def search(q: str):
results = await memory.search(
owner_id="workspace-1",
query_text=q,
query_expansion=True, # Quality search with expansion
max_query_variants=3,
limit=20
)
return results
By default, multi-query uses heuristic rules to generate variants; no LLM is involved unless you configure one (see the LLM-based expansion section above):
Removes common stopwords to focus on key terms:
from llmemory.query_expansion import DEFAULT_STOPWORDS
# Stopwords include: a, an, and, are, as, at, be, by, for, from, has,
# in, is, it, of, on, or, that, the, to, was, were, will, with
# Example:
# Input: "how to improve the customer satisfaction"
# Output: "how improve customer satisfaction"
#
# Input: "best practices for the database"
# Output: "best practices database"
Enabled by default via config.search.include_keyword_variant = True.
Creates Boolean OR of all non-stopword terms to maximize recall:
# Example:
# Input: "customer retention strategies"
# Output: "customer OR retention OR strategies"
#
# Input: "reduce server latency"
# Output: "reduce OR server OR latency"
Only generated for multi-word queries. Widens recall by matching documents containing ANY of the key terms.
Wraps the query in quotes for exact phrase matching:
# Example:
# Input: "machine learning deployment"
# Output: "\"machine learning deployment\""
#
# Input: "error handling"
# Output: "\"error handling\""
Only generated for multi-word queries. Ensures high precision by requiring exact phrase match.
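Putting the three rules together, a simplified re-implementation of the heuristic generator might look like the following. This is a sketch of the documented behavior, not llmemory's actual code:

```python
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
             "has", "in", "is", "it", "of", "on", "or", "that", "the", "to",
             "was", "were", "will", "with"}

def heuristic_variants(query: str) -> list[str]:
    """Generate keyword, OR, and quoted-phrase variants of a query."""
    words = query.split()
    keywords = [w for w in words if w.lower() not in STOPWORDS]
    variants = []
    # Keyword variant: only emitted if stopword removal changed the query
    if len(keywords) < len(words):
        variants.append(" ".join(keywords))
    if len(words) > 1:
        # OR variant: match documents containing ANY key term
        variants.append(" OR ".join(keywords))
        # Quoted phrase: exact-match the original query
        variants.append(f'"{query}"')
    return variants

# heuristic_variants("how to improve the customer satisfaction")
# -> ["how improve customer satisfaction",
#     "how OR improve OR customer OR satisfaction",
#     '"how to improve the customer satisfaction"']
```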
from llmemory.query_expansion import QueryExpansionService
from llmemory.config import SearchConfig
service = QueryExpansionService(SearchConfig())
# With stopwords
variants = service._heuristic_variants(
"how to improve the customer satisfaction",
include_keywords=True
)
# Returns:
# 1. "how improve customer satisfaction" (keyword variant)
# 2. "how OR improve OR customer OR satisfaction" (OR variant)
# 3. "\"how to improve the customer satisfaction\"" (quoted phrase)
# Without stopwords
variants = service._heuristic_variants(
"machine learning deployment",
include_keywords=True
)
# Returns:
# 1. "machine OR learning OR deployment" (OR variant, no keyword variant since no stopwords)
# 2. "\"machine learning deployment\"" (quoted phrase)
The default implementation uses heuristics, but you can provide a custom LLM callback:
from llmemory.query_expansion import QueryExpansionService, ExpansionCallback
async def my_llm_expander(query: str, max_variants: int) -> list[str]:
    """Custom LLM-based query expansion."""
    # Call your LLM here to generate semantic variants
    variants = await my_llm.generate_variants(query, max_variants)
    return variants

service = QueryExpansionService(
    search_config=config.search,
    llm_callback=my_llm_expander  # Optional custom expansion
)
When llm_callback is provided, it's tried first; heuristics are used as fallback if LLM fails.
def should_expand_query(query_text: str) -> tuple[bool, int]:
    """Decide if query needs expansion and how many variants."""
    # Short queries benefit most from expansion
    if len(query_text.split()) <= 3:
        return True, 4

    # Questions and exploratory queries
    question_words = ["how", "what", "why", "when", "where", "who"]
    if any(word in query_text.lower() for word in question_words):
        return True, 3

    # Specific queries don't need expansion
    if any(char.isdigit() for char in query_text):  # Has numbers
        return False, 1
    if '"' in query_text:  # Has quotes (exact phrase)
        return False, 1

    # Default: moderate expansion
    return True, 2

# Use dynamic expansion
query = "how to improve performance"
expand, variants = should_expand_query(query)

results = await memory.search(
    owner_id="workspace-1",
    query_text=query,
    query_expansion=expand,
    max_query_variants=variants,
    limit=10
)
import random
# Randomly enable/disable for 50% of queries
use_expansion = random.random() < 0.5
results = await memory.search(
owner_id="workspace-1",
query_text=query_text,
query_expansion=use_expansion,
limit=10
)
# Track metrics:
# - Click-through rate
# - Result relevance
# - User satisfaction
# - Search latency
# Compare A (no expansion) vs B (with expansion)
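Random assignment gives each request an independent coin flip. For per-user consistency you might instead bucket deterministically by user ID; a sketch, where user_id is a hypothetical identifier from your application:

```python
import hashlib

def in_expansion_bucket(user_id: str, fraction: float = 0.5) -> bool:
    """Deterministically assign a user to the expansion arm."""
    digest = hashlib.sha256(user_id.encode()).digest()
    # Map the first 4 bytes to [0, 1) and compare against the rollout fraction
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < fraction

use_expansion = in_expansion_bucket(user_id)  # same user, same arm every time
```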
Related skills:
- basic-usage - Core search operations
- hybrid-search - Vector + text hybrid search fundamentals
- rag - Using multi-query in RAG systems
- multi-tenant - Multi-tenant isolation patterns

Expansion Modes:
- Heuristic (default): rule-based variants with no LLM calls
- LLM-based: enabled by setting query_expansion_model in config.

No LLM Required for Default:
Query expansion works out-of-the-box with heuristic rules. No API key or LLM calls needed unless you configure query_expansion_model.
Cost Considerations (LLM mode only): LLM expansion makes 1 API call per search. For high-volume applications with LLM expansion, consider:
- Keeping heuristic mode (free, <1ms) on high-QPS paths
- Using a low-cost model such as gpt-4o-mini (~$0.001/query)
- Enabling expansion selectively per search rather than globally

Quality vs Speed:
- Heuristic mode: <1ms overhead, good quality - the right default for most workloads
- LLM mode: 50-200ms overhead, excellent quality - for quality-critical searches
- 2-4 variants balance quality against latency; beyond that, quality plateaus while latency grows
Fallback Behavior: If LLM expansion fails or times out (8s), the system automatically falls back to heuristic expansion, so search always completes.
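If you wire in a custom llm_callback, you can reproduce the same guardrail in your own code; a minimal sketch with asyncio.wait_for, reusing the my_llm_expander and heuristic_variants sketches from earlier (the 8-second timeout mirrors the default above):

```python
import asyncio

async def expand_with_fallback(query: str, max_variants: int) -> list[str]:
    """Try LLM expansion first; fall back to heuristics on error or timeout."""
    try:
        return await asyncio.wait_for(
            my_llm_expander(query, max_variants), timeout=8.0
        )
    except Exception:
        # Heuristic variants always succeed, so search always completes
        return heuristic_variants(query)[:max_variants]
```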