<!-- AUTO-GENERATED by export-plugins.py — DO NOT EDIT -->
---
name: enterprise-search
description: Design cross-tool knowledge retrieval strategies, architect enterprise search systems, and tune relevance models. Use when building internal search experiences, consolidating knowledge across tools, or improving search result quality.
tags: [search, knowledge-management, information-retrieval]
---
# Enterprise Search
Business-oriented framework for designing cross-tool knowledge retrieval, architecting enterprise search systems, and tuning relevance models. Focused on strategy and requirements — for technical implementation, see hybrid-search-implementation and similarity-search-patterns in the ai-ml domain.
## Use this skill when
- Designing an enterprise search strategy across multiple internal tools (Confluence, Slack, Drive, SharePoint, GitHub)
- Choosing between federated, centralized, or hybrid search architectures
- Defining relevance tuning requirements and quality metrics
- Building a knowledge taxonomy or metadata schema for searchable content
- Creating search UX requirements for internal portals
- Evaluating search quality and measuring improvement
## Do not use this skill when
- Implementing vector search or embeddings at code level (use hybrid-search-implementation)
- Building similarity search with specific vector databases (use similarity-search-patterns)
- Optimizing web SEO for external search engines (use seo-audit)
- Building RAG pipelines for LLM applications (use RAG skills in the ai-ml domain)
## Instructions
1. **Audit current state** — inventory all content sources, volumes, and access patterns.
2. **Choose architecture** — federated, centralized, or hybrid, based on your constraints.
3. **Design taxonomy** — define metadata schema, facets, and tagging standards.
4. **Define relevance model** — scoring factors, boosting rules, and personalization signals.
5. **Set quality metrics** — establish baselines and targets for search quality.
6. **Design search UX** — autocomplete, facets, snippets, and result presentation.
## Search Architecture Patterns
### Architecture Comparison
| Pattern | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Federated | Query multiple sources in real-time, merge results | No data duplication, real-time freshness | Slower, limited cross-source ranking | Small orgs (<500 people), few sources |
| Centralized | Ingest all content into single search index | Best relevance, fastest queries, unified ranking | Data duplication, sync complexity, stale content risk | Large orgs, search-critical workflows |
| Hybrid | Centralized index for primary sources + federated for long-tail | Balanced cost vs. quality | Most complex to maintain | Mid-to-large orgs with diverse source landscape |
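The federated pattern's "limited cross-source ranking" drawback is easiest to see in code. Below is a minimal sketch, assuming a hypothetical per-source search API and fabricated example results: because relevance scores from different systems are not comparable, the merge step can only interleave by per-source rank rather than sort by raw score.

```python
from concurrent.futures import ThreadPoolExecutor

def search_source(source, query):
    """Stand-in for a real per-source connector call (hypothetical data).
    Returns (doc_id, score) pairs in that source's own rank order."""
    fake_results = {
        "wiki":  [("wiki/deploy-guide", 0.9), ("wiki/onboarding", 0.4)],
        "slack": [("slack/C01/msg42", 0.7)],
    }
    return list(fake_results.get(source, []))

def federated_search(sources, query, k=10):
    """Fan the query out to all sources in parallel, then merge.
    Scores are not comparable across systems, so this sketch
    interleaves results round-robin by per-source rank."""
    with ThreadPoolExecutor() as pool:
        per_source = [list(r) for r in
                      pool.map(lambda s: search_source(s, query), sources)]
    merged = []
    while any(per_source) and len(merged) < k:
        for results in per_source:
            if results:
                merged.append(results.pop(0))
    return merged[:k]
```

A centralized index avoids this merge problem entirely, which is why it wins on relevance in the table above.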
### Connector Architecture
| Source | Connector Type | Sync Method | Typical Latency |
|---|---|---|---|
| Confluence / Wiki | REST API | Incremental (webhook + poll) | Near real-time |
| Slack / Teams | Events API | Streaming | Real-time |
| Google Drive | Drive API + Changes API | Incremental | 5-15 min |
| SharePoint | Graph API | Delta query | 5-15 min |
| GitHub | Webhooks + REST API | Event-driven | Near real-time |
| Jira / Linear | REST API + Webhooks | Incremental | Near real-time |
| Email | Graph API / Gmail API | Incremental | 15-30 min |
| Database | CDC (Change Data Capture) | Streaming | Near real-time |
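Most incremental connectors in the table share one loop shape: pull changes since a stored cursor, index them, persist the new cursor. A minimal sketch, where `fetch_changes` stands in for a source delta API (e.g. the Drive Changes API or a Graph delta query) returning `(changed_docs, next_cursor)`:

```python
def incremental_sync(fetch_changes, index_doc, state):
    """One pass of a generic incremental connector: index everything
    changed since the stored cursor, then persist the new cursor so
    the next poll resumes where this one stopped."""
    docs, next_cursor = fetch_changes(state.get("cursor"))
    for doc in docs:
        index_doc(doc)
    state["cursor"] = next_cursor  # resume point for the next poll
    return len(docs)
```

Webhook-driven sources (Slack, GitHub) replace the poll with a push, but still need this cursor-based pass as a fallback for missed events.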
### Access Control
**Critical requirement:** search results must respect source-level permissions.
| Approach | How It Works | Trade-off |
|---|---|---|
| Early binding | Filter at index time (only index what user can access) | Secure but requires per-user indices or ACL tagging |
| Late binding | Filter at query time (check permissions on each result) | Simpler indexing but slower queries at scale |
| Hybrid | Group-based ACL at index + user-level check at query | Best balance for most orgs |
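The hybrid row above combines both bindings. A minimal sketch, assuming each result carries the `access_groups` field written at index time, with a hypothetical `verify_live` callback for the optional query-time re-check:

```python
def filter_results(results, user_groups, verify_live=None):
    """Hybrid ACL filtering sketch. Index-time group ACLs (early
    binding): a result survives only if the user shares at least one
    group. The optional verify_live callback re-checks against the
    source of truth at query time (late binding) for sensitive
    sources where stale ACL tags are unacceptable."""
    allowed = [r for r in results
               if set(r["access_groups"]) & set(user_groups)]
    if verify_live is not None:
        allowed = [r for r in allowed if verify_live(r)]
    return allowed
```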
## Knowledge Taxonomy Design
### Metadata Schema
Every indexed document should carry these metadata fields:
| Field | Type | Purpose | Example |
|---|---|---|---|
| `title` | string | Primary display and search field | "Q4 Revenue Report" |
| `source` | enum | Origin system | confluence, slack, drive, github |
| `content_type` | enum | Document classification | document, conversation, code, ticket |
| `team` | string | Owning team or department | "Engineering", "Sales" |
| `created_at` | datetime | For freshness scoring | 2026-01-15T10:30:00Z |
| `updated_at` | datetime | For freshness and deduplication | 2026-02-28T14:00:00Z |
| `author` | string | For personalization and credibility | "jane.doe@company.com" |
| `access_groups` | list[string] | For permission filtering | ["engineering", "all-staff"] |
| `tags` | list[string] | For faceted navigation | ["architecture", "adr", "database"] |
| `status` | enum | Content lifecycle | draft, published, archived |
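A concrete document carrying this schema, with a small validator enforcing the required fields and the tag cap from the standards below. The field values are illustrative, not prescribed:

```python
# One indexed document carrying the full metadata schema.
doc = {
    "title": "Q4 Revenue Report",
    "source": "confluence",
    "content_type": "document",
    "team": "Finance",
    "created_at": "2026-01-15T10:30:00Z",
    "updated_at": "2026-02-28T14:00:00Z",
    "author": "jane.doe@company.com",
    "access_groups": ["finance", "leadership"],
    "tags": ["revenue", "quarterly-report"],
    "status": "published",
}

REQUIRED = {"title", "source", "content_type", "team", "created_at",
            "updated_at", "author", "access_groups", "tags", "status"}

def validate(document):
    """Reject documents missing schema fields or over the 5-tag cap."""
    missing = REQUIRED - document.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if len(document["tags"]) > 5:
        raise ValueError("max 5 tags per document")
    return True
```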
### Tagging Standards
| Rule | Rationale |
|---|---|
| Use a controlled vocabulary (not free-text tags) | Prevents tag proliferation and inconsistency |
| Max 5 tags per document | Forces specificity and prevents over-tagging |
| Tags use kebab-case | Consistency with URLs and search queries |
| Review tag taxonomy quarterly | Remove unused tags, merge synonyms |
| Auto-tag where possible | Use classification models to suggest tags on creation |
### Content Freshness Policies
| Content Type | Freshness Target | Stale Threshold | Action When Stale |
|---|---|---|---|
| Documentation | Updated quarterly | >6 months | Flag for review |
| Meeting notes | Permanent | N/A | Reduce ranking weight over time |
| Code / PRs | Always current (live sync) | N/A | N/A |
| Tickets / Issues | Live sync | N/A | Archive closed items after 12 months |
| Policies / Runbooks | Updated semi-annually | >12 months | Alert content owner |
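The stale-threshold column translates directly into a check that can drive the "flag for review" and "alert owner" actions. A minimal sketch, where the threshold values come from the table above and types without a threshold never go stale:

```python
from datetime import datetime, timedelta, timezone

# Stale thresholds from the policy table; types without an entry
# (meeting notes, live-synced code and tickets) never go stale.
STALE_AFTER = {
    "documentation": timedelta(days=183),  # >6 months
    "policy":        timedelta(days=365),  # >12 months
}

def is_stale(content_type, updated_at, now=None):
    """True when the last update exceeds the type's stale threshold."""
    threshold = STALE_AFTER.get(content_type)
    if threshold is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now - updated_at > threshold
```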
## Relevance Tuning Framework
### Scoring Factors
| Factor | Weight | Description |
|---|---|---|
| Text relevance (BM25) | 40% | Keyword match quality — title, body, tags |
| Freshness | 20% | More recent content ranked higher (decay function) |
| Popularity | 15% | View count, link count, citation count |
| Personalization | 15% | User's team, recent searches, frequently accessed sources |
| Source authority | 10% | Official docs > Slack messages > personal notes |
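These weights combine as a simple weighted sum over normalized per-factor signals. A minimal sketch using the weights from the table; the 90-day half-life in the decay function is an assumed starting point for tuning, not a prescribed value:

```python
# Factor weights from the scoring table above.
WEIGHTS = {"text": 0.40, "freshness": 0.20, "popularity": 0.15,
           "personalization": 0.15, "authority": 0.10}

def freshness_decay(age_days, half_life_days=90):
    """Exponential decay: the freshness signal halves every half-life."""
    return 0.5 ** (age_days / half_life_days)

def combined_score(signals):
    """Weighted sum of per-factor signals, each normalized to 0..1."""
    return sum(WEIGHTS[k] * signals[k] for k in WEIGHTS)
```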
### Field Boosting
| Field | Boost Factor | Rationale |
|---|---|---|
| Title | 3.0x | Titles are the strongest relevance signal |
| Tags | 2.0x | Curated metadata is high-signal |
| Headings (H1-H3) | 1.5x | Section headers indicate topic boundaries |
| Body text | 1.0x | Baseline — full content match |
| Comments | 0.5x | Noisy, often tangential |
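In an Elasticsearch-style engine, these boosts map onto a `multi_match` query via the `field^boost` syntax. A sketch of the request body; the field names (`title`, `tags`, `headings`, `body`, `comments`) are assumed index fields, not a fixed standard:

```python
# Elasticsearch-style query body applying the boost table above.
query_body = {
    "query": {
        "multi_match": {
            "query": "deployment runbook",
            "fields": ["title^3.0", "tags^2.0", "headings^1.5",
                       "body", "comments^0.5"],  # body = 1.0x baseline
        }
    }
}
```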
### Query Understanding
| Technique | Purpose | Example |
|---|---|---|
| Synonym expansion | Match equivalent terms | "deploy" → "deploy, release, ship" |
| Spell correction | Handle typos | "kuberntes" → "kubernetes" |
| Intent classification | Route to specialized search | "how do I deploy" → tutorial filter |
| Entity recognition | Boost specific entities | "John's PR for auth" → person + code filter |
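Synonym expansion, the simplest of these techniques, can be sketched in a few lines. The synonym groups here are hypothetical; in practice they come from a curated dictionary or are mined from query logs:

```python
# Hypothetical synonym groups keyed by canonical token.
SYNONYMS = {"deploy": ["deploy", "release", "ship"]}

def expand_query(query):
    """Rewrite each token as the OR of its synonym group, leaving
    tokens without a group untouched."""
    parts = []
    for token in query.lower().split():
        group = SYNONYMS.get(token, [token])
        parts.append("(" + " OR ".join(group) + ")" if len(group) > 1
                     else token)
    return " ".join(parts)
```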
## Search Quality Metrics
### Core Metrics
| Metric | Formula | Target | How to Measure |
|---|---|---|---|
| MRR (Mean Reciprocal Rank) | Average of 1/rank of first relevant result | >0.6 | Relevance judgments on sample queries |
| NDCG@10 | Normalized discounted cumulative gain at position 10 | >0.7 | Graded relevance judgments |
| Precision@5 | % of top 5 results that are relevant | >60% | Binary relevance judgments |
| Zero-Result Rate | % of queries returning no results | <5% | Log analysis |
| Click-Through Rate | % of searches that result in a click | >40% | Click tracking |
| Query Reformulation Rate | % of searches followed by a refined query | <20% | Session analysis |
| Time to Result | p50 and p95 query latency | p50 <200ms, p95 <1s | Infrastructure monitoring |
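The judgment-based metrics are cheap to compute once raters have scored a query sample. Minimal reference implementations, taking relevance judgments in rank order (binary 0/1 for MRR and Precision@k, graded 0-3 for NDCG):

```python
import math

def mrr(ranked_relevance):
    """Mean reciprocal rank: average of 1/rank of the first relevant
    result per query; each item is a list of 0/1 judgments."""
    total = 0.0
    for judgments in ranked_relevance:
        for rank, rel in enumerate(judgments, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg_at_k(gains, ideal_gains, k=10):
    """NDCG@k over graded (0-3) judgments: DCG of the actual ranking
    divided by DCG of the ideal (descending) ordering."""
    def dcg(g):
        return sum(v / math.log2(i + 1) for i, v in enumerate(g[:k], start=1))
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

def precision_at_k(judgments, k=5):
    """Share of the top-k results judged relevant (binary)."""
    top = judgments[:k]
    return sum(top) / len(top)
```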
### Quality Improvement Loop
1. Sample 100 queries weekly from search logs
2. Have 2+ raters judge relevance of top 10 results (0-3 scale)
3. Calculate MRR, NDCG@10, Precision@5
4. Identify failure patterns (categories of bad results)
5. Adjust relevance model (boosting, synonyms, freshness weights)
6. A/B test changes against baseline
7. Repeat monthly
## Search UX Patterns
| Pattern | Purpose | Implementation Notes |
|---|---|---|
| Autocomplete | Reduce typing, guide to known content | Suggest from titles, tags, and popular queries |
| Faceted navigation | Filter by source, type, team, date | Show counts per facet; update dynamically |
| Snippets / Highlights | Show matching content in context | Highlight query terms in 2-3 sentence excerpts |
| Related queries | Help users refine or explore | "People also searched for..." based on co-occurrence |
| Source badges | Indicate content origin | Confluence icon, Slack icon, etc. |
| Freshness indicator | Show content age | "Updated 2 days ago" vs. "Updated 2 years ago" |
| "Did you mean?" | Handle typos gracefully | Only suggest when confidence >80% |
## Output Template: Enterprise Search Requirements Document
```markdown
# Enterprise Search Requirements — [Project Name]
## Current State
- **Content sources:** [list with estimated volumes]
- **Current search tools:** [what people use today]
- **Top pain points:** [from user interviews]
## Architecture Decision
- **Pattern:** [Federated / Centralized / Hybrid]
- **Rationale:** [why this pattern]
- **Search platform:** [Elasticsearch, Typesense, Algolia, Vespa, etc.]
## Scope (Phase 1)
- **Sources to index:** [list with priority]
- **Content types:** [documents, conversations, code, tickets]
- **Users:** [target audience and access model]
## Relevance Model
- **Scoring factors:** [weights per factor]
- **Field boosting:** [title, tags, headings, body]
- **Freshness decay:** [function and parameters]
## Quality Targets
| Metric | Baseline | Target |
|--------|----------|--------|
| MRR | [current] | [goal] |
| Zero-result rate | [current] | <5% |
| p95 latency | [current] | <1s |
## Roadmap
- Phase 1: [Core sources, basic search] — [timeline]
- Phase 2: [Additional sources, relevance tuning] — [timeline]
- Phase 3: [Personalization, AI-powered features] — [timeline]
```
## Common Mistakes
- Indexing everything without curation — more content does not mean better search; noisy sources dilute quality
- Ignoring access control — leaking confidential documents through search is a security incident
- No freshness weighting — returning 3-year-old docs before this week's update frustrates users
- Not measuring search quality — if you don't measure MRR/NDCG, you can't improve
- Building search without user research — understand what people actually search for before designing the system
- Treating search as a one-time project — relevance tuning is ongoing; plan for continuous improvement
## Additional Resources
- Related skills: hybrid-search-implementation (ai-ml — technical implementation), similarity-search-patterns (ai-ml — vector search)
- Elasticsearch / OpenSearch — open-source search engines
- Algolia — managed search platform
- Vespa — open-source search and recommendation engine
<!-- Source: .faos/custom/skills/business/enterprise-search/SKILL.md -->