Enterprise Search | faos-architect

name: enterprise-search
description: Design cross-tool knowledge retrieval strategies, architect enterprise search systems, and tune relevance models. Use when building internal search experiences, consolidating knowledge across tools, or improving search result quality.
tags: [search, knowledge-management, information-retrieval]

Enterprise Search

Business-oriented framework for designing cross-tool knowledge retrieval, architecting enterprise search systems, and tuning relevance models. Focused on strategy and requirements — for technical implementation, see hybrid-search-implementation and similarity-search-patterns in the ai-ml domain.

Use this skill when

Designing an enterprise search strategy across multiple internal tools (Confluence, Slack, Drive, SharePoint, GitHub)
Choosing between federated, centralized, or hybrid search architectures
Defining relevance tuning requirements and quality metrics
Building a knowledge taxonomy or metadata schema for searchable content
Creating search UX requirements for internal portals
Evaluating search quality and measuring improvement

Do not use this skill when

Implementing vector search or embeddings at code level (use hybrid-search-implementation)
Building similarity search with specific vector databases (use similarity-search-patterns)
Optimizing web SEO for external search engines (use seo-audit)
Building RAG pipelines for LLM applications (use RAG skills in ai-ml domain)

Instructions

Audit current state — inventory all content sources, volumes, and access patterns.
Choose architecture — federated, centralized, or hybrid based on your constraints.
Design taxonomy — define metadata schema, facets, and tagging standards.
Define relevance model — scoring factors, boosting rules, and personalization signals.
Set quality metrics — establish baselines and targets for search quality.
Design search UX — autocomplete, facets, snippets, and result presentation.

Search Architecture Patterns

Architecture Comparison

Pattern	How It Works	Pros	Cons	Best For
Federated	Query multiple sources in real time, merge results	No data duplication, real-time freshness	Slower queries, limited cross-source ranking	Small orgs (<500 people), few sources
Centralized	Ingest all content into single search index	Best relevance, fastest queries, unified ranking	Data duplication, sync complexity, stale content risk	Large orgs, search-critical workflows
Hybrid	Centralized index for primary sources + federated for long-tail	Balanced cost vs. quality	Most complex to maintain	Mid-to-large orgs with diverse source landscape
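
Federated search has limited cross-source ranking partly because each source scores results on its own scale. A minimal sketch of one common workaround, reciprocal rank fusion (RRF), which merges on rank position instead of raw score. RRF is not prescribed by this skill; it is swapped in here for illustration, and the source names, result IDs, and `k=60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: dict[str, list[str]], k: int = 60) -> list[str]:
    """Merge per-source ranked result lists into one list using RRF.

    ranked_lists maps a source name to its result IDs, best first.
    k dampens the influence of top ranks (60 is an assumed default).
    """
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists.values():
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results returned by two federated sources for one query.
merged = reciprocal_rank_fusion({
    "confluence": ["conf-1", "conf-2", "conf-3"],
    "slack": ["slack-1", "slack-2"],
})
```

With disjoint result sets this reduces to interleaving by rank; a document returned by several sources accumulates score and rises.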

Connector Architecture

Source	Connector Type	Sync Method	Typical Latency
Confluence / Wiki	REST API	Incremental (webhook + poll)	Near real-time
Slack / Teams	Events API	Streaming	Real-time
Google Drive	Drive API + Changes API	Incremental	5-15 min
SharePoint	Graph API	Delta query	5-15 min
GitHub	Webhooks + REST API	Event-driven	Near real-time
Jira / Linear	REST API + Webhooks	Incremental	Near real-time
Email	Graph API / Gmail API	Incremental	15-30 min
Database	CDC (Change Data Capture)	Streaming	Near real-time

Access Control

Critical requirement: Search results must respect source-level permissions.

Approach	How It Works	Trade-off
Early binding	Filter at index time (only index what the user can access)	Secure but requires per-user indices or ACL tagging
Late binding	Filter at query time (check permissions on each result)	Simpler indexing but slower queries at scale
Hybrid	Group-based ACL at index + user-level check at query	Best balance for most orgs
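
The hybrid approach can be sketched as a two-stage filter. This is an in-memory illustration only: in a real deployment stage 1 would normally be a filter clause in the index query, and `can_read` is a hypothetical callback standing in for a permission check against the source system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    access_groups: frozenset[str]  # group ACL captured at index time


def filter_results(
    candidates: list[IndexedDoc],
    user_groups: set[str],
    can_read: Callable[[str], bool],
) -> list[IndexedDoc]:
    """Stage 1: cheap group filter on indexed ACL tags.
    Stage 2: user-level check, run only on the documents that survive stage 1."""
    group_visible = [d for d in candidates if d.access_groups & user_groups]
    return [d for d in group_visible if can_read(d.doc_id)]


# Hypothetical data: access to "adr-12" was revoked after it was indexed.
docs = [
    IndexedDoc("roadmap", frozenset({"all-staff"})),
    IndexedDoc("adr-12", frozenset({"engineering"})),
    IndexedDoc("salaries", frozenset({"hr"})),
]
revoked = {"adr-12"}
visible = filter_results(docs, {"engineering", "all-staff"}, lambda doc_id: doc_id not in revoked)
```

The group filter keeps the expensive per-user check off most of the corpus, while the query-time check catches permissions that changed since the last sync.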

Knowledge Taxonomy Design

Metadata Schema

Every indexed document should carry these metadata fields:

Field	Type	Purpose	Example
`title`	string	Primary display and search field	"Q4 Revenue Report"
`source`	enum	Origin system	confluence, slack, drive, github
`content_type`	enum	Document classification	document, conversation, code, ticket
`team`	string	Owning team or department	"Engineering", "Sales"
`created_at`	datetime	For freshness scoring	2026-01-15T10:30:00Z
`updated_at`	datetime	For freshness and deduplication	2026-02-28T14:00:00Z
`author`	string	For personalization and credibility	"user@example.com"
`access_groups`	list[string]	For permission filtering	["engineering", "all-staff"]
`tags`	list[string]	For faceted navigation	["architecture", "adr", "database"]
`status`	enum	Content lifecycle	draft, published, archived
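
One way to pin this schema down is as a typed record. A minimal sketch, assuming the enum values are exactly the examples listed in the table (a real deployment would likely have more sources); the class names and sample values are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Source(str, Enum):
    CONFLUENCE = "confluence"
    SLACK = "slack"
    DRIVE = "drive"
    GITHUB = "github"


class ContentType(str, Enum):
    DOCUMENT = "document"
    CONVERSATION = "conversation"
    CODE = "code"
    TICKET = "ticket"


class Status(str, Enum):
    DRAFT = "draft"
    PUBLISHED = "published"
    ARCHIVED = "archived"


@dataclass
class DocumentMetadata:
    title: str
    source: Source
    content_type: ContentType
    team: str
    created_at: datetime
    updated_at: datetime
    author: str
    access_groups: list[str]
    tags: list[str]
    status: Status


# Sample record; the author address is a placeholder.
meta = DocumentMetadata(
    title="Q4 Revenue Report",
    source=Source("confluence"),  # raises ValueError for an unknown source
    content_type=ContentType.DOCUMENT,
    team="Sales",
    created_at=datetime(2026, 1, 15, 10, 30, tzinfo=timezone.utc),
    updated_at=datetime(2026, 2, 28, 14, 0, tzinfo=timezone.utc),
    author="author@example.com",
    access_groups=["all-staff"],
    tags=["architecture"],
    status=Status.PUBLISHED,
)
```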

Tagging Standards

Rule	Rationale
Use controlled vocabulary (not free-text tags)	Prevents tag proliferation and inconsistency
Max 5 tags per document	Forces specificity and discourages over-tagging
Tags use kebab-case	Consistency with URLs and search queries
Review tag taxonomy quarterly	Remove unused tags, merge synonyms
Auto-tag where possible	Use classification models to suggest tags on creation
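
The first three rules are mechanical enough to enforce at ingestion. A minimal sketch of such a check; the function name, error messages, and sample vocabulary are assumptions.

```python
import re

# Lowercase alphanumeric words joined by single hyphens, e.g. "knowledge-management".
KEBAB_CASE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")
MAX_TAGS = 5


def validate_tags(tags: list[str], vocabulary: set[str]) -> list[str]:
    """Return a list of rule violations; an empty list means the tags pass."""
    errors = []
    if len(tags) > MAX_TAGS:
        errors.append(f"too many tags: {len(tags)} > {MAX_TAGS}")
    for tag in tags:
        if not KEBAB_CASE.fullmatch(tag):
            errors.append(f"not kebab-case: {tag!r}")
        elif tag not in vocabulary:
            errors.append(f"not in controlled vocabulary: {tag!r}")
    return errors


# Hypothetical controlled vocabulary.
vocabulary = {"architecture", "adr", "database"}
problems = validate_tags(["architecture", "Data Base", "k8s"], vocabulary)
```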

Content Freshness Policies

Content Type	Freshness Target	Stale Threshold	Action When Stale
Documentation	Updated quarterly	>6 months	Flag for review
Meeting notes	Permanent	N/A	Reduce ranking weight over time
Code / PRs	Always current (live sync)	N/A	N/A
Tickets / Issues	Live sync	N/A	Archive closed items after 12 months
Policies / Runbooks	Updated semi-annually	>12 months	Alert content owner
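
The two rows with a stale threshold can be expressed as a small lookup. A sketch under stated assumptions: months are approximated as days (6 months as 183, 12 months as 365), and the content-type keys are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed day counts for the ">6 months" and ">12 months" thresholds.
STALE_RULES = {
    "documentation": (timedelta(days=183), "flag for review"),
    "policy": (timedelta(days=365), "alert content owner"),
}


def stale_action(content_type: str, updated_at: datetime, now: datetime) -> Optional[str]:
    """Return the action for a stale document, or None if it is fresh
    or its content type has no stale threshold."""
    rule = STALE_RULES.get(content_type)
    if rule is None:
        return None
    threshold, action = rule
    return action if now - updated_at > threshold else None


now = datetime(2026, 3, 1, tzinfo=timezone.utc)
last_edit = datetime(2025, 6, 1, tzinfo=timezone.utc)  # 273 days before `now`
action = stale_action("documentation", last_edit, now)
```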

Relevance Tuning Framework

Scoring Factors

Factor	Weight	Description
Text relevance (BM25)	40%	Keyword match quality — title, body, tags
Freshness	20%	More recent content ranked higher (decay function)
Popularity	15%	View count, link count, citation count
Personalization	15%	User's team, recent searches, frequently accessed sources
Source authority	10%	Official docs > Slack messages > personal notes
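
These weights can be combined as a weighted sum. A minimal sketch, assuming every signal has already been normalized to the range 0 to 1 (raw BM25 scores are unbounded, so they need rescaling first) and assuming an exponential freshness decay with a 90-day half-life; neither the decay shape nor the half-life is specified by this skill.

```python
WEIGHTS = {
    "text": 0.40,
    "freshness": 0.20,
    "popularity": 0.15,
    "personalization": 0.15,
    "authority": 0.10,
}


def freshness_score(age_days: float, half_life_days: float = 90.0) -> float:
    """Exponential decay: 1.0 for new content, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def combined_score(signals: dict[str, float]) -> float:
    """Weighted sum of normalized signals; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


# Hypothetical document: strong keyword match, 90 days old, official source.
score = combined_score({
    "text": 0.8,
    "freshness": freshness_score(90),
    "popularity": 0.2,
    "personalization": 0.0,
    "authority": 1.0,
})
```

Keeping the weights in one table makes it straightforward to A/B test alternative weightings against the baseline.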

Field Boosting

Field	Boost Factor	Rationale
Title	3.0x	Titles are the strongest relevance signal
Tags	2.0x	Curated metadata is high-signal
Headings (H1-H3)	1.5x	Section headers indicate topic boundaries
Body text	1.0x	Baseline — full content match
Comments	0.5x	Noisy, often tangential
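
Elasticsearch and OpenSearch accept per-field boosts through the `field^boost` syntax of a `multi_match` query. A sketch of a request body built from the table above; the index field names (`title`, `tags`, `headings`, `body`, `comments`) are assumptions about how the content was mapped.

```python
FIELD_BOOSTS = {
    "title": 3.0,
    "tags": 2.0,
    "headings": 1.5,
    "body": 1.0,
    "comments": 0.5,
}


def build_query(user_query: str) -> dict:
    """Build a multi_match request body with one boosted entry per field."""
    fields = [f"{name}^{boost}" for name, boost in FIELD_BOOSTS.items()]
    return {"query": {"multi_match": {"query": user_query, "fields": fields}}}


body = build_query("database migration runbook")
```

The resulting dictionary can be serialized to JSON and sent as the body of a search request.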

Query Understanding

Technique	Purpose	Example
Synonym expansion	Match equivalent terms	"deploy" → "deploy, release, ship"
Spell correction	Handle typos	"kuberntes" → "kubernetes"
Intent classification	Route to specialized search	"how do I deploy" → tutorial filter
Entity recognition	Boost specific entities	"John's PR for auth" → person + code filter
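
Synonym expansion is the simplest of these to illustrate. A minimal sketch that expands each query term into a group of alternatives; the synonym table holds only the example from the row above, and whitespace tokenization is a simplifying assumption.

```python
# One-directional synonym table: a term expands to itself plus its synonyms.
SYNONYMS = {"deploy": ["release", "ship"]}


def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> list[list[str]]:
    """Return one list of alternatives per query term.

    A search backend would typically OR the terms within a group
    and combine the groups with each other.
    """
    return [[term] + synonyms.get(term, []) for term in query.lower().split()]


expanded = expand_query("how do I deploy")
```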

Search Quality Metrics

Core Metrics

Metric	Formula	Target	How to Measure
MRR (Mean Reciprocal Rank)	Mean, across queries, of 1/rank of the first relevant result	>0.6	Relevance judgments on sample queries
NDCG@10	Normalized discounted cumulative gain at position 10	>0.7	Graded relevance judgments
Precision@5	% of top 5 results that are relevant	>60%	Binary relevance judgments
Zero-Result Rate	% of queries returning no results	<5%	Log analysis
Click-Through Rate	% of searches that result in a click	>40%	Click tracking
Query Reformulation Rate	% of searches followed by a refined query	<20%	Session analysis
Time to Result	p50 and p95 query latency	p50 <200ms, p95 <1s	Infrastructure monitoring
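
The three judgment-based metrics can be computed directly from rater output. A sketch under stated assumptions: DCG uses linear gain (grade divided by log2 of rank + 1; an exponential-gain variant also exists), and the ideal ordering is taken from the judged results only.

```python
import math


def reciprocal_rank(relevant: list[bool]) -> float:
    """1/rank of the first relevant result, or 0 if none is relevant."""
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: list[list[bool]]) -> float:
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def precision_at_k(relevant: list[bool], k: int = 5) -> float:
    """Fraction of the top k results judged relevant."""
    return sum(relevant[:k]) / k


def dcg(grades: list[int], k: int) -> float:
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades[:k], start=1))


def ndcg_at_k(grades: list[int], k: int = 10) -> float:
    """DCG of the actual ordering divided by DCG of the best possible ordering."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0


# Hypothetical judgments: two queries for MRR, one ranked list for the others.
mrr = mean_reciprocal_rank([[False, True], [True]])
p5 = precision_at_k([True, False, True, True, False])
ndcg = ndcg_at_k([3, 2, 1, 0])
```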

Quality Improvement Loop

1. Sample 100 queries weekly from search logs
2. Have 2+ raters judge relevance of top 10 results (0-3 scale)
3. Calculate MRR, NDCG@10, Precision@5
4. Identify failure patterns (categories of bad results)
5. Adjust relevance model (boosting, synonyms, freshness weights)
6. A/B test changes against baseline
7. Repeat monthly
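
Step 2 produces several grades per result, which need to be reconciled before step 3. A minimal sketch that averages the raters and flags results where they disagree; the disagreement threshold of more than 1 grade point is an assumption, as are the rater names.

```python
from statistics import mean


def aggregate_judgments(
    ratings: dict[str, list[int]], max_spread: int = 1
) -> tuple[list[float], list[int]]:
    """ratings maps a rater to grades (0-3) for the same ranked results.

    Returns the mean grade per result and the indices of results
    whose grades differ by more than max_spread.
    """
    per_result = list(zip(*ratings.values()))
    grades = [mean(result) for result in per_result]
    disputed = [i for i, result in enumerate(per_result) if max(result) - min(result) > max_spread]
    return grades, disputed


grades, disputed = aggregate_judgments({
    "rater_a": [3, 0, 2],
    "rater_b": [2, 3, 2],
})
```

Disputed results are good candidates for a third rating or a clarification of the grading guidelines.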

Search UX Patterns

Pattern	Purpose	Implementation Notes
Autocomplete	Reduce typing, guide to known content	Suggest from titles, tags, and popular queries
Faceted navigation	Filter by source, type, team, date	Show counts per facet; update dynamically
Snippets / Highlights	Show matching content in context	Highlight query terms in 2-3 sentence excerpts
Related queries	Help users refine or explore	"People also searched for..." based on co-occurrence
Source badges	Indicate content origin	Confluence icon, Slack icon, etc.
Freshness indicator	Show content age	"Updated 2 days ago" vs. "Updated 2 years ago"
"Did you mean?"	Handle typos gracefully	Only suggest when confidence >80%

Output Template: Enterprise Search Requirements Document

# Enterprise Search Requirements — [Project Name]

## Current State
- **Content sources:** [list with estimated volumes]
- **Current search tools:** [what people use today]
- **Top pain points:** [from user interviews]

## Architecture Decision
- **Pattern:** [Federated / Centralized / Hybrid]
- **Rationale:** [why this pattern]
- **Search platform:** [Elasticsearch, Typesense, Algolia, Vespa, etc.]

## Scope (Phase 1)
- **Sources to index:** [list with priority]
- **Content types:** [documents, conversations, code, tickets]
- **Users:** [target audience and access model]

## Relevance Model
- **Scoring factors:** [weights per factor]
- **Field boosting:** [title, tags, headings, body]
- **Freshness decay:** [function and parameters]

## Quality Targets
| Metric | Baseline | Target |
|--------|----------|--------|
| MRR | [current] | [goal] |
| Zero-result rate | [current] | <5% |
| p95 latency | [current] | <1s |

## Roadmap
- Phase 1: [Core sources, basic search] — [timeline]
- Phase 2: [Additional sources, relevance tuning] — [timeline]
- Phase 3: [Personalization, AI-powered features] — [timeline]

Common Mistakes

Indexing everything without curation — more content does not mean better search; noisy sources dilute quality
Ignoring access control — leaking confidential documents through search is a security incident
No freshness weighting — returning 3-year-old docs before this week's update frustrates users
Not measuring search quality — if you don't measure MRR/NDCG, you can't improve
Building search without user research — understand what people actually search for before designing the system
Treating search as a one-time project — relevance tuning is ongoing; plan for continuous improvement

Additional Resources

Related skills: hybrid-search-implementation (ai-ml — technical implementation), similarity-search-patterns (ai-ml — vector search)
Elasticsearch / OpenSearch — open-source search engines
Algolia — managed search platform
Vespa — open-source search and recommendation engine