From firecrawl-pack
Selects and implements Firecrawl architectures for scraping: on-demand for single pages, scheduled pipelines for monitoring, real-time for high-volume RAG. Includes decision matrix and TypeScript examples.
How this skill is triggered — by the user, by Claude, or both
Slash command
/firecrawl-pack:firecrawl-architecture-variantsThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Three deployment architectures for Firecrawl at different scales: on-demand scraping for simple use cases, scheduled crawl pipelines for content monitoring, and real-time ingestion pipelines for AI/RAG applications. Choose based on volume, latency requirements, and cost budget.
Three deployment architectures for Firecrawl at different scales: on-demand scraping for simple use cases, scheduled crawl pipelines for content monitoring, and real-time ingestion pipelines for AI/RAG applications. Choose based on volume, latency requirements, and cost budget.
| Factor | On-Demand | Scheduled Pipeline | Real-Time Pipeline |
|---|---|---|---|
| Volume | < 500/day | 500-10K/day | 10K+/day |
| Latency | Sync (2-10s) | Async (hours) | Async (minutes) |
| Use Case | Single page lookup | Site monitoring | Knowledge base, RAG |
| Credit Control | Per-request | Per-crawl budget | Credit pipeline |
| Complexity | Low | Medium | High |
User Request → Backend API → firecrawl.scrapeUrl → Clean Content → Response
Best for: chatbots, content preview, single-page extraction.
import FirecrawlApp from "@mendable/firecrawl-js";
const firecrawl = new FirecrawlApp({
apiKey: process.env.FIRECRAWL_API_KEY!,
});
// Simple API endpoint
app.post("/api/scrape", async (req, res) => {
const { url } = req.body;
const result = await firecrawl.scrapeUrl(url, {
formats: ["markdown"],
onlyMainContent: true,
waitFor: 3000,
});
res.json({
title: result.metadata?.title,
content: result.markdown,
url: result.metadata?.sourceURL,
});
});
// With LLM extraction
app.post("/api/extract", async (req, res) => {
const { url, schema } = req.body;
const result = await firecrawl.scrapeUrl(url, {
formats: ["extract"],
extract: { schema },
});
res.json({ data: result.extract });
});
Scheduler (cron) → Crawl Queue → firecrawl.asyncCrawlUrl → Result Store
│
▼
Content Processor → Search Index
Best for: documentation monitoring, content indexing, competitive analysis.
import cron from "node-cron";
interface CrawlTarget {
id: string;
url: string;
maxPages: number;
paths?: string[];
schedule: string; // cron expression
}
const targets: CrawlTarget[] = [
{ id: "docs", url: "https://docs.example.com", maxPages: 100, paths: ["/docs/*"], schedule: "0 2 * * *" },
{ id: "blog", url: "https://blog.example.com", maxPages: 50, schedule: "0 4 * * 1" },
];
// Schedule crawls
for (const target of targets) {
cron.schedule(target.schedule, async () => {
console.log(`Starting scheduled crawl: ${target.id}`);
const job = await firecrawl.asyncCrawlUrl(target.url, {
limit: target.maxPages,
includePaths: target.paths,
scrapeOptions: { formats: ["markdown"], onlyMainContent: true },
});
await db.saveCrawlJob({ targetId: target.id, jobId: job.id, startedAt: new Date() });
});
}
// Separate worker polls for results
async function processPendingCrawls() {
const pending = await db.getPendingCrawlJobs();
for (const job of pending) {
const status = await firecrawl.checkCrawlStatus(job.jobId);
if (status.status === "completed") {
await indexPages(job.targetId, status.data || []);
await db.markComplete(job.id, status.data?.length || 0);
console.log(`Crawl ${job.targetId} complete: ${status.data?.length} pages indexed`);
}
}
}
setInterval(processPendingCrawls, 30000);
URL Sources → Priority Queue → Firecrawl Workers → Content Validation
│
▼
Vector DB + Search Index
│
▼
RAG / AI Pipeline
Best for: AI training data, knowledge base, enterprise content platform.
import PQueue from "p-queue";
class ContentPipeline {
private queue: PQueue;
private firecrawl: FirecrawlApp;
private creditBudget: number;
private creditsUsed = 0;
constructor(concurrency = 5, dailyBudget = 10000) {
this.queue = new PQueue({ concurrency, interval: 1000, intervalCap: 10 });
this.firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });
this.creditBudget = dailyBudget;
}
async ingest(urls: string[]) {
if (this.creditsUsed + urls.length > this.creditBudget) {
throw new Error("Daily credit budget exceeded");
}
// Use batch scrape for efficiency
const result = await this.queue.add(() =>
this.firecrawl.batchScrapeUrls(urls, {
formats: ["markdown"],
onlyMainContent: true,
})
);
this.creditsUsed += urls.length;
// Validate and process
const pages = (result?.data || []).filter(page => {
const md = page.markdown || "";
return md.length > 100 && !/captcha|access denied/i.test(md);
});
// Store in vector DB
for (const page of pages) {
await vectorStore.upsert({
id: page.metadata?.sourceURL,
content: page.markdown,
metadata: { title: page.metadata?.title, url: page.metadata?.sourceURL },
});
}
return { ingested: pages.length, rejected: urls.length - pages.length };
}
async discover(siteUrl: string, pathFilter: string) {
const map = await this.firecrawl.mapUrl(siteUrl);
return (map.links || []).filter(url => url.includes(pathFilter));
}
}
// Usage
const pipeline = new ContentPipeline(5, 10000);
const urls = await pipeline.discover("https://docs.example.com", "/api/");
const result = await pipeline.ingest(urls.slice(0, 100));
console.log(`Ingested ${result.ingested} pages into vector store`);
Need real-time, user-facing response?
├── YES → On-Demand (Architecture 1)
└── NO → How many pages/day?
├── < 500 → On-Demand with caching
├── 500-10K → Scheduled Pipeline (Architecture 2)
└── 10K+ → Real-Time Pipeline (Architecture 3)
| Issue | Cause | Solution |
|---|---|---|
| Slow on-demand response | JS-heavy target page | Add caching layer, reduce waitFor |
| Stale indexed content | Crawl schedule too infrequent | Increase frequency for critical sources |
| Credit overrun | Pipeline ingesting too aggressively | Implement daily budget with hard cap |
| Duplicate content | Re-crawling same pages | Deduplicate by content hash before indexing |
For common pitfalls, see firecrawl-known-pitfalls.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin firecrawl-packImplements Firecrawl reference architecture using scrape/crawl/map/extract pipelines for web content ingestion in AI/RAG applications.
Automates web crawling and data extraction using Firecrawl: scrape pages, crawl sites, extract structured data with AI, batch URLs, and map site structures.
Scrapes single pages or crawls sites using Firecrawl v2.5 API to LLM-ready markdown and structured data. Handles JS rendering, bot bypass, browser automation for dynamic content extraction.