Three deployment architectures for Firecrawl at different scales: on-demand scraping for simple use cases, scheduled crawl pipelines for content monitoring, and real-time ingestion pipelines for AI/RAG applications. Choose based on volume, latency requirements, and cost budget.
| Factor | On-Demand | Scheduled Pipeline | Real-Time Pipeline |
|---|---|---|---|
| Volume (pages/day) | < 500 | 500-10K | 10K+ |
| Latency | Sync (2-10s) | Async (hours) | Async (minutes) |
| Use Case | Single page lookup | Site monitoring | Knowledge base, RAG |
| Credit Control | Per-request | Per-crawl budget | Credit pipeline |
| Complexity | Low | Medium | High |
Architecture 1: On-Demand Scraping

User Request → Backend API → firecrawl.scrapeUrl → Clean Content → Response

Best for: chatbots, content previews, and single-page extraction.
import express from "express";
import FirecrawlApp from "@mendable/firecrawl-js";

const app = express();
app.use(express.json());

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY!,
});

// Simple synchronous scrape endpoint
app.post("/api/scrape", async (req, res) => {
  const { url } = req.body;
  const result = await firecrawl.scrapeUrl(url, {
    formats: ["markdown"],
    onlyMainContent: true, // strip nav, footers, and boilerplate
    waitFor: 3000, // give JS-rendered content time to settle
  });
  if (!result.success) {
    return res.status(502).json({ error: result.error });
  }
  res.json({
    title: result.metadata?.title,
    content: result.markdown,
    url: result.metadata?.sourceURL,
  });
});

// With LLM extraction against a caller-supplied JSON schema
app.post("/api/extract", async (req, res) => {
  const { url, schema } = req.body;
  const result = await firecrawl.scrapeUrl(url, {
    formats: ["extract"],
    extract: { schema },
  });
  if (!result.success) {
    return res.status(502).json({ error: result.error });
  }
  res.json({ data: result.extract });
});
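The decision flow below recommends "On-Demand with caching" under 500 pages/day. Here is a minimal sketch of a TTL cache in front of scrapeUrl, reusing the client above; the in-memory Map and the 15-minute CACHE_TTL_MS are illustrative assumptions (use Redis or similar for multi-process deployments):

// Minimal in-memory TTL cache in front of scrapeUrl (sketch).
// Assumes a single process; swap the Map for Redis in production.
const CACHE_TTL_MS = 15 * 60 * 1000; // illustrative 15-minute freshness window
const cache = new Map<string, { content: string; fetchedAt: number }>();

async function cachedScrape(url: string): Promise<string> {
  const hit = cache.get(url);
  if (hit && Date.now() - hit.fetchedAt < CACHE_TTL_MS) {
    return hit.content; // cache hit: no Firecrawl credits spent
  }
  const result = await firecrawl.scrapeUrl(url, {
    formats: ["markdown"],
    onlyMainContent: true,
  });
  if (!result.success) throw new Error(result.error);
  cache.set(url, { content: result.markdown ?? "", fetchedAt: Date.now() });
  return result.markdown ?? "";
}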
Architecture 2: Scheduled Crawl Pipeline

Scheduler (cron) → Crawl Queue → firecrawl.asyncCrawlUrl → Result Store
                                                                │
                                                                ▼
                                              Content Processor → Search Index

Best for: documentation monitoring, content indexing, and competitive analysis.
import cron from "node-cron";

interface CrawlTarget {
  id: string;
  url: string;
  maxPages: number;
  paths?: string[];
  schedule: string; // cron expression
}

const targets: CrawlTarget[] = [
  { id: "docs", url: "https://docs.example.com", maxPages: 100, paths: ["/docs/*"], schedule: "0 2 * * *" },
  { id: "blog", url: "https://blog.example.com", maxPages: 50, schedule: "0 4 * * 1" },
];

// Schedule crawls. `db` is an application-level job store.
for (const target of targets) {
  cron.schedule(target.schedule, async () => {
    console.log(`Starting scheduled crawl: ${target.id}`);
    const job = await firecrawl.asyncCrawlUrl(target.url, {
      limit: target.maxPages,
      includePaths: target.paths,
      scrapeOptions: { formats: ["markdown"], onlyMainContent: true },
    });
    if (!job.success) {
      console.error(`Failed to start crawl ${target.id}: ${job.error}`);
      return;
    }
    await db.saveCrawlJob({ targetId: target.id, jobId: job.id, startedAt: new Date() });
  });
}

// Separate worker polls for results
async function processPendingCrawls() {
  const pending = await db.getPendingCrawlJobs();
  for (const job of pending) {
    const status = await firecrawl.checkCrawlStatus(job.jobId);
    if (status.status === "completed") {
      await indexPages(job.targetId, status.data || []);
      await db.markComplete(job.id, status.data?.length || 0);
      console.log(`Crawl ${job.targetId} complete: ${status.data?.length} pages indexed`);
    } else if (status.status === "failed") {
      await db.markFailed(job.id);
      console.error(`Crawl ${job.targetId} failed`);
    }
  }
}
setInterval(processPendingCrawls, 30_000); // poll every 30 seconds
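The troubleshooting table below recommends deduplicating by content hash before indexing. A minimal sketch of the indexPages helper used above taking that approach; db.hasContentHash and searchIndex.upsert are illustrative application-level stand-ins, not Firecrawl APIs:

import { createHash } from "node:crypto";

// Sketch of indexPages with content-hash deduplication, so unchanged
// pages from a re-crawl are skipped rather than re-indexed.
async function indexPages(targetId: string, pages: any[]) {
  for (const page of pages) {
    const hash = createHash("sha256").update(page.markdown ?? "").digest("hex");
    if (await db.hasContentHash(hash)) continue; // unchanged since last crawl
    await searchIndex.upsert({
      id: page.metadata?.sourceURL,
      targetId,
      contentHash: hash,
      title: page.metadata?.title,
      content: page.markdown,
    });
  }
}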
Architecture 3: Real-Time Ingestion Pipeline

URL Sources → Priority Queue → Firecrawl Workers → Content Validation
                                                           │
                                                           ▼
                                              Vector DB + Search Index
                                                           │
                                                           ▼
                                                   RAG / AI Pipeline

Best for: AI training data, knowledge bases, and enterprise content platforms.
import PQueue from "p-queue";

class ContentPipeline {
  private queue: PQueue;
  private firecrawl: FirecrawlApp;
  private creditBudget: number;
  private creditsUsed = 0;

  constructor(concurrency = 5, dailyBudget = 10000) {
    // Cap throughput at 10 requests/sec across `concurrency` workers
    this.queue = new PQueue({ concurrency, interval: 1000, intervalCap: 10 });
    this.firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });
    this.creditBudget = dailyBudget;
  }

  async ingest(urls: string[]) {
    // Approximate cost as one credit per URL and enforce a hard cap
    if (this.creditsUsed + urls.length > this.creditBudget) {
      throw new Error("Daily credit budget exceeded");
    }
    // Use batch scrape for efficiency
    const result = await this.queue.add(() =>
      this.firecrawl.batchScrapeUrls(urls, {
        formats: ["markdown"],
        onlyMainContent: true,
      })
    );
    this.creditsUsed += urls.length;
    // Validate: drop near-empty pages and obvious block/captcha pages
    const pages = (result?.data || []).filter(page => {
      const md = page.markdown || "";
      return md.length > 100 && !/captcha|access denied/i.test(md);
    });
    // Store in vector DB (`vectorStore` is an application-level client)
    for (const page of pages) {
      await vectorStore.upsert({
        id: page.metadata?.sourceURL,
        content: page.markdown,
        metadata: { title: page.metadata?.title, url: page.metadata?.sourceURL },
      });
    }
    return { ingested: pages.length, rejected: urls.length - pages.length };
  }

  // Call once a day (e.g. from cron) so the budget is per-day, not lifetime
  resetBudget() {
    this.creditsUsed = 0;
  }

  // Discover candidate URLs with the fast map endpoint before scraping
  async discover(siteUrl: string, pathFilter: string) {
    const map = await this.firecrawl.mapUrl(siteUrl);
    return (map.links || []).filter(url => url.includes(pathFilter));
  }
}
// Usage
const pipeline = new ContentPipeline(5, 10000);
const urls = await pipeline.discover("https://docs.example.com", "/api/");
const result = await pipeline.ingest(urls.slice(0, 100));
console.log(`Ingested ${result.ingested} pages into vector store`);
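Without a periodic reset, the "daily" budget would act as a lifetime cap. A one-line sketch wiring resetBudget to midnight with node-cron (imported in the scheduled-pipeline example above):

// Give each day a fresh credit budget
cron.schedule("0 0 * * *", () => pipeline.resetBudget());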
Decision flow:

Need real-time, user-facing response?
├── YES → On-Demand (Architecture 1)
└── NO → How many pages/day?
├── < 500 → On-Demand with caching
├── 500-10K → Scheduled Pipeline (Architecture 2)
└── 10K+ → Real-Time Pipeline (Architecture 3)
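If you want the same flow in code, here is a small helper encoding the tree and the matrix thresholds (the function name and return labels are illustrative):

type Architecture = "on-demand" | "on-demand-cached" | "scheduled" | "real-time";

// Mirrors the decision tree above
function chooseArchitecture(pagesPerDay: number, userFacing: boolean): Architecture {
  if (userFacing) return "on-demand"; // Architecture 1
  if (pagesPerDay < 500) return "on-demand-cached";
  if (pagesPerDay <= 10_000) return "scheduled"; // Architecture 2
  return "real-time"; // Architecture 3
}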
Troubleshooting:

| Issue | Cause | Solution |
|---|---|---|
| Slow on-demand response | JS-heavy target page | Add caching layer, reduce waitFor |
| Stale indexed content | Crawl schedule too infrequent | Increase frequency for critical sources |
| Credit overrun | Pipeline ingesting too aggressively | Implement daily budget with hard cap |
| Duplicate content | Re-crawling same pages | Deduplicate by content hash before indexing |
For common pitfalls, see firecrawl-known-pitfalls.