Skill

scrape-posts

Install

Install the plugin:

$ npx claudepluginhub melodic-software/claude-code-plugins --plugin milan-jovanovic

Description

Scrape new articles from Milan Jovanovic's blog (November 2025 onward). Optimized: pre-filters using the dates on the listing page and scrapes only new articles.

Tool Access

This skill is limited to using the following tools:

  • Read
  • Bash
  • Skill
  • mcp__firecrawl__firecrawl_scrape
  • mcp__firecrawl__firecrawl_map
  • mcp__firecrawl__firecrawl_search

Skill Content

Scrape Milan Jovanovic Blog Posts

Scrape new articles from Milan Jovanovic's .NET blog with optimized pre-filtering. Parses dates from listing page to avoid unnecessary per-article scraping.

Arguments

  • --force: Re-scrape all articles, including already-indexed ones (content hashes are compared so unchanged articles are not rewritten)
  • --since YYYY-MM-DD: Custom date filter (default: 2025-11-01)
  • --limit N: Limit number of articles (for testing)
  • --dry-run: Preview what would be scraped without saving
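
The four flags above could be modelled as follows. This is a minimal sketch, not the skill's actual argument handling: the `parse_args` helper and the `scrape-posts` program name are hypothetical, and only the flag names and the 2025-11-01 default come from the list above.

```python
import argparse
from datetime import date


def parse_args(argv):
    """Parse the documented flags. Hypothetical helper for illustration only."""
    parser = argparse.ArgumentParser(prog="scrape-posts")
    parser.add_argument("--force", action="store_true",
                        help="re-scrape all articles, skipping unchanged content")
    # date.fromisoformat rejects anything that is not YYYY-MM-DD
    parser.add_argument("--since", type=date.fromisoformat,
                        default=date(2025, 11, 1),
                        help="custom date filter (YYYY-MM-DD)")
    parser.add_argument("--limit", type=int, default=None,
                        help="limit number of articles (for testing)")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview what would be scraped without saving")
    return parser.parse_args(argv)


args = parse_args(["--limit", "3", "--dry-run"])
print(args.since, args.limit, args.dry_run, args.force)
```

Using `date.fromisoformat` as the `type` means a malformed `--since` value is rejected at parse time rather than surfacing later as a filtering bug.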

Optimized Workflow

Step 1: Invoke Skill

Invoke the milan-jovanovic:milan-jovanovic-blog skill to load context and access scripts.

Step 2: Pre-Filter from Listing Page (OPTIMIZATION)

Key efficiency optimization: Parse dates from listing page BEFORE scraping individual articles.

  1. Scrape the blog listing page using firecrawl_scrape:

    URL: https://www.milanjovanovic.tech/blog
    Format: markdown
    
  2. Save listing content to temp file (e.g., .claude/temp/milan-listing.md)

  3. Run pre-filter script to identify articles needing scraping:

    # Normal mode - only new articles
    python scripts/core/check_new_articles.py .claude/temp/milan-listing.md --json --since 2025-11-01
    
    # Force mode - include existing for re-check
    python scripts/core/check_new_articles.py .claude/temp/milan-listing.md --json --force --since 2025-11-01
    
  4. Parse JSON output to get to_scrape list. If empty, skip to Step 5 (no scraping needed).
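
Item 4 above can be sketched as below. The exact JSON schema of check_new_articles.py is not shown here, so the sample payload is an assumption: only the `to_scrape`, `in_index`, and `content_hash` names appear in this document, while the `url` key, the example URLs, and the `split_to_scrape` helper are hypothetical.

```python
import json

# Hypothetical sample of the pre-filter's --json output (schema assumed).
sample_output = """
{
  "to_scrape": [
    {"url": "https://www.milanjovanovic.tech/blog/example-a", "in_index": false},
    {"url": "https://www.milanjovanovic.tech/blog/example-b", "in_index": true,
     "content_hash": "abc123"}
  ]
}
"""


def split_to_scrape(raw_json):
    """Return (new_articles, recheck_articles) from the pre-filter output."""
    data = json.loads(raw_json)
    to_scrape = data.get("to_scrape", [])
    new_articles = [a for a in to_scrape if not a.get("in_index")]
    recheck_articles = [a for a in to_scrape if a.get("in_index")]
    return new_articles, recheck_articles


new_articles, recheck_articles = split_to_scrape(sample_output)
if not new_articles and not recheck_articles:
    print("Nothing to scrape; skip to Step 5")
else:
    print(f"{len(new_articles)} new, {len(recheck_articles)} to re-check")
```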

Step 3: Scrape Only Needed Articles

For each article in to_scrape:

  1. For articles with in_index: false (new):

    • Scrape full article with firecrawl_scrape
    • Extract publication date from metadata
    • Clean promotional content
    • Save to canonical/milanjovanovic-tech/blog/{slug}.md
  2. For articles with in_index: true (force mode re-check):

    • Scrape full article with firecrawl_scrape
    • Clean promotional content
    • Generate content hash
    • Compare to content_hash from pre-filter output
    • If unchanged, skip writing (log as "skipped - unchanged")
    • If changed, save updated content
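
The force-mode comparison can be sketched as follows. The document does not say which hash function the scripts use, so SHA-256 over the cleaned text is an assumption, and the `content_hash_of` and `should_write` helpers are hypothetical names.

```python
import hashlib


def content_hash_of(cleaned_text):
    """Hash the cleaned article text. SHA-256 is an assumed choice."""
    return hashlib.sha256(cleaned_text.encode("utf-8")).hexdigest()


def should_write(cleaned_text, previous_hash):
    """Write when the article is new (no previous hash) or its content changed."""
    if previous_hash is None:
        return True
    return content_hash_of(cleaned_text) != previous_hash


old = content_hash_of("Original body")
print(should_write("Original body", old))  # unchanged, so the write is skipped
print(should_write("Edited body", old))    # changed, so the content is saved
```

Hashing after promotional content has been cleaned, as the step order above implies, keeps rotating sponsor blocks from registering as content changes.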

Step 4: Update Index

After scraping completes:

python scripts/management/refresh_index.py

Step 5: Report Statistics

Report:

  • Articles found on listing page
  • Articles needing scraping (new + force re-check)
  • Articles skipped (already indexed, not in force mode)
  • Articles skipped (unchanged content hash, force mode)
  • Articles filtered (before cutoff date)
  • Any errors

Content Cleanup Patterns

The scraper removes these promotional patterns:

Footer patterns (stop processing):

  • "Whenever you're ready, there are X ways I can help you"
  • "Become a Better .NET Software Engineer"
  • "Hi, I'm Milan"

Sponsor patterns (remove section):

  • AuthKit/WorkOS mentions
  • "Sponsor this newsletter" links
  • Incident response sponsor content

Inline patterns (remove):

  • Reading time ("5 min read")
  • "Manage read history" links
  • Empty image placeholders
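
A minimal sketch of the footer and inline rules above, not the scraper's actual implementation. The regular expressions are assumptions derived from the quoted phrases, the `![]()` form of an empty image placeholder is a guess, and sponsor-section removal is left out because the section boundaries are not described here.

```python
import re

# Footer markers: everything from the first match onward is dropped.
FOOTER_MARKERS = [
    re.compile(r"Whenever you['’]re ready, there are \w+ ways I can help you"),
    re.compile(r"Become a Better \.NET Software Engineer"),
    re.compile(r"Hi, I['’]m Milan"),
]
# Inline patterns: matching lines are removed individually.
READING_TIME = re.compile(r"^\s*\d+\s+min read\s*$")
EMPTY_IMAGE = re.compile(r"^\s*!\[\]\(\s*\)\s*$")  # assumed placeholder shape


def clean_article(markdown):
    """Strip inline noise and truncate at the first footer marker."""
    kept = []
    for line in markdown.splitlines():
        if any(p.search(line) for p in FOOTER_MARKERS):
            break  # footer reached: stop processing
        if READING_TIME.match(line) or EMPTY_IMAGE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).rstrip() + "\n"


sample = "# Title\n5 min read\nBody text.\n![]()\nHi, I'm Milan\nPromo\n"
print(clean_article(sample))
```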

Efficiency Gains

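| Scenario          | Without Optimization   | With Optimization             |
|-------------------|------------------------|-------------------------------|
| No new articles   | 10+ firecrawl requests | 1-2 requests                  |
| 1 new article     | 10+ firecrawl requests | 2-3 requests                  |
| Force (unchanged) | 10+ requests           | 10+ requests but skips writes |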

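Why this matters: Firecrawl has API costs and rate limits. Pre-filtering cuts requests by roughly 80-90% when the listing page shows no new articles (from 10+ requests down to 1-2). In force mode the request count stays the same; only the writes are skipped.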
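
The 80-90% figure follows from the table's request counts. A quick check, taking 10 requests as the unoptimized baseline (the table says "10+", so the real saving can be higher); `savings_percent` is a hypothetical helper:

```python
def savings_percent(without, with_):
    """Percentage of requests avoided, rounded to a whole number."""
    return round(100 * (without - with_) / without)


# No new articles: 10 requests reduced to 1 or 2
print(savings_percent(10, 1), savings_percent(10, 2))  # 90 80
# One new article: 10 requests reduced to 2 or 3
print(savings_percent(10, 2), savings_percent(10, 3))  # 80 70
```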

Example Usage

/milan-jovanovic:scrape-posts
/milan-jovanovic:scrape-posts --limit 3 --dry-run
/milan-jovanovic:scrape-posts --force
/milan-jovanovic:scrape-posts --since 2025-12-01

Troubleshooting

Firecrawl Not Available

If firecrawl MCP is not connected, the command will fail. Ensure the firecrawl MCP server is configured and running.

Date Parsing Issues

If a date on the listing page can't be parsed, the script lists the article under the no_date category. These articles are skipped unless you provide a specific URL.
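
How unparseable dates might end up in a separate bucket can be sketched like this. The date formats the listing page actually uses are not documented here, so `ASSUMED_FORMATS` is a guess; only the `no_date` name comes from this document, and the `candidates` and `filtered` bucket names, both helpers, and the inclusive `--since` comparison are assumptions.

```python
from datetime import date, datetime

# Assumed formats; the real listing page may use something else.
ASSUMED_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%Y-%m-%d")


def parse_listing_date(text):
    """Return a date, or None when no assumed format matches."""
    for fmt in ASSUMED_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def bucket_by_date(entries, since):
    """Sort (slug, raw_date) pairs into candidates, filtered, and no_date."""
    buckets = {"candidates": [], "filtered": [], "no_date": []}
    for slug, raw_date in entries:
        parsed = parse_listing_date(raw_date)
        if parsed is None:
            buckets["no_date"].append(slug)      # unparseable: reported, not scraped
        elif parsed >= since:                    # cutoff treated as inclusive
            buckets["candidates"].append(slug)
        else:
            buckets["filtered"].append(slug)     # before the cutoff date
    return buckets


entries = [("a", "November 15, 2025"), ("b", "October 3, 2025"), ("c", "sometime")]
print(bucket_by_date(entries, date(2025, 11, 1)))
```

If many articles land in `no_date`, that usually points to a listing format the patterns do not cover rather than to genuinely undated posts.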

Pre-Filter Shows 0 Articles

If check_new_articles.py shows 0 articles to scrape:

  • All articles are already indexed (use --force to re-check)
  • All articles are before the cutoff date (adjust --since)
  • Listing page format changed (check regex patterns in script)

Stats

  • Stars: 40
  • Forks: 6
  • Last commit: Mar 17, 2026