Scrape new articles from Milan Jovanovic's blog (November 2025+). Optimized: pre-filters from the listing page and only scrapes new articles.
# Scrape Milan Jovanovic Blog Posts

Scrape new articles from Milan Jovanovic's .NET blog with optimized pre-filtering. Parses dates from the listing page to avoid unnecessary per-article scraping.
## Arguments

- `--force`: Re-scrape all articles (compare content hash to skip unchanged)
- `--since YYYY-MM-DD`: Custom date filter (default: `2025-11-01`)
- `--limit N`: Limit the number of articles (for testing)
- `--dry-run`: Preview what would be scraped without saving
## Optimized Workflow

### Step 1: Invoke Skill

Invoke the `milan-jovanovic:milan-jovanovic-blog` skill to load context and access scripts.
### Step 2: Pre-Filter from Listing Page (Optimization)

Key efficiency optimization: parse dates from the listing page **before** scraping individual articles.
1. Scrape the blog listing page using `firecrawl_scrape`:
   - URL: `https://www.milanjovanovic.tech/blog`
   - Format: markdown
2. Save the listing content to a temp file (e.g., `.claude/temp/milan-listing.md`).
3. Run the pre-filter script to identify articles that need scraping:

   ```bash
   # Normal mode - only new articles
   python scripts/core/check_new_articles.py .claude/temp/milan-listing.md --json --since 2025-11-01

   # Force mode - include existing articles for re-check
   python scripts/core/check_new_articles.py .claude/temp/milan-listing.md --json --force --since 2025-11-01
   ```

4. Parse the JSON output to get the `to_scrape` list. If it is empty, skip to Step 5 (no scraping needed).
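Consuming the pre-filter output can be sketched as below. The JSON shape shown here is an assumption for illustration; verify the actual keys against `check_new_articles.py` before relying on them.

```python
import json

# Hypothetical shape of the --json output; the real schema is
# defined by check_new_articles.py and may differ.
raw = """
{
  "to_scrape": [
    {"url": "https://www.milanjovanovic.tech/blog/example-post",
     "slug": "example-post",
     "in_index": false,
     "content_hash": null}
  ],
  "no_date": []
}
"""

result = json.loads(raw)
to_scrape = result.get("to_scrape", [])

if not to_scrape:
    print("Nothing to scrape - skip to Step 5")
else:
    for article in to_scrape:
        # in_index distinguishes brand-new articles from force re-checks
        mode = "force re-check" if article["in_index"] else "new"
        print(f"{mode}: {article['url']}")
```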
### Step 3: Scrape Only Needed Articles

For each article in `to_scrape`:

- For articles with `in_index: false` (new):
  - Scrape the full article with `firecrawl_scrape`
  - Extract the publication date from metadata
  - Clean promotional content
  - Save to `canonical/milanjovanovic-tech/blog/{slug}.md`
- For articles with `in_index: true` (force mode re-check):
  - Scrape the full article with `firecrawl_scrape`
  - Clean promotional content
  - Generate a content hash
  - Compare it to `content_hash` from the pre-filter output
  - If unchanged, skip writing (log as "skipped - unchanged")
  - If changed, save the updated content
### Step 4: Update Index

After scraping completes:

```bash
python scripts/management/refresh_index.py
```
### Step 5: Report Statistics
Report:
- Articles found on listing page
- Articles needing scraping (new + force re-check)
- Articles skipped (already indexed, not in force mode)
- Articles skipped (unchanged content hash, force mode)
- Articles filtered (before cutoff date)
- Any errors
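One minimal way to accumulate and print these counters (the key names are placeholders mirroring the bullets above, not part of the skill's scripts):

```python
from collections import Counter

# Placeholder values; in the real workflow each counter is
# incremented as articles are processed.
stats = Counter()
stats["found_on_listing"] = 10
stats["scraped_new"] = 1
stats["skipped_already_indexed"] = 8
stats["skipped_unchanged_hash"] = 0
stats["filtered_before_cutoff"] = 1
stats["errors"] = 0

for name, count in stats.items():
    print(f"{name}: {count}")
```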
## Content Cleanup Patterns
The scraper removes these promotional patterns:
Footer patterns (stop processing):
- "Whenever you're ready, there are X ways I can help you"
- "Become a Better .NET Software Engineer"
- "Hi, I'm Milan"
Sponsor patterns (remove section):
- AuthKit/WorkOS mentions
- "Sponsor this newsletter" links
- Incident response sponsor content
Inline patterns (remove):
- Reading time ("5 min read")
- "Manage read history" links
- Empty image placeholders
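The cleanup pass above can be sketched as a line filter. These regexes are illustrative assumptions; the authoritative pattern lists live in the skill's scripts, and sponsor-section removal is omitted here for brevity.

```python
import re

# Illustrative patterns only - see the skill's scripts for the real
# lists. Footer patterns stop processing entirely; inline patterns
# drop individual lines.
FOOTER_PATTERNS = [
    r"Whenever you're ready, there are \d+ ways I can help you",
    r"Become a Better \.NET Software Engineer",
    r"Hi, I'm Milan",
]
INLINE_PATTERNS = [
    r"\b\d+ min read\b",
    r"Manage read history",
]

def clean_article(markdown: str) -> str:
    kept = []
    for line in markdown.splitlines():
        if any(re.search(p, line) for p in FOOTER_PATTERNS):
            break  # everything after a footer marker is promotional
        if any(re.search(p, line) for p in INLINE_PATTERNS):
            continue  # strip inline promo fragments
        kept.append(line)
    return "\n".join(kept).strip()
```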
## Efficiency Gains
| Scenario | Without Optimization | With Optimization |
|---|---|---|
| No new articles | 10+ firecrawl requests | 1-2 requests |
| 1 new article | 10+ firecrawl requests | 2-3 requests |
| Force (unchanged) | 10+ requests | 10+ requests but skips writes |
Why this matters: Firecrawl has API costs and rate limits. Pre-filtering saves 80-90% of requests when articles haven't changed.
## Example Usage

```
/milan-jovanovic:scrape-posts
/milan-jovanovic:scrape-posts --limit 3 --dry-run
/milan-jovanovic:scrape-posts --force
/milan-jovanovic:scrape-posts --since 2025-12-01
```
## Troubleshooting

### Firecrawl Not Available

If the firecrawl MCP is not connected, the command will fail. Ensure the firecrawl MCP server is configured and running.
### Date Parsing Issues

If listing page dates can't be parsed, the script logs them in the `no_date` category. These articles are skipped unless you provide a specific URL.
### Pre-Filter Shows 0 Articles

If `check_new_articles.py` shows 0 articles to scrape:
- All articles are already indexed (use `--force` to re-check)
- All articles are before the cutoff date (adjust `--since`)
- The listing page format changed (check the regex patterns in the script)