Skill

scrape-posts

Scrapes new articles from Milan Jovanovic's .NET blog (from November 2025 onward by default). It pre-filters by date on the listing page, scrapes with Firecrawl, and uses Python scripts to target only new or changed content.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/milan-jovanovic:scrape-posts

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Bash, Skill, mcp__firecrawl__firecrawl_scrape, mcp__firecrawl__firecrawl_map, mcp__firecrawl__firecrawl_search

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Scrape new articles from Milan Jovanovic's .NET blog with **optimized pre-filtering**. Parses dates from listing page to avoid unnecessary per-article scraping.

SKILL.md

145 lines · ~1.2k tokens

Stats

Language: Python

Parent stars: 67

Parent forks: 10

Maintenance: Excellent

Last Commit: Mar 17, 2026

Scrape Milan Jovanovic Blog Posts

Scrape new articles from Milan Jovanovic's .NET blog with optimized pre-filtering. Dates are parsed from the listing page to avoid unnecessary per-article scraping.

Arguments

--force: Re-scrape all articles (compare content hash to skip unchanged)
--since YYYY-MM-DD: Custom date filter (default: 2025-11-01)
--limit N: Limit number of articles (for testing)
--dry-run: Preview what would be scraped without saving
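
To make the argument semantics concrete, here is a minimal sketch of how the four flags could be modelled with argparse. The skill itself is a slash command, so this is only an illustration: `build_parser` is a hypothetical helper, and only the flag names and the 2025-11-01 default come from the list above.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="scrape-posts")
    parser.add_argument("--force", action="store_true",
                        help="re-check already indexed articles via content hash")
    parser.add_argument("--since", type=date.fromisoformat,
                        default=date(2025, 11, 1), metavar="YYYY-MM-DD",
                        help="cutoff date for articles")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="maximum number of articles (for testing)")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview without saving")
    return parser


# Equivalent of: /milan-jovanovic:scrape-posts --limit 3 --dry-run
args = build_parser().parse_args(["--limit", "3", "--dry-run"])
print(args.force, args.since, args.limit, args.dry_run)
```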

Optimized Workflow

Step 1: Invoke Skill

Invoke the milan-jovanovic:milan-jovanovic-blog skill to load context and access scripts.

Step 2: Pre-Filter from Listing Page (OPTIMIZATION)

Key efficiency optimization: Parse dates from listing page BEFORE scraping individual articles.

Scrape the blog listing page using firecrawl_scrape:

URL: https://www.milanjovanovic.tech/blog
Format: markdown

Save listing content to temp file (e.g., .claude/temp/milan-listing.md)

Run pre-filter script to identify articles needing scraping:

# Normal mode - only new articles
python scripts/core/check_new_articles.py .claude/temp/milan-listing.md --json --since 2025-11-01

# Force mode - include existing for re-check
python scripts/core/check_new_articles.py .claude/temp/milan-listing.md --json --force --since 2025-11-01

Parse JSON output to get to_scrape list. If empty, skip to Step 5 (no scraping needed).
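
Handling the pre-filter output might look like the sketch below. It assumes the JSON has a top-level to_scrape list whose entries carry in_index and, for indexed articles, content_hash, as described in this workflow. The slug field and the `split_to_scrape` helper are illustrative assumptions, not the script's confirmed schema.

```python
import json


def split_to_scrape(prefilter_json: str):
    """Split the pre-filter output into new articles and force-mode re-checks."""
    data = json.loads(prefilter_json)
    to_scrape = data.get("to_scrape", [])
    new = [a for a in to_scrape if not a.get("in_index", False)]
    recheck = [a for a in to_scrape if a.get("in_index", False)]
    return new, recheck


# Hand-written sample input; real output from check_new_articles.py may differ.
sample = json.dumps({
    "to_scrape": [
        {"slug": "post-a", "in_index": False},
        {"slug": "post-b", "in_index": True, "content_hash": "abc123"},
    ]
})
new, recheck = split_to_scrape(sample)
if not new and not recheck:
    print("Nothing to scrape, skip to Step 5")
```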

Step 3: Scrape Only Needed Articles

For each article in to_scrape:

For articles with in_index: false (new):
- Scrape full article with firecrawl_scrape
- Extract publication date from metadata
- Clean promotional content
- Save to canonical/milanjovanovic-tech/blog/{slug}.md

For articles with in_index: true (force mode re-check):
- Scrape full article with firecrawl_scrape
- Clean promotional content
- Generate content hash
- Compare to content_hash from pre-filter output
- If unchanged, skip writing (log as "skipped - unchanged")
- If changed, save updated content
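
The force-mode comparison can be sketched as follows. The hash algorithm is not stated in this document, so SHA-256 over the cleaned UTF-8 text is an assumption, and `content_hash` and `should_write` are hypothetical helper names.

```python
import hashlib
from typing import Optional


def content_hash(cleaned_text: str) -> str:
    """Hash the cleaned article body (SHA-256 is an assumed choice)."""
    return hashlib.sha256(cleaned_text.encode("utf-8")).hexdigest()


def should_write(cleaned_text: str, previous_hash: Optional[str]) -> bool:
    """Write when the article is new or its cleaned content has changed."""
    if previous_hash is None:
        return True
    return content_hash(cleaned_text) != previous_hash


stored = content_hash("Article body")
print(should_write("Article body", stored))          # unchanged: skip the write
print(should_write("Article body, edited", stored))  # changed: save the update
```

Hashing after cleanup matters here: if the promotional blocks rotate between scrapes, hashing the raw page would flag every article as changed.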

Step 4: Update Index

After scraping completes:

python scripts/management/refresh_index.py

Step 5: Report Statistics

Report:

Articles found on listing page
Articles needing scraping (new + force re-check)
Articles skipped (already indexed, not in force mode)
Articles skipped (unchanged content hash, force mode)
Articles filtered (before cutoff date)
Any errors

Content Cleanup Patterns

The scraper removes these promotional patterns:

Footer patterns (stop processing):

"Whenever you're ready, there are X ways I can help you"
"Become a Better .NET Software Engineer"
"Hi, I'm Milan"

Sponsor patterns (remove section):

AuthKit/WorkOS mentions
"Sponsor this newsletter" links
Incident response sponsor content

Inline patterns (remove):

Reading time ("5 min read")
"Manage read history" links
Empty image placeholders
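
A minimal sketch of the footer and inline cleanup, assuming the scraped article is markdown. The regular expressions are guesses at how these patterns appear in the page (for example, an empty image placeholder is assumed to look like `![]()`), not the patterns used by the actual scraper. Sponsor sections are left out because their boundaries are not described here.

```python
import re

# Footer markers: everything from the line containing one of these is dropped.
FOOTER_MARKERS = [
    re.compile(r"Whenever you['’]re ready, there are \w+ ways I can help you", re.I),
    re.compile(r"Become a Better \.NET Software Engineer", re.I),
    re.compile(r"Hi, I['’]m Milan", re.I),
]

# Inline patterns: removed wherever they occur.
INLINE_PATTERNS = [
    re.compile(r"^[ \t]*\d+[ \t]+min read[ \t]*$", re.I | re.M),  # reading time
    re.compile(r"\[Manage read history\]\([^)]*\)", re.I),        # history link
    re.compile(r"^[ \t]*!\[\]\([ \t]*\)[ \t]*$", re.M),           # empty image
]


def clean_article(markdown: str) -> str:
    cut = len(markdown)
    for pattern in FOOTER_MARKERS:
        match = pattern.search(markdown)
        if match:
            # Cut at the start of the line so heading markers are dropped too.
            line_start = markdown.rfind("\n", 0, match.start()) + 1
            cut = min(cut, line_start)
    body = markdown[:cut]
    for pattern in INLINE_PATTERNS:
        body = pattern.sub("", body)
    body = re.sub(r"\n{3,}", "\n\n", body)  # collapse leftover blank lines
    return body.strip() + "\n"


sample = (
    "# Title\n\n5 min read\n\nReal content.\n\n![]()\n\n"
    "## Whenever you're ready, there are 4 ways I can help you\n\nPromo\n"
)
print(clean_article(sample))
```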

Efficiency Gains

Scenario	Without Optimization	With Optimization
No new articles	10+ firecrawl requests	1-2 requests
1 new article	10+ firecrawl requests	2-3 requests
Force (unchanged)	10+ requests	10+ requests but skips writes

Why this matters: Firecrawl has API costs and rate limits. Pre-filtering saves 80-90% of requests when articles haven't changed.
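
The 80-90% figure follows from the first table row, taking 10 requests as the unoptimized baseline (the table says "10+", so this is a lower bound). A small sketch of the arithmetic, with `request_savings` as an illustrative helper:

```python
def request_savings(without: int, with_prefilter: int) -> float:
    """Fraction of Firecrawl requests avoided by pre-filtering."""
    return 1 - with_prefilter / without


print(f"{request_savings(10, 1):.0%}")  # 1 request instead of 10
print(f"{request_savings(10, 2):.0%}")  # 2 requests instead of 10
```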

Example Usage

/milan-jovanovic:scrape-posts
/milan-jovanovic:scrape-posts --limit 3 --dry-run
/milan-jovanovic:scrape-posts --force
/milan-jovanovic:scrape-posts --since 2025-12-01

Troubleshooting

Firecrawl Not Available

If firecrawl MCP is not connected, the command will fail. Ensure the firecrawl MCP server is configured and running.

Date Parsing Issues

If a date on the listing page can't be parsed, the script logs the article under the no_date category. These articles are skipped unless you provide a specific URL.
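
For illustration, the date handling could be sketched like this. The accepted date formats and the candidates and filtered bucket names are assumptions; only the no_date category and the --since cutoff come from this document. Month-name formats rely on an English locale.

```python
from datetime import date, datetime
from typing import Optional

# Assumed listing-page formats; the real script may accept others.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%b %d, %Y")


def parse_listing_date(raw: str) -> Optional[date]:
    """Return the parsed date, or None when no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def categorize(entries, since: date) -> dict:
    """Bucket (slug, raw_date) pairs by parse result and cutoff."""
    result = {"candidates": [], "filtered": [], "no_date": []}
    for slug, raw in entries:
        parsed = parse_listing_date(raw)
        if parsed is None:
            result["no_date"].append(slug)
        elif parsed < since:
            result["filtered"].append(slug)
        else:
            result["candidates"].append(slug)
    return result


entries = [("post-a", "December 6, 2025"), ("post-b", "2025-10-18"), ("post-c", "last week")]
buckets = categorize(entries, since=date(2025, 11, 1))
print(buckets)
```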

Pre-Filter Shows 0 Articles

If check_new_articles.py shows 0 articles to scrape:

All articles are already indexed (use --force to re-check)
All articles are before the cutoff date (adjust --since)
Listing page format changed (check regex patterns in script)
