Scrapes daily YC job listings from workatastartup.com without duplicates, using Playwright auth, Inertia.js JSON extraction, an HTML fallback, and SQLite deduplication. Useful for keeping startup job databases up to date.
npx claudepluginhub varnan-tech/opendirectory --plugin opendirectory-gtm-skills

This skill uses the workspace's default tool permissions.
This skill provides a robust architecture for scraping jobs from YCombinator and `workatastartup.com`. It is designed to run automatically, bypass login bottlenecks, and maintain state to never scrape duplicate jobs.
The scraper uses a hybrid approach to maximize reliability and minimize bot detection:
- `scripts/auth.js` uses Playwright to let a human log in once and saves the session to `scripts/state.json`.
- `scripts/db.js` uses better-sqlite3 to manage `scripts/jobs.db`; it tracks every `company_slug` and `job_id` ever seen.
- `scripts/scraper.js` loads `state.json`, visits YC query URLs, and extracts company slugs from the hidden Inertia.js `data-page` JSON payload. It then visits each company page (`/companies/[slug]`) and extracts jobs from the backend JSON payload, ensuring we get the real `job_id` for accurate deduplication. If the JSON payload is unavailable, it falls back to the HTML at `ycombinator.com/companies/[slug]/jobs`.

If this is the first time running the scraper in an environment, or if `node_modules` is missing:
cd @path/scripts
npm install
npx playwright install
If scripts/state.json is missing or expired, the scraper will fail. You must instruct the human user to run the authentication script manually:
cd @path/scripts
node auth.js
Tell the user a browser will open, and they must log in. Playwright will automatically save the cookies/tokens to state.json.
To scrape for new companies and jobs:
cd @path/scripts
node scraper.js
This script reports exactly how many new companies and new jobs were found. Because every ID is recorded in jobs.db, running it multiple times in a row will report 0 new jobs.
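The Inertia.js extraction the scraper relies on works because Inertia apps embed their full page props as HTML-escaped JSON in a `data-page` attribute on the root `<div>`, so slugs can be pulled from static HTML without executing any JavaScript. A minimal sketch; the payload shape shown is illustrative, and the real YC payload keys may differ:

```javascript
// Extract the Inertia.js payload embedded in a page's data-page attribute.
// Inertia HTML-escapes the JSON, so quotes arrive as &quot; etc.
function parseInertiaPayload(html) {
  const match = html.match(/data-page="([^"]+)"/);
  if (!match) return null;
  const json = match[1]
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&');
  return JSON.parse(json);
}

// Illustrative payload: the real props structure on workatastartup.com may differ.
const html =
  '<div id="app" data-page="{&quot;props&quot;:{&quot;companies&quot;:' +
  '[{&quot;slug&quot;:&quot;acme&quot;},{&quot;slug&quot;:&quot;globex&quot;}]}}"></div>';
const payload = parseInertiaPayload(html);
const slugs = payload.props.companies.map(c => c.slug);
console.log(slugs); // [ 'acme', 'globex' ]
```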
If you need to analyze the scraped data or view the companies/jobs, you can query scripts/jobs.db directly using better-sqlite3.
Example: Count Companies
cd @path/scripts
node -e "const db = require('better-sqlite3')('jobs.db'); console.log('Companies:', db.prepare('SELECT COUNT(*) as count FROM companies').get().count);"
Example: View Recent Jobs
cd @path/scripts
node -e "const db = require('better-sqlite3')('jobs.db'); const jobs = db.prepare('SELECT title, company_slug, location FROM jobs ORDER BY created_at DESC LIMIT 5').all(); console.table(jobs);"
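The "0 new jobs on repeat runs" behavior comes from recording every `job_id` ever seen. A minimal sketch of that dedup logic, using an in-memory `Set` in place of the persistent SQLite table that `jobs.db` provides (function names here are illustrative, not the skill's actual API):

```javascript
// Dedup sketch: the real skill persists seen IDs in scripts/jobs.db via
// better-sqlite3; a Set stands in for that table here.
function makeDeduper() {
  const seen = new Set();
  // Returns only jobs whose job_id has never been seen, recording them as seen.
  return function filterNew(jobs) {
    const fresh = jobs.filter(j => !seen.has(j.job_id));
    fresh.forEach(j => seen.add(j.job_id));
    return fresh;
  };
}

const filterNew = makeDeduper();
const scraped = [
  { job_id: 101, title: 'Engineer' },
  { job_id: 102, title: 'Designer' },
];
console.log(filterNew(scraped).length); // 2 -> first run finds both jobs
console.log(filterNew(scraped).length); // 0 -> second run finds nothing new
```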