Skill

data-scraper-agent

From ecc

Builds a fully automated, AI-powered data collection agent for any public source: job boards, prices, news, GitHub, sports, and more. It scrapes on a schedule, enriches the data with a free LLM (Gemini Flash), stores results in Notion/Sheets/Supabase, and learns from user feedback. Runs 100% free on GitHub Actions. Use it when the user wants to automatically monitor, collect, or track public data.

npx claudepluginhub sam42-lab/everything-claude-code-kr

Tool Access

This skill uses the workspace's default tool permissions.

Last commit: Apr 12, 2026

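Data Scraper Agent

Builds a production-grade, AI-powered data collection agent for any public data source. It runs on a schedule, enriches results with a free LLM, stores them in a database, and improves over time.

Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase

When to Activate

The user wants to scrape or monitor a public website or API
The user says "build a bot that checks ...", "monitor X for me", or "collect data from ..."
The user wants to track jobs, prices, news, repositories, sports scores, events, listings, and so on
The user asks how to automate data collection without paying for hosting
The user wants an agent that gets smarter over time based on their decisions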

Core Concepts

Three Layers

Every data scraper agent has three layers:

COLLECT     →     ENRICH      →     STORE
   │                 │                │
Scraper           AI (LLM)         Database
runs on a         scores,          Notion /
schedule          summarises,      Sheets /
                  classifies       Supabase

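Free Stack

Layer	Tool	Why
Scraping	`requests` + `BeautifulSoup`	No cost; covers about 80% of public sites
JS-rendered sites	`playwright` (free)	Use when plain HTML scraping fails
AI enrichment	Gemini Flash (via REST API)	500 requests/day, 1M tokens, free
Storage	Notion API	Free tier available; good UI for review
Schedule	GitHub Actions cron	Free for public repositories
Learning	JSON feedback file in the repo	Zero infrastructure cost; persisted in git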

AI Model Fallback Chain

Build the agent so that it switches between Gemini models automatically when a quota is exhausted:

gemini-2.0-flash-lite (30 RPM) →
gemini-2.0-flash (15 RPM) →
gemini-2.5-flash (10 RPM) →
gemini-flash-lite-latest (final fallback)
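
The chain above can be sketched independently of the HTTP layer. This is a minimal illustration, not the skill's actual client: `QuotaExceeded`, `call_with_fallback`, and the injected `call_model` callable are hypothetical names introduced here, and a real client would map an HTTP 429 response onto the exception.

```python
from typing import Callable

# Model order copied from the chain above.
MODEL_FALLBACK = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-flash-lite-latest",
]


class QuotaExceeded(Exception):
    """Hypothetical signal that a model's quota is exhausted (e.g. HTTP 429)."""


def call_with_fallback(prompt: str, call_model: Callable[[str, str], dict]) -> tuple:
    """Try each model in order; return (model_used, result), or ("", {}) if all fail."""
    for model in MODEL_FALLBACK:
        try:
            return model, call_model(model, prompt)
        except QuotaExceeded:
            continue  # quota exhausted: move on to the next model in the chain
    return "", {}


# Offline demo: a fake backend in which the first two models are out of quota.
def fake_call(model: str, prompt: str) -> dict:
    if model in {"gemini-2.0-flash-lite", "gemini-2.0-flash"}:
        raise QuotaExceeded(model)
    return {"echo": prompt}


used, result = call_with_fallback("hello", fake_call)
print(used, result)
```

Injecting the model call keeps the ordering logic testable without network access or an API key.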

Batch API Calls for Efficiency

Never call the LLM once per item. Always batch:

# Bad: 33 API calls for 33 items
for item in items:
    result = call_ai(item)  # 33 calls → hits rate limits

# Good: 7 API calls for 33 items (batch size 5)
for batch in chunks(items, size=5):
    results = call_ai(batch)  # 7 calls → stays within the free tier
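
The snippet above calls `chunks` without defining it. A minimal sketch of such a helper, assuming `items` is a list; the name and signature simply mirror the call above.

```python
import math
from typing import Iterator


def chunks(items: list, size: int = 5) -> Iterator[list]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


items = list(range(33))
batches = list(chunks(items, size=5))

# 33 items at batch size 5 need ceil(33 / 5) = 7 calls; the last batch holds 3 items.
print(len(batches), math.ceil(len(items) / 5), len(batches[-1]))
```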

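Workflow

Step 1: Understand the Goal

Ask the user:

What to collect: "What is the data source? A URL, an API, RSS, or a public endpoint?"
What to extract: "Which fields matter? Title, price, URL, date, score?"
Where to store: "Where should the results go? Notion, Google Sheets, Supabase, or a local file?"
How to enrich: "Should the AI score, summarise, classify, or match each item?"
Frequency: "How often should it run? Hourly, daily, weekly?"

Common examples to suggest:

Job boards → score relevance against a résumé
Product prices → alert on price drops
GitHub repositories → summarise new releases
News feeds → classify by topic and sentiment
Sports results → extract statistics into a tracker
Event calendars → filter by interest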

Step 2: Design the Agent Architecture

Generate this directory structure for the user:

my-agent/
├── config.yaml              # User-defined settings (keywords, filters, preferences)
├── profile/
│   └── context.md           # User context the AI uses (résumé, interests, criteria)
├── scraper/
│   ├── __init__.py
│   ├── main.py              # Orchestrator: scrape → enrich → store
│   ├── filters.py           # Rule-based pre-filter (fast, runs before the AI)
│   └── sources/
│       ├── __init__.py
│       └── source_name.py   # One file per data source
├── ai/
│   ├── __init__.py
│   ├── client.py            # Gemini REST client with model fallback
│   ├── pipeline.py          # Batch AI analysis
│   ├── jd_fetcher.py        # Fetch full content from a URL (optional)
│   └── memory.py            # Learn from user feedback
├── storage/
│   ├── __init__.py
│   └── notion_sync.py       # or sheets_sync.py / supabase_sync.py
├── data/
│   └── feedback.json        # User decision history (updated automatically)
├── .env.example
├── setup.py                 # One-time DB/schema creation
├── enrich_existing.py       # Backfill AI scores for existing rows
├── requirements.txt
└── .github/
    └── workflows/
        └── scraper.yml      # GitHub Actions schedule

Step 3: Build the Scraper Sources

Template for any data source:

# scraper/sources/my_source.py
"""
[Source Name]: scrapes [what] from [where].
Method: [REST API / HTML scraping / RSS feed]
"""
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone
from scraper.filters import is_relevant

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)",
}


def fetch() -> list[dict]:
    """
    Return a list of items with a consistent schema.
    Each item must have at least name, url, and date_found.
    """
    results = []

    # ---- REST API source ----
    resp = requests.get("https://api.example.com/items", headers=HEADERS, timeout=15)
    if resp.status_code == 200:
        for item in resp.json().get("results", []):
            if not is_relevant(item.get("title", "")):
                continue
            results.append(_normalise(item))

    return results


def _normalise(raw: dict) -> dict:
    """Convert raw API/HTML data into the standard schema."""
    return {
        "name": raw.get("title", ""),
        "url": raw.get("link", ""),
        "source": "MySource",
        "date_found": datetime.now(timezone.utc).date().isoformat(),
        # Add domain-specific fields here
    }
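
The template imports `is_relevant` from `scraper/filters.py`, which is not shown. One possible shape for it is sketched below, assuming it applies the `required_keywords` and `blocked_keywords` lists from `config.yaml` as case-insensitive substring checks. `make_filter` is a hypothetical helper, and loading the lists from the config file is left out to keep the sketch dependency-free.

```python
from typing import Callable, Iterable


def make_filter(required: Iterable[str] = (), blocked: Iterable[str] = ()) -> Callable[[str], bool]:
    """Build an is_relevant(text) function from two keyword lists (hypothetical helper)."""
    required_lc = [k.lower() for k in required]
    blocked_lc = [k.lower() for k in blocked]

    def is_relevant(text: str) -> bool:
        lowered = text.lower()
        # Reject if any blocked keyword appears.
        if any(k in lowered for k in blocked_lc):
            return False
        # An empty required list means "accept everything that is not blocked".
        if not required_lc:
            return True
        # Otherwise at least one required keyword must appear.
        return any(k in lowered for k in required_lc)

    return is_relevant


is_relevant = make_filter(required=["python", "data"], blocked=["intern"])
print(is_relevant("Senior Python Engineer"))
print(is_relevant("Data Science Intern"))
```

Substring matching is cheap but blunt: a blocked keyword such as "intern" also rejects "internal", so word-boundary matching may be worth adding for noisy sources.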

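HTML scraping pattern:

from urllib.parse import urljoin

soup = BeautifulSoup(resp.text, "lxml")
for card in soup.select("[class*='listing']"):
    heading = card.select_one("h2, h3")
    anchor = card.select_one("a[href]")
    if heading is None or anchor is None:
        continue  # skip cards that lack a title or a link
    title = heading.get_text(strip=True)
    # Resolve relative links against the site root; absolute links pass through unchanged
    link = urljoin("https://example.com", anchor["href"])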

RSS feed pattern:

import xml.etree.ElementTree as ET
root = ET.fromstring(resp.text)
for item in root.findall(".//item"):
    title = item.findtext("title", "")
    link = item.findtext("link", "")

Step 4: Build the Gemini AI Client

# ai/client.py
import os, json, time, requests

_last_call = 0.0

MODEL_FALLBACK = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-flash-lite-latest",
]


def generate(prompt: str, model: str = "", rate_limit: float = 7.0) -> dict:
    """Call Gemini, falling back to the next model on 429 or 404. Returns the parsed JSON, or {} on failure."""
    global _last_call

    api_key = os.environ.get("GEMINI_API_KEY", "")
    if not api_key:
        return {}

    elapsed = time.time() - _last_call
    if elapsed < rate_limit:
        time.sleep(rate_limit - elapsed)

    # Try the requested model first, then the rest of the chain in order
    models = ([model] + [m for m in MODEL_FALLBACK if m != model]) if model else MODEL_FALLBACK
    _last_call = time.time()

    for m in models:
        url = f"https://generativelanguage.googleapis.com/v1beta/models/{m}:generateContent?key={api_key}"
        payload = {
            "contents": [{"parts": [{"text": prompt}]}],
            "generationConfig": {
                "responseMimeType": "application/json",
                "temperature": 0.3,
                "maxOutputTokens": 2048,
            },
        }
        try:
            resp = requests.post(url, json=payload, timeout=30)
            if resp.status_code == 200:
                return _parse(resp)
            if resp.status_code in (429, 404):
                time.sleep(1)
                continue
            return {}
        except requests.RequestException:
            return {}

    return {}


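def _parse(resp) -> dict:
    """Extract the JSON object from a Gemini response; return {} if it is missing or malformed."""
    try:
        text = (
            resp.json()
            .get("candidates", [{}])[0]
            .get("content", {})
            .get("parts", [{}])[0]
            .get("text", "")
            .strip()
        )
        if text.startswith("```"):
            text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
        data = json.loads(text)
    except (ValueError, KeyError, IndexError, AttributeError, TypeError):
        # ValueError covers JSON decoding errors; IndexError covers an empty candidates list
        return {}
    # Callers use .get(), so only a JSON object is acceptable
    return data if isinstance(data, dict) else {}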

Step 5: Build the AI Pipeline (Batch Processing)

# ai/pipeline.py
import json
import yaml
from pathlib import Path
from ai.client import generate

def analyse_batch(items: list[dict], context: str = "", preference_prompt: str = "") -> list[dict]:
    """Analyse items in batches. Returns the items with AI fields added."""
    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text(encoding="utf-8"))
    ai_cfg = config.get("ai", {})
    model = ai_cfg.get("model", "gemini-2.5-flash")
    rate_limit = ai_cfg.get("rate_limit_seconds", 7.0)
    min_score = ai_cfg.get("min_score", 0)
    batch_size = ai_cfg.get("batch_size", 5)

    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    print(f"  [AI] {len(items)} items → {len(batches)} API calls")

    enriched = []
    for i, batch in enumerate(batches):
        print(f"  [AI] Processing batch {i + 1}/{len(batches)}...")
        prompt = _build_prompt(batch, context, preference_prompt, config)
        result = generate(prompt, model=model, rate_limit=rate_limit)

        analyses = result.get("analyses", [])
        for j, item in enumerate(batch):
            ai = analyses[j] if j < len(analyses) else {}
            if ai:
                try:
                    score = max(0, min(100, int(ai.get("score", 0))))
                except (TypeError, ValueError):
                    score = 0  # the model returned a non-numeric score
                if min_score and score < min_score:
                    continue
                enriched.append({**item, "ai_score": score, "ai_summary": ai.get("summary", ""), "ai_notes": ai.get("notes", "")})
            else:
                # No analysis came back for this item: keep it without AI fields
                enriched.append(item)

    return enriched


def _build_prompt(batch, context, preference_prompt, config):
    priorities = config.get("priorities", [])
    items_text = "\n\n".join(
        f"Item {i+1}: {json.dumps({k: v for k, v in item.items() if not k.startswith('_')})}"
        for i, item in enumerate(batch)
    )

    return f"""Analyse the following {len(batch)} items and return a JSON object.

# Items
{items_text}

# User Context
{context[:800] if context else "Not provided"}

# User Priorities
{chr(10).join(f"- {p}" for p in priorities)}

{preference_prompt}

# Instructions
Return: {{"analyses": [{{"score": <0-100>, "summary": "<2 sentences>", "notes": "<why it matches or does not match>"}} for each item, in order]}}
Be concise. Scores: 90+ = excellent, 70-89 = good, 50-69 = fair, <50 = poor."""

Step 6: Build the Feedback Learning System

# ai/memory.py
"""Learn from the user's decisions to improve future scoring."""
import json
from pathlib import Path

FEEDBACK_PATH = Path(__file__).parent.parent / "data" / "feedback.json"


def load_feedback() -> dict:
    if FEEDBACK_PATH.exists():
        try:
            return json.loads(FEEDBACK_PATH.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            pass
    return {"positive": [], "negative": []}


def save_feedback(fb: dict):
    FEEDBACK_PATH.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps non-ASCII item names readable in the committed file
    FEEDBACK_PATH.write_text(json.dumps(fb, indent=2, ensure_ascii=False), encoding="utf-8")


def build_preference_prompt(feedback: dict, max_examples: int = 15) -> str:
    """Turn the feedback history into a prompt section that biases scoring."""
    lines = []
    if feedback.get("positive"):
        lines.append("# Items the user liked (positive signal):")
        for e in feedback["positive"][-max_examples:]:
            lines.append(f"- {e}")
    if feedback.get("negative"):
        lines.append("\n# Items the user skipped or rejected (negative signal):")
        for e in feedback["negative"][-max_examples:]:
            lines.append(f"- {e}")
    if lines:
        lines.append("\nTake these patterns into account when scoring new items.")
    return "\n".join(lines)

Integration with the storage layer: after each run, query the database for items that have a positive or negative status and call save_feedback() with the extracted patterns.
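
The integration step is described only in prose. A minimal sketch of the extraction part follows, assuming each stored row has already been fetched into a dict with `name` and `status` keys. `extract_feedback` is a hypothetical helper, the status sets mirror the `feedback` section of `config.yaml`, and using the item name as the recorded "pattern" is an assumption.

```python
from typing import Optional

# Status names mirror the feedback section of config.yaml.
POSITIVE_STATUSES = {"Saved", "Applied", "Interested"}
NEGATIVE_STATUSES = {"Skip", "Rejected", "Not relevant"}


def extract_feedback(rows: list, previous: Optional[dict] = None) -> dict:
    """Merge decided rows into a {"positive": [...], "negative": [...]} history."""
    previous = previous or {}
    feedback = {
        "positive": list(previous.get("positive", [])),
        "negative": list(previous.get("negative", [])),
    }
    for row in rows:
        name = row.get("name", "").strip()
        status = row.get("status", "")
        if not name:
            continue
        if status in POSITIVE_STATUSES:
            bucket = "positive"
        elif status in NEGATIVE_STATUSES:
            bucket = "negative"
        else:
            continue  # e.g. "New": the user has not decided yet
        if name not in feedback[bucket]:
            feedback[bucket].append(name)  # avoid recording the same item twice
    return feedback


rows = [
    {"name": "Backend Engineer at Acme", "status": "Applied"},
    {"name": "Unpaid internship", "status": "Skip"},
    {"name": "Data Analyst", "status": "New"},
]
feedback = extract_feedback(rows)
print(feedback)
```

The returned dict has the shape that `load_feedback()` produces, so it can be passed straight to `save_feedback()`.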

Step 7: Build the Storage Layer (Notion Example)

# storage/notion_sync.py
import os
from notion_client import Client
from notion_client.errors import APIResponseError

_client = None

def get_client():
    global _client
    if _client is None:
        _client = Client(auth=os.environ["NOTION_TOKEN"])
    return _client

def get_existing_urls(db_id: str) -> set[str]:
    """Fetch every URL already stored; used for deduplication."""
    client, seen, cursor = get_client(), set(), None
    while True:
        # Only pass start_cursor once the API has returned one
        kwargs = {"start_cursor": cursor} if cursor else {}
        resp = client.databases.query(database_id=db_id, page_size=100, **kwargs)
        for page in resp["results"]:
            url = page["properties"].get("URL", {}).get("url") or ""
            if url:
                seen.add(url)
        if not resp["has_more"]:
            break
        cursor = resp["next_cursor"]
    return seen

def push_item(db_id: str, item: dict) -> bool:
    """Push a single item to Notion. Returns True on success."""
    props = {
        "Name": {"title": [{"text": {"content": item.get("name", "")[:100]}}]},
        "URL": {"url": item.get("url")},
        "Source": {"select": {"name": item.get("source", "Unknown")}},
        "Date Found": {"date": {"start": item.get("date_found")}},
        "Status": {"select": {"name": "New"}},
    }
    # AI fields
    if item.get("ai_score") is not None:
        props["AI Score"] = {"number": item["ai_score"]}
    if item.get("ai_summary"):
        props["Summary"] = {"rich_text": [{"text": {"content": item["ai_summary"][:2000]}}]}
    if item.get("ai_notes"):
        props["Notes"] = {"rich_text": [{"text": {"content": item["ai_notes"][:2000]}}]}

    try:
        get_client().pages.create(parent={"database_id": db_id}, properties=props)
        return True
    except APIResponseError as e:
        print(f"[notion] push failed: {e}")
        return False

def sync(db_id: str, items: list[dict]) -> tuple[int, int]:
    existing = get_existing_urls(db_id)
    added = skipped = 0
    for item in items:
        if item.get("url") in existing:
            skipped += 1; continue
        if push_item(db_id, item):
            added += 1; existing.add(item["url"])
        else:
            skipped += 1
    return added, skipped
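
The `config.yaml` template in Step 10 also lists `sqlite` as a provider, but no local backend is shown. A minimal sketch of what a `storage/sqlite_sync.py` might look like, using only the standard library. The file name, table name, and column set are assumptions; the function mirrors the `(added, skipped)` return shape of the Notion `sync()` above.

```python
import sqlite3


def sync(db_path: str, items: list) -> tuple:
    """Insert items into a local SQLite table, skipping URLs already stored.

    Returns (added, skipped), like the Notion sync() above.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS items (
                   url TEXT PRIMARY KEY,
                   name TEXT,
                   source TEXT,
                   date_found TEXT,
                   ai_score INTEGER,
                   ai_summary TEXT
               )"""
        )
        added = skipped = 0
        for item in items:
            url = item.get("url")
            if not url:
                skipped += 1
                continue
            # INSERT OR IGNORE leaves an existing row untouched; rowcount shows which case occurred.
            cur = conn.execute(
                "INSERT OR IGNORE INTO items VALUES (?, ?, ?, ?, ?, ?)",
                (url, item.get("name", ""), item.get("source", ""),
                 item.get("date_found"), item.get("ai_score"), item.get("ai_summary", "")),
            )
            if cur.rowcount == 1:
                added += 1
            else:
                skipped += 1
        conn.commit()
        return added, skipped
    finally:
        conn.close()


demo = [
    {"name": "A", "url": "https://example.com/a", "source": "Demo", "date_found": "2026-01-01"},
    {"name": "A again", "url": "https://example.com/a", "source": "Demo", "date_found": "2026-01-02"},
]
print(sync(":memory:", demo))
```

Making `url` the primary key lets the database enforce deduplication, so no separate "fetch existing URLs" pass is needed.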

Step 8: Orchestrate in main.py

# scraper/main.py
import os, sys, yaml
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

from scraper.sources import my_source          # add your sources here

# Note: this example uses Notion. If storage.provider is "sheets" or "supabase",
# replace this import with storage.sheets_sync or storage.supabase_sync and
# update the environment variables and the sync() call accordingly.
from storage.notion_sync import sync

SOURCES = [
    ("My Source", my_source.fetch),
]

def ai_enabled():
    return bool(os.environ.get("GEMINI_API_KEY"))

def main():
    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
    provider = config.get("storage", {}).get("provider", "notion")

    # Resolve the storage target from environment variables, based on the provider
    if provider == "notion":
        db_id = os.environ.get("NOTION_DATABASE_ID")
        if not db_id:
            print("ERROR: NOTION_DATABASE_ID is not set."); sys.exit(1)
    else:
        # Extend here for sheets (SHEET_ID), supabase (SUPABASE_TABLE), and so on
        print(f"ERROR: provider '{provider}' is not wired up in main.py yet."); sys.exit(1)

    all_items = []

    for name, fetch_fn in SOURCES:
        try:
            items = fetch_fn()
            print(f"[{name}] collected {len(items)} items")
            all_items.extend(items)
        except Exception as e:
            print(f"[{name}] failed: {e}")

    # Deduplicate by URL
    seen, deduped = set(), []
    for item in all_items:
        if (url := item.get("url", "")) and url not in seen:
            seen.add(url); deduped.append(item)

    print(f"Unique items: {len(deduped)}")

    if ai_enabled() and deduped:
        from ai.memory import load_feedback, build_preference_prompt
        from ai.pipeline import analyse_batch

        # load_feedback() reads data/feedback.json, which is written by a feedback sync script.
        # To keep it current, implement a separate feedback_sync.py that queries the store for
        # items with a positive or negative status and calls save_feedback().
        feedback = load_feedback()
        preference = build_preference_prompt(feedback)
        context_path = Path(__file__).parent.parent / "profile" / "context.md"
        context = context_path.read_text() if context_path.exists() else ""
        deduped = analyse_batch(deduped, context=context, preference_prompt=preference)
    elif deduped:
        # Only report a missing key when there was something to analyse
        print("[AI] Skipped: GEMINI_API_KEY is not set")

    added, skipped = sync(db_id, deduped)
    print(f"Done: {added} new, {skipped} skipped")

if __name__ == "__main__":
    main()

Step 9: GitHub Actions Workflow

# .github/workflows/scraper.yml
name: Data Scraper Agent

on:
  schedule:
    - cron: "0 */3 * * *"  # every 3 hours; adjust as needed
  workflow_dispatch:        # allow manual runs

permissions:
  contents: write   # required by the feedback-history commit step

jobs:
  scrape:
    runs-on: ubuntu-latest
    timeout-minutes: 20

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - run: pip install -r requirements.txt

      # Uncomment if Playwright is enabled in requirements.txt
      # - name: Install Playwright browsers
      #   run: python -m playwright install chromium --with-deps

      - name: Run agent
        env:
          NOTION_TOKEN: ${{ secrets.NOTION_TOKEN }}
          NOTION_DATABASE_ID: ${{ secrets.NOTION_DATABASE_ID }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: python -m scraper.main

      - name: Commit feedback history
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/feedback.json || true
          git diff --cached --quiet || git commit -m "chore: update feedback history"
          git push

Step 10: config.yaml Template

# Customise this file; no code changes are needed.

# What to collect (pre-filter applied before AI analysis)
filters:
  required_keywords: []      # an item must contain at least one of these
  blocked_keywords: []       # an item must contain none of these

# Priorities: the AI uses these when scoring.
priorities:
  - "Example priority 1"
  - "Example priority 2"

# Storage
storage:
  provider: "notion"         # notion | sheets | supabase | sqlite

# Feedback learning
feedback:
  positive_statuses: ["Saved", "Applied", "Interested"]
  negative_statuses: ["Skip", "Rejected", "Not relevant"]

# AI settings
ai:
  enabled: true
  model: "gemini-2.5-flash"
  min_score: 0               # items scoring below this are filtered out
  rate_limit_seconds: 7      # seconds to wait between API calls
  batch_size: 5              # items per API call

Common Scraping Patterns

Pattern 1: REST API (easiest)

resp = requests.get(url, params={"q": query}, headers=HEADERS, timeout=15)
items = resp.json().get("results", [])

Pattern 2: HTML scraping

soup = BeautifulSoup(resp.text, "lxml")
for card in soup.select(".listing-card"):
    title = card.select_one("h2").get_text(strip=True)
    href = card.select_one("a")["href"]

Pattern 3: RSS feed

import xml.etree.ElementTree as ET
root = ET.fromstring(resp.text)
for item in root.findall(".//item"):
    title = item.findtext("title", "")
    link = item.findtext("link", "")
    pub_date = item.findtext("pubDate", "")
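
RSS `pubDate` values are RFC 822 style strings, so they usually need normalising before they sit alongside ISO fields such as `date_found`. A minimal sketch using only the standard library; the sample feed and the `parse_feed` name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

SAMPLE_RSS = """<rss version="2.0"><channel>
  <item>
    <title>First post</title>
    <link>https://example.com/1</link>
    <pubDate>Mon, 06 Jan 2025 10:30:00 +0000</pubDate>
  </item>
  <item>
    <title>No date</title>
    <link>https://example.com/2</link>
  </item>
</channel></rss>"""


def parse_feed(xml_text: str) -> list:
    """Return one dict per <item>, with pubDate converted to an ISO date string."""
    items = []
    root = ET.fromstring(xml_text)
    for item in root.findall(".//item"):
        raw_date = item.findtext("pubDate", "")
        try:
            published = parsedate_to_datetime(raw_date).date().isoformat()
        except (TypeError, ValueError):
            published = ""  # missing or malformed date
        items.append({
            "name": item.findtext("title", ""),
            "url": item.findtext("link", ""),
            "published": published,
        })
    return items


feed = parse_feed(SAMPLE_RSS)
print(feed)
```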

Pattern 4: Paginated API

page = 1
while True:
    resp = requests.get(url, params={"page": page, "limit": 50}, timeout=15)
    data = resp.json()
    items = data.get("results", [])
    if not items:
        break
    for item in items:
        results.append(_normalise(item))
    if not data.get("has_more"):
        break
    page += 1
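
The loop above runs until the API reports that there are no more results, so a misbehaving endpoint could keep it going indefinitely. One way to bound it is sketched below, with the HTTP call injected as a function so that the logic can be exercised offline. `paginate`, `fetch_page`, and `max_pages` are hypothetical names; the `results` / `has_more` response shape is taken from the pattern above.

```python
import time
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int], dict], max_pages: int = 20, delay: float = 0.0) -> Iterator[dict]:
    """Yield items page by page; stop on an empty page, has_more == False, or max_pages."""
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        items = data.get("results", [])
        if not items:
            return
        yield from items
        if not data.get("has_more"):
            return
        if delay:
            time.sleep(delay)  # pause between requests to stay polite


# Offline demo: a fake API with three pages of two items each.
def fake_api(page: int) -> dict:
    return {"results": [{"id": page * 10 + i} for i in range(2)], "has_more": page < 3}


all_items = list(paginate(fake_api))
print([item["id"] for item in all_items])
```

In a real source, `fetch_page` would wrap the `requests.get(...)` call from the pattern above and return `resp.json()`.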

Pattern 5: JS-rendered pages (Playwright)

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    page.wait_for_selector(".listing")
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "lxml")

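Anti-Patterns to Avoid

Anti-pattern	Problem	Fix
One LLM call per item	Hits rate limits immediately	Batch 5 items per call
Keywords hardcoded in code	Not reusable	Move all settings to `config.yaml`
Scraping without rate limiting	Possible IP ban	Add `time.sleep(1)` between requests
Secrets stored in code	Security risk	Always use `.env` + GitHub Secrets
No deduplication	Duplicate rows pile up	Always check the URL before storing
Ignoring `robots.txt`	Legal/ethical risk	Respect crawl rules; use a public API where one exists
Using `requests` on JS-rendered sites	Empty response	Use Playwright or look for the underlying API
`maxOutputTokens` too low	JSON is truncated and fails to parse	Use 2048 or more for batch responses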
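
For the `robots.txt` row, Python's standard library includes a parser. A minimal sketch follows; the rules are parsed from an inline string so that the example runs offline, whereas a real agent would call `set_url()` and `read()` against the target site. The user-agent string and the sample rules are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only; a real site serves these at /robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

USER_AGENT = "research-bot"

allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/jobs")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait between requests, or None

print(allowed_public, allowed_private, delay)
```

When `crawl_delay()` returns a number, it is a natural value to use in place of the fixed `time.sleep(1)` suggested in the table.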

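Free Tier Limits Reference

Service	Free limit	Typical usage
Gemini Flash Lite	30 RPM, 1,500 requests/day	About 56 requests/day at a 3-hour interval
Gemini 2.0 Flash	15 RPM, 1,500 requests/day	Good fallback
Gemini 2.5 Flash	10 RPM, 500 requests/day	Use sparingly
GitHub Actions	Unlimited (public repositories)	About 20 minutes/day
Notion API	Unlimited	About 200 writes/day
Supabase	500 MB database, 2 GB transfer	Enough for most agents
Google Sheets API	300 requests/minute	Suitable for small agents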
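
The figure of about 56 requests per day follows from numbers used earlier in this document: a run every 3 hours, 33 items per run, and a batch size of 5. A quick check, where the item count per run is the illustrative figure from the batching example rather than a measured value:

```python
import math

runs_per_day = 24 // 3   # cron "0 */3 * * *" fires every 3 hours
items_per_run = 33       # illustrative figure from the batching example
batch_size = 5           # ai.batch_size in config.yaml

calls_per_run = math.ceil(items_per_run / batch_size)
calls_per_day = runs_per_day * calls_per_run

print(runs_per_day, calls_per_run, calls_per_day)  # 8 7 56
```

Changing the schedule or the batch size shifts this total proportionally, which is worth rechecking against the daily limits in the table.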

Requirements Template

requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
python-dotenv==1.0.1
pyyaml==6.0.2
notion-client==2.2.1   # if using Notion
# playwright==1.40.0   # uncomment for JS-rendered sites

Quality Checklist

Before marking the agent complete, verify the following:

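Real-World Examples

"Build an agent that monitors Hacker News for AI startup funding news"
"Scrape product prices from 3 e-commerce sites and alert me when a price drops"
"Track new GitHub repositories tagged 'llm' or 'agents' on Hacker News and summarise each one"
"Collect Chief of Staff job postings from LinkedIn and Cutshort and organise them in Notion"
"Monitor Reddit posts that mention our company and classify their sentiment"
"Collect new arXiv papers on the topics I care about every day"
"Track sports results and maintain a standings table in Google Sheets"
"Build a real-estate listing watcher that alerts me when a new listing appears below a given price"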

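Reference Implementation

A complete working agent built with this architecture scrapes four or more sources, batches its Gemini calls, learns from decisions stored in Notion (applied/rejected), and runs 100% free on GitHub Actions. Follow Steps 1-10 above to build your own.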