Skill

wrds

Queries WRDS PostgreSQL databases (Compustat, CRSP, ExecuComp, Capital IQ, ISS, TAQ, SDC, DealScan, PitchBook, FISD, FJC) and submits SAS ETL jobs via SGE grid (qsub). Enforces login-node compute restrictions.

SQL

PostgreSQL

data-engineering

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/workflows:wrds

Not user invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

- [WRDS Login Node Enforcement](#wrds-login-node-enforcement)

Supporting Files

examples/blockholders_pipeline/README.mdexamples/blockholders_pipeline/pull_blockholders.pyexamples/blockholders_pipeline/redo_bridge.pyexamples/blockholders_pipeline/src/__init__.pyexamples/blockholders_pipeline/src/aggregate.pyexamples/blockholders_pipeline/src/parser.pyexamples/form4_disposals.pyexamples/form4_pipeline/README.mdexamples/form4_pipeline/form4_step1_query_filings.pyexamples/form4_pipeline/form4_step2_download_xmls.shexamples/form4_pipeline/form4_step3_parse_xmls.pyexamples/form4_pipeline/pull_insider_ownership.pyexamples/form4_pipeline/sas/pull_tr_insiders.sasexamples/form4_pipeline/sas/run_insider_array.shexamples/formd_regd.ipynbexamples/fund_formation_eda.ipynbexamples/lpc_dealscan_eda.ipynbexamples/lpc_dealscan_eda.pyexamples/pitchbook_eda.ipynbexamples/proxy_advisors_pipeline/README.md

SKILL.md

378 lines · ~4.9k tokens

Stats

LanguageJupyter Notebook

Stars16

Forks5

MaintenanceExcellent

Last CommitJun 16, 2026

Actions

View Source View Plugin View on GitHub View README

WRDS Login Node Enforcement

IRON LAW: NEVER RUN COMPUTE ON THE WRDS LOGIN NODE

The WRDS login node is shared infrastructure. Running parsers, bulk file reads, SAS jobs, or any process taking >30 seconds on the login node will get the account flagged.

ALWAYS write an SGE submission script and submit via qsub. No exceptions.

ssh wrds 'cat files.tsv | ./parser > output.tsv' → WRONG. Use qsub.
ssh wrds 'nohup ./process &' → WRONG. Still the login node. Use qsub.
ssh wrds 'python3 bulk_process.py' → WRONG. Use qsub.
qsub -t 1-20 submit.sh → CORRECT.

The login node is for: qsub, qstat, qdel, scp, ls, head, short psql queries.

See references/constraints/wrds-sge-enforcement.md for the full pattern and existing examples (quorum parser, state-of-incorp parser, SAS pipeline).

Running compute on the login node is NOT HELPFUL — it gets the user's account flagged, the job killed, and the work lost. You run on the login node because qsub feels like overhead. The overhead is 5 minutes of script writing. The downside is account suspension and a rerun from scratch.

Login Node & Infrastructure Facts

Tests go through the scheduler too: qsub -t 1-1 submit.sh. The login-node "quick test" is the run that flags the account — one file becomes 100K when the command changes, and 173K filings over NFS is not 30 seconds.
The quorum parser does not run on the login node and never did — it runs via submit_quorum.sh. Citing it as login-node precedent is an unverified claim presented as fact.
The wrds_clean_filings path convention is cik_int.zfill(10)[:6]/{cik_int}/{accession}.txt (see references/edgar.md). Hand-rolled path logic gets this wrong.
scan_covers profiles handle header extraction, body parsing, and custom extractors (Custom field type) — "this parser is different enough to need its own binary" has not yet been true once.

Red Flags — STOP Immediately If You're About To:

Write ssh wrds '... | ./binary > output' → STOP. That's login-node compute. Write a submit script.
Write ssh wrds 'nohup ... &' → STOP. nohup doesn't change the node. Use qsub.
Write ssh wrds 'python3 ...' for anything that reads >10 files → STOP. Use qsub.
Skip reading references/edgar.md before building a new WRDS file parser → STOP. The path conventions, SGE patterns, and existing parsers are already documented. Read them first.
Create a new standalone Go binary for EDGAR extraction → STOP. scripts/scan_covers/ is a generic profile-based framework. Add a profiles_*.go file, not a new binary. The framework handles SGE sharding, path construction, concurrency, and form-type filtering.
Build a new Go/Python parser without checking scripts/scan_covers/ → STOP. This framework exists precisely so you don't reinvent extraction infrastructure. Every standalone parser is technical debt that should have been a profile.

IRON LAW: USE SCAN_COVERS, NOT STANDALONE BINARIES

Before writing ANY new EDGAR filing extractor:

Read scripts/scan_covers/ — generic profile-based Go framework with SGE, concurrency, path handling
Add a profiles_*.go file — not a standalone binary. The Profile struct supports pattern-based fields AND custom extractors (set FullBody: true for body-text searches like prospectus 485 filings — see profiles_proxy_advisors.go)
Read references/edgar.md — path conventions, existing profiles, SGE submission patterns

Building a standalone parser when scan_covers exists is NOT HELPFUL — it reinvents infrastructure that already handles SGE sharding, NFS concurrency, path construction, form-type filtering, and error handling. You built a 300-line standalone Go binary, ran it on the login node, got the path convention wrong, and spent 5 iterations fixing it. Adding a 60-line profile to scan_covers would have worked on the first try.

Every standalone EDGAR parser is technical debt. The scan_covers framework exists to eliminate this class of mistake.

WRDS Data Access

WRDS (Wharton Research Data Services) provides academic research data via PostgreSQL at wrds-pgdata.wharton.upenn.edu:9737.

Query Enforcement

IRON LAW: NO QUERY WITHOUT FILTER VALIDATION FIRST

Before executing ANY WRDS query, you MUST:

IDENTIFY what filters are required for this dataset
VALIDATE the query includes those filters
VERIFY parameterized queries (never string formatting)
EXECUTE the query
INSPECT a sample of results before claiming success

This is not negotiable. Skipping sample inspection is NOT HELPFUL — the user builds analysis on data with undetected quality problems.

Red Flags

Running a query without checking the Critical Filters section → standard filters apply even when the user doesn't mention them, and even for test queries.
Pulling everything to filter in pandas later → filter at the database level first.
Guessing a table name from the request → check the Quick Reference section for exact names.
Claiming success before sample inspection → inspect .head()/.sample() first; query success ≠ data quality.

Query Validation Checklist

Before EVERY query execution:

For Compustat queries (comp.funda, comp.fundq):

For CRSP v2 queries (crsp.dsf_v2, crsp.msf_v2):

Post-query filter: sharetype == 'NS'
Post-query filter: securitytype == 'EQTY'
Post-query filter: securitysubtype == 'COM'
Post-query filter: usincflg == 'Y'
Post-query filter: issuertype.isin(['ACOR', 'CORP'])
Uses parameterized queries

For Form 4 queries (tr_insiders.table1):

Transaction type filter specified (acqdisp)
Transaction codes specified (trancode)
Date range is explicitly specified
Uses parameterized queries

For ALL queries:

Sample inspection with .head() or .sample() BEFORE claiming success
Row count verification (is result size reasonable?)
NULL value check on critical columns
Date range validation (does min/max match expectations?)

SAS ETL Enforcement

IRON LAW: NO SAS CODE WITHOUT PERFORMANCE VALIDATION FIRST

Before writing or executing ANY SAS code on WRDS, you MUST validate performance patterns. This is not negotiable.

MERGE STRATEGY — Is hash or sort-merge appropriate? Justify the choice.
WHERE CLAUSES — Are all date/string filters index-friendly? No functions on indexed columns.
PARALLELISM — Can this job run as an SGE array? Year-by-year is always parallelizable.
SQL OPTIMIZATION — For PROC SQL: pass-through opportunity? Indexed join columns?

Writing SAS code that forces full table scans when indexes exist is NOT HELPFUL — the user's job runs 100x slower than necessary and may timeout.

SAS Code Validation Checklist

Before EVERY SAS program execution:

For merges/joins:

Small lookup + large fact table → hash object (not PROC SORT + DATA merge)
Hash uses defineKey/defineData/defineDone pattern correctly
h.output() uses double quotes for macro resolution (not single quotes)
call missing() initializes hash data variables for non-matches
Both tables >50M rows → sort-merge is justified (document why)

For WHERE clauses (CRITICAL):

NO year(date), month(date), datepart(dt) wrapping indexed columns
Date filters use BETWEEN "01jan&year."d AND "31dec&year."d range pattern
String filters avoid upcase(), substr() on indexed columns
Compound date filters collapsed to single range (not year() = X AND quarter() = Y)

For batch processing:

Multi-year jobs use SGE array (#$ -t start-end) not sequential loop
Year passed via -sysparm (not -set or %sysget)
Per-year log files (not single shared log)
Memory allocation appropriate for workload (#$ -l m_mem_free=4G minimum)
Single-year benchmark run completed before full array submission

For PROC SQL:

Join columns are not wrapped in functions
calculated keyword used for computed column references in HAVING
Pass-through SQL considered for direct WRDS PostgreSQL queries
No redundant subqueries that could be hash lookups

For macros:

Macro variables terminated with period (&year. not &year)
Double quotes used where macro resolution is needed
options mprint mlogic symbolgen used during development

SAS Performance Facts

Hash lookup joins are ~10x faster than PROC SORT + MERGE and need no sorting; PROC SQL still sorts for joins. The hash is 5 extra lines — choosing sort-merge for a lookup join makes the user's job slower for your convenience.
year(date) (or any function) on an indexed column forces a full table scan over millions of rows; BETWEEN with date literals uses the index.
Sequential multi-year jobs run ~18x slower than the SGE array (18 years × 3 minutes = 54 minutes sequential vs 3 minutes parallel) — "I'll parallelize later" is anti-efficient on its own terms.
Single quotes in h.output(dataset: '...') block macro resolution — the output dataset name comes out wrong. Always double quotes.
%sysget is unreliable under SGE — it may return blank silently. Pass the year via -sysparm + &sysparm..

SAS Red Flags - STOP Immediately If You're About To:

Write where year(date) = anything → STOP. Use BETWEEN with date literals.
Write proc sort; data; merge for a lookup join → STOP. Use hash object.
Write a %do year = start %to end loop → STOP. Use SGE array job.
Use single quotes in h.output(dataset: '...') → STOP. Use double quotes.
Submit a full array job without testing one year first → STOP. Benchmark first.
Use -set or %sysget for SGE task parameters → STOP. Use -sysparm.

SAS Reference

See references/sas-etl.md for complete patterns:

Hash object merge (basic, multidata, accumulator)
Index-friendly WHERE clause quick reference table
SGE array job templates with memory and logging
PROC SQL pass-through and optimization
Macro quoting and debugging

Quick Reference: Table Names

Dataset	Schema	Key Tables
Compustat	`comp`	`company`, `funda`, `fundq`, `secd`
ExecuComp	`comp_execucomp`	`anncomp`
CRSP	`crsp`	`dsf`, `msf`, `stocknames`, `ccmxpf_linkhist`
CRSP v2	`crsp`	`dsf_v2`, `msf_v2`, `stocknames_v2`
Form 4 Insiders	`tr_insiders`	`table1`, `header`, `company`
ISS Incentive Lab	`iss_incentive_lab`	`comppeer`, `sumcomp`, `participantfy`
Capital IQ	`ciq`	`wrds_compensation`
IBES	`tr_ibes`	`det_epsus`, `statsum_epsus`
Form D / Reg D	`wrdssec`	`wrds_vc_formd` (parsed, 2000–2020); index: `wrdssec_all.forms` (all CIKs) or `wrds_forms` (filer only) — default to `forms`, see `references/wrds-forms-tables.md`
SEC EDGAR	`wrdssec_all`	`forms` (raw index, all CIKs per filing — default), `wrds_forms` (filer-only view), `wciklink_cusip`
SEC Search	`wrds_sec_search`	`filing_view`, `registrant`
EDGAR	`edgar`	`filings`, `filing_docs`
Fama-French	`ff`	`factors_monthly`, `factors_daily`
LSEG/Datastream	`tr_ds`	`ds2constmth`, `ds2indexlist`
FJC (Federal Judicial Center)	`fjc`	`civil`, `criminal`, `bankruptcy`, `appeals`
FJC Linking	`fjc_linking`	`wrds_civil_link`, `wrds_criminal_link`
SDC New Issues (IPO/SEO/Debt)	`tr_sdc_ni`	`wrds_ni_details` — equity + debt offerings
SDC Mergers & Acquisitions	`tr_sdc_ma`	`wrds_ma_details` — M&A transactions
TAQ Legacy	`taq`	`mast_YYYY`, `wrds_iid_YYYY` — second-level (1993–2006)
TAQ Millisecond	`taqmsec`	`mastm_YYYY`, `wrds_iid_YYYY`, `ctm_YYYYMM`, `complete_nbbo_YYYYMMDD`
Thomson S12 (Mutual Fund Holdings)	`tfn` (SAS) / `tr_mutualfunds` (PG)	`s12` — 13F/N-CSR fund holdings
Thomson S34 (13-F Institutional)	`tfn` (SAS) / `tr_13f` (PG)	`s34` — 13-F institutional holdings
FISD / Mergent (Bonds)	`fisd_fisd`	`fisd_mergedissue`, `fisd_mergedissuer`
PitchBook	`pitchbk_companies_deals`, `pitchbk_investors_funds_lps`, `pitchbk_fund_returns`	`deal`, `company`, `fund`, `wrds_fund_returns` — dealsize in USD millions

Connection

Initialize PostgreSQL connection to WRDS:

import psycopg2

conn = psycopg2.connect(
    host='wrds-pgdata.wharton.upenn.edu',
    port=9737,
    database='wrds',
    sslmode='require'
    # Credentials from ~/.pgpass
)

Configure authentication via ~/.pgpass with chmod 600:

wrds-pgdata.wharton.upenn.edu:9737:wrds:USERNAME:PASSWORD

Connect via SSH tunnel:

ssh wrds

This uses ~/.ssh/wrds_rsa for authentication.

Critical Filters

Compustat Standard Filters

Always include for clean fundamental data:

WHERE indfmt = 'INDL'
  AND datafmt = 'STD'
  AND popsrc = 'D'
  AND consol = 'C'

CRSP v2 Common Stock Filter

Equivalent to legacy shrcd IN (10, 11):

df = df.loc[
    (df.sharetype == 'NS') &
    (df.securitytype == 'EQTY') &
    (df.securitysubtype == 'COM') &
    (df.usincflg == 'Y') &
    (df.issuertype.isin(['ACOR', 'CORP']))
]

Form 4 Transaction Types

WHERE acqdisp = 'D'  -- Dispositions
  AND trancode IN ('S', 'D', 'G', 'F')  -- Sales, Dispositions, Gifts, Tax

Parameterized Queries

Always use parameterized queries (never string formatting):

Use scalar parameter binding for single values:

cursor.execute("""
    SELECT gvkey, conm FROM comp.company WHERE gvkey = %s
""", (gvkey,))

Use ANY() for list parameters:

cursor.execute("""
    SELECT * FROM comp.funda WHERE gvkey = ANY(%s)
""", (gvkey_list,))

Additional Resources

Reference Files

Detailed query patterns and table documentation:

references/compustat.md - Compustat tables, ExecuComp, financial variables
references/crsp.md - CRSP stock data, CCM linking, v2 format
references/insider-form4.md - Thomson Reuters Form 4, rolecodes, insider types
references/iss-compensation.md - ISS Incentive Lab, peer companies, compensation
references/formd.md - Form D / Reg D (canonical): two sources (WRDS wrds_vc_formd + SEC EDGAR TSV/XML), grain & keys, denormalization gotcha, exemption + industry codes, post-2020 gap, validated benchmarks
references/edgar.md - SEC EDGAR filings, URL construction, DCN vs accession numbers
references/connection.md - Connection pooling, caching, error handling
references/taq.md - TAQ: master files, IID, raw tick processing (NBBO, VWAP, closing auctions), CRSP–TAQ merge, era transition (legacy vs millisecond)
references/sas-etl.md - SAS hash objects, index-friendly WHERE, SGE array jobs, PROC SQL optimization
references/postgres-vs-sas.md - Decision guide: when to use PostgreSQL vs SAS for WRDS ETL (benchmarks, constraints, hybrid pattern)
references/fjc.md - FJC Integrated Database: civil/criminal case data, NOS codes, securities litigation queries, firm linking
references/sdc-issuances.md - SDC New Issues: IPOs, SEOs, 144A equity, debt offerings — schema discovery, cleaning filters, CRSP/Compustat linking
references/fisd-bonds.md - FISD/Mergent: corporate bond issuances, IG vs HY, 144A vs registered, rating classification, TRACE linking
references/sdc-ma.md - SDC M&A: deal counts, PE/LBO vs strategic buyer, deal status codes, public vs private target
references/fund-formation.md - Fund formation: Form D (pooled investment funds), EDGAR N-2 (closed-end fund IPOs), Form ADV (RIA registrations)
references/pitchbook.md - PitchBook: schema architecture, dealsize/fundsize in USD millions, dealdate outliers, CIK crosswalk, fund performance (wrds_fund_returns), PE/VC/fund formation patterns
references/proxy-advisors.md - Proxy-advisor customer identification: 485BPOS/485APOS body scan for ISS/Glass Lewis/Egan-Jones name variants; CRSP MFDB lift to mgmt_cd × year; validates against chongshu published CSV

Example Files

Working code from real projects:

examples/form4_disposals.py - Insider trading analysis (from SVB project)
examples/wrds_connector.py - Connection pooling pattern
examples/formd_regd.ipynb - Form D / Reg D: dedup validation, SEC TSV download, exemption trend charts
examples/sdc_issuances_eda.ipynb - SDC New Issues: annual IPO/SEO/debt counts, 144A share, IG vs HY breakdown
examples/sdc_ma_eda.ipynb - SDC M&A: annual deal counts, PE/LBO vs strategic, public vs private target trends
examples/fund_formation_eda.ipynb - Fund formation: Form D 3C.1/3C.7 counts, EDGAR N-2 closed-end fund IPOs, Form ADV RIA registrations
examples/pitchbook_eda.ipynb - PitchBook: PE deal activity, VC rounds by stage, fund formation by vintage, IRR/TVPI by strategy
examples/voting_ownership_pipeline/ - Self-contained hybrid SAS+Python pipeline: ISS votes, 13-F inst. ownership, MF holdings via MFLINKS, merged panel. Canonical example of PostgreSQL vs SAS decision-making on WRDS. See README.md for architecture and usage.

Scripts

scripts/test_connection.py - Validate WRDS connectivity

Local Sample Notebooks

WRDS-provided samples at ~/resources/wrds-code-samples/:

ResearchApps/CCM2025.ipynb - Modern CRSP-Compustat merge
ResearchApps/ff3_crspCIZ.ipynb - Fama-French factor construction
comp/sas/execcomp_ceo_screen.sas - ExecuComp patterns

Date Awareness

When querying historical data, leverage current date context for dynamic range calculations.

Current date is automatically available via datetime.now(). Apply this to:

Data range validation (e.g., "get data for last 5 years")
Fiscal year calculations
Event study windows

Implement dynamic date ranges in queries:

from datetime import datetime, timedelta

# Query last 5 years of data
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)

query = """
SELECT * FROM comp.funda
WHERE datadate BETWEEN %s AND %s
"""
df = pd.read_sql(query, conn, params=(start_date, end_date))

Always incorporate current date awareness in date-dependent queries to ensure results remain fresh across time.

wrds

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

wrds

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

Contents

WRDS Login Node Enforcement

IRON LAW: NEVER RUN COMPUTE ON THE WRDS LOGIN NODE

Login Node & Infrastructure Facts

Red Flags — STOP Immediately If You're About To:

IRON LAW: USE SCAN_COVERS, NOT STANDALONE BINARIES

WRDS Data Access

Query Enforcement

IRON LAW: NO QUERY WITHOUT FILTER VALIDATION FIRST

Red Flags

Query Validation Checklist

SAS ETL Enforcement

IRON LAW: NO SAS CODE WITHOUT PERFORMANCE VALIDATION FIRST

SAS Code Validation Checklist

SAS Performance Facts

SAS Red Flags - STOP Immediately If You're About To:

SAS Reference

Quick Reference: Table Names

Connection

Critical Filters

Compustat Standard Filters

CRSP v2 Common Stock Filter

Form 4 Transaction Types

Parameterized Queries

Additional Resources

Reference Files

Example Files

Scripts

Local Sample Notebooks

Date Awareness

Similar Skills

Contents

WRDS Login Node Enforcement

IRON LAW: NEVER RUN COMPUTE ON THE WRDS LOGIN NODE

Login Node & Infrastructure Facts

Red Flags — STOP Immediately If You're About To:

IRON LAW: USE SCAN_COVERS, NOT STANDALONE BINARIES

WRDS Data Access

Query Enforcement

IRON LAW: NO QUERY WITHOUT FILTER VALIDATION FIRST

Red Flags

Query Validation Checklist

SAS ETL Enforcement

IRON LAW: NO SAS CODE WITHOUT PERFORMANCE VALIDATION FIRST

SAS Code Validation Checklist

SAS Performance Facts

SAS Red Flags - STOP Immediately If You're About To:

SAS Reference

Quick Reference: Table Names

Connection

Critical Filters

Compustat Standard Filters

CRSP v2 Common Stock Filter

Form 4 Transaction Types

Parameterized Queries

Additional Resources

Reference Files

Example Files

Scripts

Local Sample Notebooks

Date Awareness

Similar Skills