From workflows
Queries WRDS PostgreSQL databases (Compustat, CRSP, ExecuComp, Capital IQ, ISS, TAQ, SDC, DealScan, PitchBook, FISD, FJC) and submits SAS ETL jobs via SGE grid (qsub). Enforces login-node compute restrictions.
How this skill is triggered — by the user, by Claude, or both
Slash command
/workflows:wrdsThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- [WRDS Login Node Enforcement](#wrds-login-node-enforcement)
examples/blockholders_pipeline/README.mdexamples/blockholders_pipeline/pull_blockholders.pyexamples/blockholders_pipeline/redo_bridge.pyexamples/blockholders_pipeline/src/__init__.pyexamples/blockholders_pipeline/src/aggregate.pyexamples/blockholders_pipeline/src/parser.pyexamples/form4_disposals.pyexamples/form4_pipeline/README.mdexamples/form4_pipeline/form4_step1_query_filings.pyexamples/form4_pipeline/form4_step2_download_xmls.shexamples/form4_pipeline/form4_step3_parse_xmls.pyexamples/form4_pipeline/pull_insider_ownership.pyexamples/form4_pipeline/sas/pull_tr_insiders.sasexamples/form4_pipeline/sas/run_insider_array.shexamples/formd_regd.ipynbexamples/fund_formation_eda.ipynbexamples/lpc_dealscan_eda.ipynbexamples/lpc_dealscan_eda.pyexamples/pitchbook_eda.ipynbexamples/proxy_advisors_pipeline/README.mdALWAYS write an SGE submission script and submit via qsub. No exceptions.
ssh wrds 'cat files.tsv | ./parser > output.tsv' → WRONG. Use qsub.ssh wrds 'nohup ./process &' → WRONG. Still the login node. Use qsub.ssh wrds 'python3 bulk_process.py' → WRONG. Use qsub.qsub -t 1-20 submit.sh → CORRECT.The login node is for: qsub, qstat, qdel, scp, ls, head, short psql queries.
See references/constraints/wrds-sge-enforcement.md for the full pattern and existing examples (quorum parser, state-of-incorp parser, SAS pipeline).
Running compute on the login node is NOT HELPFUL — it gets the user's account flagged, the job killed, and the work lost. You run on the login node because qsub feels like overhead. The overhead is 5 minutes of script writing. The downside is account suspension and a rerun from scratch.
qsub -t 1-1 submit.sh. The login-node "quick test" is the run that flags the account — one file becomes 100K when the command changes, and 173K filings over NFS is not 30 seconds.submit_quorum.sh. Citing it as login-node precedent is an unverified claim presented as fact.wrds_clean_filings path convention is cik_int.zfill(10)[:6]/{cik_int}/{accession}.txt (see references/edgar.md). Hand-rolled path logic gets this wrong.scan_covers profiles handle header extraction, body parsing, and custom extractors (Custom field type) — "this parser is different enough to need its own binary" has not yet been true once.ssh wrds '... | ./binary > output' → STOP. That's login-node compute. Write a submit script.ssh wrds 'nohup ... &' → STOP. nohup doesn't change the node. Use qsub.ssh wrds 'python3 ...' for anything that reads >10 files → STOP. Use qsub.references/edgar.md before building a new WRDS file parser → STOP. The path conventions, SGE patterns, and existing parsers are already documented. Read them first.scripts/scan_covers/ is a generic profile-based framework. Add a profiles_*.go file, not a new binary. The framework handles SGE sharding, path construction, concurrency, and form-type filtering.scripts/scan_covers/ → STOP. This framework exists precisely so you don't reinvent extraction infrastructure. Every standalone parser is technical debt that should have been a profile.scripts/scan_covers/ — generic profile-based Go framework with SGE, concurrency, path handlingprofiles_*.go file — not a standalone binary. The Profile struct supports pattern-based fields AND custom extractors (set FullBody: true for body-text searches like prospectus 485 filings — see profiles_proxy_advisors.go)references/edgar.md — path conventions, existing profiles, SGE submission patternsBuilding a standalone parser when scan_covers exists is NOT HELPFUL — it reinvents infrastructure that already handles SGE sharding, NFS concurrency, path construction, form-type filtering, and error handling. You built a 300-line standalone Go binary, ran it on the login node, got the path convention wrong, and spent 5 iterations fixing it. Adding a 60-line profile to scan_covers would have worked on the first try.
Every standalone EDGAR parser is technical debt. The scan_covers framework exists to eliminate this class of mistake.
WRDS (Wharton Research Data Services) provides academic research data via PostgreSQL at wrds-pgdata.wharton.upenn.edu:9737.
Before executing ANY WRDS query, you MUST:
This is not negotiable. Skipping sample inspection is NOT HELPFUL — the user builds analysis on data with undetected quality problems.
.head()/.sample() first; query success ≠ data quality.Before EVERY query execution:
For Compustat queries (comp.funda, comp.fundq):
indfmt = 'INDL'datafmt = 'STD'popsrc = 'D'consol = 'C'For CRSP v2 queries (crsp.dsf_v2, crsp.msf_v2):
sharetype == 'NS'securitytype == 'EQTY'securitysubtype == 'COM'usincflg == 'Y'issuertype.isin(['ACOR', 'CORP'])For Form 4 queries (tr_insiders.table1):
For ALL queries:
.head() or .sample() BEFORE claiming successWriting SAS code that forces full table scans when indexes exist is NOT HELPFUL — the user's job runs 100x slower than necessary and may timeout.
Before EVERY SAS program execution:
For merges/joins:
PROC SORT + DATA merge)defineKey/defineData/defineDone pattern correctlyh.output() uses double quotes for macro resolution (not single quotes)call missing() initializes hash data variables for non-matchesFor WHERE clauses (CRITICAL):
year(date), month(date), datepart(dt) wrapping indexed columnsBETWEEN "01jan&year."d AND "31dec&year."d range patternupcase(), substr() on indexed columnsyear() = X AND quarter() = Y)For batch processing:
#$ -t start-end) not sequential loop-sysparm (not -set or %sysget)#$ -l m_mem_free=4G minimum)For PROC SQL:
calculated keyword used for computed column references in HAVINGFor macros:
&year. not &year)options mprint mlogic symbolgen used during developmentPROC SORT + MERGE and need no sorting; PROC SQL still sorts for joins. The hash is 5 extra lines — choosing sort-merge for a lookup join makes the user's job slower for your convenience.year(date) (or any function) on an indexed column forces a full table scan over millions of rows; BETWEEN with date literals uses the index.h.output(dataset: '...') block macro resolution — the output dataset name comes out wrong. Always double quotes.%sysget is unreliable under SGE — it may return blank silently. Pass the year via -sysparm + &sysparm..where year(date) = anything → STOP. Use BETWEEN with date literals.proc sort; data; merge for a lookup join → STOP. Use hash object.%do year = start %to end loop → STOP. Use SGE array job.h.output(dataset: '...') → STOP. Use double quotes.-set or %sysget for SGE task parameters → STOP. Use -sysparm.See references/sas-etl.md for complete patterns:
| Dataset | Schema | Key Tables |
|---|---|---|
| Compustat | comp | company, funda, fundq, secd |
| ExecuComp | comp_execucomp | anncomp |
| CRSP | crsp | dsf, msf, stocknames, ccmxpf_linkhist |
| CRSP v2 | crsp | dsf_v2, msf_v2, stocknames_v2 |
| Form 4 Insiders | tr_insiders | table1, header, company |
| ISS Incentive Lab | iss_incentive_lab | comppeer, sumcomp, participantfy |
| Capital IQ | ciq | wrds_compensation |
| IBES | tr_ibes | det_epsus, statsum_epsus |
| Form D / Reg D | wrdssec | wrds_vc_formd (parsed, 2000–2020); index: wrdssec_all.forms (all CIKs) or wrds_forms (filer only) — default to forms, see references/wrds-forms-tables.md |
| SEC EDGAR | wrdssec_all | forms (raw index, all CIKs per filing — default), wrds_forms (filer-only view), wciklink_cusip |
| SEC Search | wrds_sec_search | filing_view, registrant |
| EDGAR | edgar | filings, filing_docs |
| Fama-French | ff | factors_monthly, factors_daily |
| LSEG/Datastream | tr_ds | ds2constmth, ds2indexlist |
| FJC (Federal Judicial Center) | fjc | civil, criminal, bankruptcy, appeals |
| FJC Linking | fjc_linking | wrds_civil_link, wrds_criminal_link |
| SDC New Issues (IPO/SEO/Debt) | tr_sdc_ni | wrds_ni_details — equity + debt offerings |
| SDC Mergers & Acquisitions | tr_sdc_ma | wrds_ma_details — M&A transactions |
| TAQ Legacy | taq | mast_YYYY, wrds_iid_YYYY — second-level (1993–2006) |
| TAQ Millisecond | taqmsec | mastm_YYYY, wrds_iid_YYYY, ctm_YYYYMM, complete_nbbo_YYYYMMDD |
| Thomson S12 (Mutual Fund Holdings) | tfn (SAS) / tr_mutualfunds (PG) | s12 — 13F/N-CSR fund holdings |
| Thomson S34 (13-F Institutional) | tfn (SAS) / tr_13f (PG) | s34 — 13-F institutional holdings |
| FISD / Mergent (Bonds) | fisd_fisd | fisd_mergedissue, fisd_mergedissuer |
| PitchBook | pitchbk_companies_deals, pitchbk_investors_funds_lps, pitchbk_fund_returns | deal, company, fund, wrds_fund_returns — dealsize in USD millions |
Initialize PostgreSQL connection to WRDS:
import psycopg2
conn = psycopg2.connect(
host='wrds-pgdata.wharton.upenn.edu',
port=9737,
database='wrds',
sslmode='require'
# Credentials from ~/.pgpass
)
Configure authentication via ~/.pgpass with chmod 600:
wrds-pgdata.wharton.upenn.edu:9737:wrds:USERNAME:PASSWORD
Connect via SSH tunnel:
ssh wrds
This uses ~/.ssh/wrds_rsa for authentication.
Always include for clean fundamental data:
WHERE indfmt = 'INDL'
AND datafmt = 'STD'
AND popsrc = 'D'
AND consol = 'C'
Equivalent to legacy shrcd IN (10, 11):
df = df.loc[
(df.sharetype == 'NS') &
(df.securitytype == 'EQTY') &
(df.securitysubtype == 'COM') &
(df.usincflg == 'Y') &
(df.issuertype.isin(['ACOR', 'CORP']))
]
WHERE acqdisp = 'D' -- Dispositions
AND trancode IN ('S', 'D', 'G', 'F') -- Sales, Dispositions, Gifts, Tax
Always use parameterized queries (never string formatting):
Use scalar parameter binding for single values:
cursor.execute("""
SELECT gvkey, conm FROM comp.company WHERE gvkey = %s
""", (gvkey,))
Use ANY() for list parameters:
cursor.execute("""
SELECT * FROM comp.funda WHERE gvkey = ANY(%s)
""", (gvkey_list,))
Detailed query patterns and table documentation:
references/compustat.md - Compustat tables, ExecuComp, financial variablesreferences/crsp.md - CRSP stock data, CCM linking, v2 formatreferences/insider-form4.md - Thomson Reuters Form 4, rolecodes, insider typesreferences/iss-compensation.md - ISS Incentive Lab, peer companies, compensationreferences/formd.md - Form D / Reg D (canonical): two sources (WRDS wrds_vc_formd + SEC EDGAR TSV/XML), grain & keys, denormalization gotcha, exemption + industry codes, post-2020 gap, validated benchmarksreferences/edgar.md - SEC EDGAR filings, URL construction, DCN vs accession numbersreferences/connection.md - Connection pooling, caching, error handlingreferences/taq.md - TAQ: master files, IID, raw tick processing (NBBO, VWAP, closing auctions), CRSP–TAQ merge, era transition (legacy vs millisecond)references/sas-etl.md - SAS hash objects, index-friendly WHERE, SGE array jobs, PROC SQL optimizationreferences/postgres-vs-sas.md - Decision guide: when to use PostgreSQL vs SAS for WRDS ETL (benchmarks, constraints, hybrid pattern)references/fjc.md - FJC Integrated Database: civil/criminal case data, NOS codes, securities litigation queries, firm linkingreferences/sdc-issuances.md - SDC New Issues: IPOs, SEOs, 144A equity, debt offerings — schema discovery, cleaning filters, CRSP/Compustat linkingreferences/fisd-bonds.md - FISD/Mergent: corporate bond issuances, IG vs HY, 144A vs registered, rating classification, TRACE linkingreferences/sdc-ma.md - SDC M&A: deal counts, PE/LBO vs strategic buyer, deal status codes, public vs private targetreferences/fund-formation.md - Fund formation: Form D (pooled investment funds), EDGAR N-2 (closed-end fund IPOs), Form ADV (RIA registrations)references/pitchbook.md - PitchBook: schema architecture, dealsize/fundsize in USD millions, dealdate outliers, CIK crosswalk, fund performance (wrds_fund_returns), PE/VC/fund formation patternsreferences/proxy-advisors.md - Proxy-advisor customer identification: 485BPOS/485APOS body scan for ISS/Glass Lewis/Egan-Jones name variants; CRSP MFDB lift to mgmt_cd × year; validates against chongshu published CSVWorking code from real projects:
examples/form4_disposals.py - Insider trading analysis (from SVB project)examples/wrds_connector.py - Connection pooling patternexamples/formd_regd.ipynb - Form D / Reg D: dedup validation, SEC TSV download, exemption trend chartsexamples/sdc_issuances_eda.ipynb - SDC New Issues: annual IPO/SEO/debt counts, 144A share, IG vs HY breakdownexamples/sdc_ma_eda.ipynb - SDC M&A: annual deal counts, PE/LBO vs strategic, public vs private target trendsexamples/fund_formation_eda.ipynb - Fund formation: Form D 3C.1/3C.7 counts, EDGAR N-2 closed-end fund IPOs, Form ADV RIA registrationsexamples/pitchbook_eda.ipynb - PitchBook: PE deal activity, VC rounds by stage, fund formation by vintage, IRR/TVPI by strategyexamples/voting_ownership_pipeline/ - Self-contained hybrid SAS+Python pipeline: ISS votes, 13-F inst. ownership, MF holdings via MFLINKS, merged panel. Canonical example of PostgreSQL vs SAS decision-making on WRDS. See README.md for architecture and usage.scripts/test_connection.py - Validate WRDS connectivityWRDS-provided samples at ~/resources/wrds-code-samples/:
ResearchApps/CCM2025.ipynb - Modern CRSP-Compustat mergeResearchApps/ff3_crspCIZ.ipynb - Fama-French factor constructioncomp/sas/execcomp_ceo_screen.sas - ExecuComp patternsWhen querying historical data, leverage current date context for dynamic range calculations.
Current date is automatically available via datetime.now(). Apply this to:
Implement dynamic date ranges in queries:
from datetime import datetime, timedelta
# Query last 5 years of data
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)
query = """
SELECT * FROM comp.funda
WHERE datadate BETWEEN %s AND %s
"""
df = pd.read_sql(query, conn, params=(start_date, end_date))
Always incorporate current date awareness in date-dependent queries to ensure results remain fresh across time.
Access financial data from LSEG/Refinitiv via the lseg.data Python API. Supports fundamentals, market data, ESG, symbology, deals, loans, funds, screening, and news with strict validation.
Accesses and extracts structured data from SEC EDGAR filings using the edgartools Python library. Supports financial statements, XBRL data, insider trading (Form 4), institutional holdings (13F), and company screening by ticker/CIK.
Access, analyze, and extract data from SEC EDGAR filings including financial statements, insider trading, and institutional holdings using Python.
npx claudepluginhub edwinhu/workflows --plugin workflows