Extracts structured PII spans (names, emails, phones, addresses, accounts, secrets) from text by running the OpenAI Privacy Filter 1.5B model in reverse, backed by a hybrid regex backstop. Each span carries a label, the matched text, and character offsets.
npx claudepluginhub joshuarweaver/cascade-ai-ml-agents-misc-1 --plugin aradotso-trending-skills-37
This skill uses the workspace's default tool permissions.
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
privacy-parser is the inverse of OpenAI's Privacy Filter. Where the filter masks PII with <REDACTED>, this library returns structured spans — label, text, and character offsets — using the same 1.5B opf model weights and label taxonomy.
# Clone the repo (includes both subpackages)
git clone https://github.com/chiefautism/privacy-parser
cd privacy-parser
uv venv
uv pip install -e ./privacy-filter # installs the opf model + weights loader
uv pip install -e ./pii_parser # installs the parser library
First run downloads the opf 1.5B checkpoint (~3 GB) to ~/.opf/privacy_filter/.
from pii_parser.hybrid import HybridPIIParser
parser = HybridPIIParser(device="cpu") # or "cuda" / "mps"
result = parser.parse(
    "Hi Quindle Testwick (quindle.testwick@openai.com / +1-415-555-0102), "
    "account 40702810500001234567, 14 Beautiful Ct, Anytown USA, "
    "password Priv4cy-Filt3r-2026."
)
for span in result.spans:
    print(f"{span.label:18} {span.text}")
Output:
private_person Quindle Testwick
private_email quindle.testwick@openai.com
private_phone +1-415-555-0102
account_number 40702810500001234567
private_address 14 Beautiful Ct, Anytown USA
secret Priv4cy-Filt3r-2026
Choose the backend based on your speed/accuracy tradeoff:
| Backend | Weights | Speed | F1 | When to use |
|---|---|---|---|---|
| PIIParser | none | µs | 1.000 | Tests, known-format structured data |
| ModelPIIParser | 1.5B | ~500 ms CPU | 0.733 | Model-only, no post-processing |
| HybridPIIParser | 1.5B | ~600 ms CPU | 0.929 | Production (ship this one) |
# Regex-only (no model, instant, high precision on structured formats)
from pii_parser import PIIParser
parser = PIIParser()
# Model-only (raw BIOES logits → Viterbi → spans)
from pii_parser.model import ModelPIIParser
parser = ModelPIIParser(device="cpu")
# Hybrid: model + span-merge + regex backstop (recommended)
from pii_parser.hybrid import HybridPIIParser
parser = HybridPIIParser(device="cpu")
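The model-only path is described as BIOES logits → Viterbi → spans. The following is a minimal pure-Python sketch of constrained Viterbi decoding for a single entity type. The tag set, the scores, and the hard allow/deny transition table are illustrative assumptions; the library's tuned transitions are not shown in this document.

```python
import math

# Hypothetical single-label BIOES tag set (assumption: the real model
# has one B/I/E/S group per label plus O).
TAGS = ["O", "B", "I", "E", "S"]

# Valid BIOES transitions: an entity opened by B must continue with I or E.
ALLOWED = {
    "O": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"O", "B", "S"},
    "S": {"O", "B", "S"},
}
START = {"O", "B", "S"}  # a sequence cannot begin inside an entity
END = {"O", "E", "S"}    # and cannot finish inside one


def viterbi(scores):
    """scores: one {tag: score} dict per token. Returns the best valid tag path."""
    if not scores:
        return []
    best = [{t: (scores[0][t] if t in START else -math.inf) for t in TAGS}]
    back = []
    for i in range(1, len(scores)):
        row, ptr = {}, {}
        for cur in TAGS:
            # Best predecessor among tags that are allowed to precede `cur`.
            prev_score, prev_tag = max(
                (best[-1][p], p) for p in TAGS if cur in ALLOWED[p]
            )
            row[cur] = prev_score + scores[i][cur]
            ptr[cur] = prev_tag
        best.append(row)
        back.append(ptr)
    last = max((best[-1][t], t) for t in END)[1]
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def tags_to_spans(tags):
    """Convert a BIOES tag path into (start_token, end_token_exclusive) pairs."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.append((i, i + 1))
        elif tag == "B":
            start = i
        elif tag == "E" and start is not None:
            spans.append((start, i + 1))
            start = None
    return spans


# Tokens: ["Hi", "Quindle", "Testwick", "!"]. Greedy argmax would pick
# O, B, O, O, which is not a valid BIOES sequence.
scores = [
    {"O": 2.0, "B": 0.0, "I": 0.0, "E": 0.0, "S": 0.0},
    {"O": 0.0, "B": 1.5, "I": 0.0, "E": 0.0, "S": 1.0},
    {"O": 1.0, "B": 0.0, "I": 0.2, "E": 0.9, "S": 0.0},
    {"O": 2.0, "B": 0.0, "I": 0.0, "E": 0.0, "S": 0.0},
]
path = viterbi(scores)
token_spans = tags_to_spans(path)
```

Here the decoder prefers the valid path O, B, E, O over the per-token argmax, which is the reason to decode with transition constraints rather than tag each token independently.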
Each span in result.spans has:
span.label # str — one of the 8 label types
span.text # str — the extracted substring
span.start # int — char offset in original string
span.end # int — char offset (exclusive)
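Because end is exclusive, slicing the input with source[span.start:span.end] should reproduce span.text. A small sanity check along those lines, using a hypothetical Span dataclass as a stand-in for the library's span object (only the four documented fields are assumed):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Hypothetical stand-in mirroring the four documented fields.
    label: str
    text: str
    start: int
    end: int


def check_offsets(source, spans):
    """Return the spans whose text does not equal source[start:end]."""
    return [s for s in spans if source[s.start:s.end] != s.text]


source = "Email Bob at bob@example.com"
spans = [
    Span("private_person", "Bob", 6, 9),
    Span("private_email", "bob@example.com", 13, 28),
]
mismatched = check_offsets(source, spans)  # empty when all offsets line up
```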
private_person — full names of individuals
private_email — email addresses
private_phone — phone numbers (any format)
private_address — street/postal addresses
private_url — personal/private URLs
private_date — dates tied to individuals
account_number — bank/card/account identifiers
secret — passwords, tokens, API keys
from pii_parser.hybrid import HybridPIIParser
parser = HybridPIIParser(device="cpu")
texts = [
    "Email Bob at bob@example.com",
    "SSN: 123-45-6789, DOB: 1990-03-15",
    "Token: ghp_abc123XYZ",
]
for text in texts:
    result = parser.parse(text)
    if result.spans:
        print(f"Text: {text!r}")
        for s in result.spans:
            print(f"  [{s.start}:{s.end}] {s.label} → {s.text!r}")
        print()
result = parser.parse(long_document)
emails = [s for s in result.spans if s.label == "private_email"]
phones = [s for s in result.spans if s.label == "private_phone"]
secrets = [s for s in result.spans if s.label == "secret"]
accounts = [s for s in result.spans if s.label == "account_number"]
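When several labels are needed at once, a single pass that groups spans by label avoids repeating the list comprehension. A minimal sketch; the Span namedtuple is a hypothetical stand-in for the objects in result.spans, and group_by_label is not part of the library:

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the library's span objects.
Span = namedtuple("Span", ["label", "text", "start", "end"])


def group_by_label(spans):
    """Map each label to the list of extracted texts, in document order."""
    groups = defaultdict(list)
    for s in sorted(spans, key=lambda s: s.start):
        groups[s.label].append(s.text)
    return dict(groups)


spans = [
    Span("private_email", "jane@corp.io", 10, 22),
    Span("private_person", "Jane Doe", 0, 8),
    Span("private_email", "j.doe@corp.io", 24, 37),
]
grouped = group_by_label(spans)
emails = grouped.get("private_email", [])
```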
def redact(text: str, spans) -> str:
    """Replace extracted PII with [LABEL] tokens."""
    chars = list(text)
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        chars[span.start:span.end] = f"[{span.label.upper()}]"
    return "".join(chars)
text = "Call Alice at 555-0100 re: account 9988776655."
result = parser.parse(text)
clean = redact(text, result.spans)
# "Call [PRIVATE_PERSON] at [PRIVATE_PHONE] re: account [ACCOUNT_NUMBER]."
import json
result = parser.parse("Jane Doe, jane@corp.io, +44 20 7946 0958")
payload = [
    {"label": s.label, "text": s.text, "start": s.start, "end": s.end}
    for s in result.spans
]
print(json.dumps(payload, indent=2))
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
parser = HybridPIIParser(device=device)
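The quick start also lists "mps" as a device option. A slightly broader selection helper might look like the sketch below; pick_device is a hypothetical name, and the fallback to "cpu" when torch is not importable is an assumption made only to keep the snippet self-contained.

```python
def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU."""
    try:
        import torch
    except ImportError:
        # Assumption: without torch installed, report "cpu" instead of failing.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older torch builds may not expose torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
```

The resulting string can then be passed as the device argument, as in the line above.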
# Parse a string directly
python -m pii_parser.cli_model "Alice paid 40702810500001234567 on 2026-05-17."
# Pipe text from a file
cat dump.txt | python -m pii_parser.cli_model -
text
↓
opf 1.5B → BIOES logits → Viterbi (tuned transitions) → char spans
↓
span-merge (glues multi-token names: "Quindle" + "Testwick" → one span)
↓
regex backstop (URL, secret, account_number — fills model gaps)
↓
result.spans[]
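The last two stages of the pipeline above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the Span type, the merge rule (same label, whitespace-only gap), and the two regex patterns are all hypothetical.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    # Hypothetical stand-in for the library's span object.
    label: str
    text: str
    start: int
    end: int


def merge_adjacent(text, spans, max_gap=1):
    """Glue same-label spans separated by at most max_gap whitespace chars."""
    merged = []
    for s in sorted(spans, key=lambda s: s.start):
        if merged:
            prev = merged[-1]
            gap = text[prev.end:s.start]
            if prev.label == s.label and len(gap) <= max_gap and gap.strip() == "":
                end = max(prev.end, s.end)
                merged[-1] = Span(prev.label, text[prev.start:end], prev.start, end)
                continue
        merged.append(s)
    return merged


# Illustrative patterns only; the library's real patterns are not documented here.
BACKSTOP = {
    "private_url": re.compile(r"https?://\S+"),
    "account_number": re.compile(r"\b\d{16,20}\b"),
}


def regex_backstop(text, spans):
    """Add regex matches only where no existing span already covers the text."""
    out = list(spans)
    for label, pattern in BACKSTOP.items():
        for m in pattern.finditer(text):
            overlaps = any(m.start() < s.end and s.start < m.end() for s in out)
            if not overlaps:
                out.append(Span(label, m.group(), m.start(), m.end()))
    return sorted(out, key=lambda s: s.start)


text = "Quindle Testwick paid 40702810500001234567"
# Pretend the model tagged each name token separately and missed the account.
model_spans = [
    Span("private_person", "Quindle", 0, 7),
    Span("private_person", "Testwick", 8, 16),
]
final = regex_backstop(text, merge_adjacent(text, model_spans))
```

Letting model spans win on overlap keeps the regex stage purely additive, which matches its description as a backstop that fills model gaps.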
# Full fixture suite + latency benchmark
python pii_parser/tests/test_hybrid.py
Expected output:
Fixture F1: 0.929
Scenarios: 8/8 passed
Latency: ~600 ms CPU
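This document does not say how the fixture F1 is computed. One common convention is exact-match span-level F1, where a prediction counts only if label, start, and end all match a gold span. A sketch under that assumption, with made-up spans:

```python
def span_f1(predicted, gold):
    """Exact-match F1 over (label, start, end) tuples."""
    pred, true = set(predicted), set(gold)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [
    ("private_person", 0, 8),
    ("private_email", 10, 22),
    ("private_phone", 24, 40),
]
predicted = [
    ("private_person", 0, 8),
    ("private_email", 10, 22),
    ("private_phone", 24, 39),  # off by one at the end, so it counts as a miss
]
score = span_f1(predicted, gold)  # precision = recall = 2/3
```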
Slow first run — The checkpoint (~3 GB) downloads to ~/.opf/privacy_filter/ on first use. Subsequent runs load from cache.
CUDA out of memory — Use device="cpu" or reduce batch size; the 1.5B model requires ~3 GB VRAM on GPU.
Low recall on secrets/URLs — Use HybridPIIParser (not ModelPIIParser); the regex backstop specifically covers these labels.
Span text doesn't match offsets — Offsets are byte-safe character indices into the original string passed to parse(). Do not preprocess/strip the string before parsing if you need offsets to remain valid.
Import error on privacy_filter — Ensure you installed both packages: uv pip install -e ./privacy-filter AND uv pip install -e ./pii_parser.
Model not found — Delete ~/.opf/privacy_filter/ and re-run to trigger a fresh download.
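On the offsets item above: if the text has to be stripped before parsing, spans can still be mapped back to the raw string by recording how many leading characters were removed. A minimal sketch; the span position used here is hard-coded for illustration and is not parser output.

```python
def lstrip_with_shift(original):
    """Strip leading whitespace and report how many characters were dropped."""
    stripped = original.lstrip()
    return stripped, len(original) - len(stripped)


original = "   Email Bob at bob@example.com"
stripped, shift = lstrip_with_shift(original)

# Suppose a span for "Bob" was reported at [6:9] in the stripped text.
start, end = 6, 9
raw_start, raw_end = start + shift, end + shift
recovered = original[raw_start:raw_end]
```

This only covers leading whitespace. Edits in the middle of the string (collapsing spaces, Unicode normalization) would need a full index map instead of a single shift.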