From example-skills
Processes PDFs in Python: extracts text/tables (pdfplumber), merges/splits/rotates (pypdf), generates new PDFs (reportlab). Triggers on PDF tasks or .pdf files.
npx claudepluginhub joshuarweaver/cascade-code-general-misc-3 --plugin marcelleon-skills-zhThis skill uses the workspace's default tool permissions.
本技能覆盖常见 PDF 工作流(Python + 命令行)。
Applies Acme Corporation brand guidelines including colors, fonts, layouts, and messaging to generated PowerPoint, Excel, and PDF documents.
Builds DCF models with sensitivity analysis, Monte Carlo simulations, and scenario planning for investment valuation and risk assessment.
Calculates profitability (ROE, margins), liquidity (current ratio), leverage, efficiency, and valuation (P/E, EV/EBITDA) ratios from financial statements in CSV, JSON, text, or Excel for investment analysis.
本技能覆盖常见 PDF 工作流(Python + 命令行)。
若要填写 PDF 表单,请额外阅读 forms.md;更深入的扩展能力可看 reference.md。
from pypdf import PdfReader, PdfWriter
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
text = ""
for page in reader.pages:
text += page.extract_text()
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
reader = PdfReader(pdf_file)
for page in reader.pages:
writer.add_page(page)
with open("merged.pdf", "wb") as output:
writer.write(output)
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
writer = PdfWriter()
writer.add_page(page)
with open(f"page_{i+1}.pdf", "wb") as output:
writer.write(output)
reader = PdfReader("document.pdf")
meta = reader.metadata
print(meta.title, meta.author, meta.subject, meta.creator)
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90)
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
writer.write(output)
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
print(page.extract_text())
with pdfplumber.open("document.pdf") as pdf:
for i, page in enumerate(pdf.pages):
tables = page.extract_tables()
for j, table in enumerate(tables):
print(f"Table {j+1} on page {i+1}:")
for row in table:
print(row)
import pandas as pd
import pdfplumber
all_tables = []
with pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
for table in page.extract_tables():
if table:
all_tables.append(pd.DataFrame(table[1:], columns=table[0]))
if all_tables:
pd.concat(all_tables, ignore_index=True).to_excel("extracted_tables.xlsx", index=False)
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")
c.line(100, height - 140, 400, height - 140)
c.save()
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
story.append(Paragraph("Report Title", styles["Title"]))
story.append(Spacer(1, 12))
story.append(Paragraph("This is the body of the report. " * 20, styles["Normal"]))
story.append(PageBreak())
story.append(Paragraph("Page 2", styles["Heading1"]))
story.append(Paragraph("Content for page 2", styles["Normal"]))
doc.build(story)
不要直接用 Unicode 下标/上标字符(在默认字体里容易显示为黑方块)。
用 Paragraph 的标记语法:
from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet
styles = getSampleStyleSheet()
chemical = Paragraph("H<sub>2</sub>O", styles["Normal"])
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles["Normal"])
pdftotextpdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt
pdftotext -f 1 -l 5 input.pdf output.txt
qpdfqpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf output.pdf --rotate=+90:1
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
pdftk(若环境可用)pdftk file1.pdf file2.pdf cat output merged.pdf
pdftk input.pdf burst
pdftk input.pdf rotate 1east output rotated.pdf
import pytesseract
from pdf2image import convert_from_path
images = convert_from_path("scanned.pdf")
text = ""
for i, image in enumerate(images):
text += f"Page {i+1}:\n"
text += pytesseract.image_to_string(image)
text += "\n\n"
print(text)
from pypdf import PdfReader, PdfWriter
watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
page.merge_page(watermark)
writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
writer.write(output)
pdfimages -j input.pdf output_prefix
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
writer.write(output)
| 任务 | 推荐 |
|---|---|
| 合并/拆分/旋转 | pypdf |
| 文本/表格提取 | pdfplumber |
| 新建 PDF | reportlab |
| 命令行流水线 | qpdf/pdftotext |
| OCR | pytesseract + pdf2image |
| 表单填写 | 先读 forms.md |