Book SFT Pipeline

A complete system for converting books into SFT datasets and training style-transfer models. This skill teaches the pipeline from raw ePub to a model that writes in any author's voice.

When to Activate

Activate this skill when:

Building fine-tuning datasets from literary works
Creating author-voice or style-transfer models
Preparing training data for Tinker or similar SFT platforms
Designing text segmentation pipelines for long-form content
Training small models (8B or less) on limited data

Core Concepts

The Three Pillars of Book SFT

1. Intelligent Segmentation Text chunks must be semantically coherent. Breaking mid-sentence teaches the model to produce fragmented output. Target: 150-400 words per chunk, always at natural boundaries.

2. Diverse Instruction Generation Use multiple prompt templates and system prompts to prevent overfitting. A single prompt style leads to memorization. Use 15+ prompt templates with 5+ system prompts.

3. Style Over Content The goal is learning the author's rhythm and vocabulary patterns, not memorizing plots. Synthetic instructions describe what happens without quoting the text.

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR AGENT                           │
│  Coordinates pipeline phases, manages state, handles failures   │
└──────────────────────┬──────────────────────────────────────────┘
                       │
       ┌───────────────┼───────────────┬───────────────┐
       ▼               ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  EXTRACTION  │ │ SEGMENTATION │ │  INSTRUCTION │ │   DATASET    │
│    AGENT     │ │    AGENT     │ │    AGENT     │ │   BUILDER    │
│ ePub → Text  │ │ Text → Chunks│ │ Chunks →     │ │ Pairs →      │
│              │ │ 150-400 words│ │ Prompts      │ │ JSONL        │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
                       │
       ┌───────────────┴───────────────┐
       ▼                               ▼
┌──────────────┐               ┌──────────────┐
│   TRAINING   │               │  VALIDATION  │
│    AGENT     │               │    AGENT     │
│ LoRA on      │               │ AI detector  │
│ Tinker       │               │ Originality  │
└──────────────┘               └──────────────┘

Phase 1: Text Extraction

Critical Rules

Always source ePub over PDF - OCR errors become learned patterns
Use paragraph-level extraction - Extract from <p> tags to preserve breaks
Remove front/back matter - Copyright and TOC pollute the dataset

# Extract text from ePub paragraphs
from epub2 import EPub
from bs4 import BeautifulSoup

def extract_epub(path):
    book = EPub(path)
    chapters = []
    for item in book.flow:
        html = book.get_chapter(item.id)
        soup = BeautifulSoup(html, 'html.parser')
        paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
        chapters.append('\n\n'.join(p for p in paragraphs if p))
    return '\n\n'.join(chapters)

Phase 2: Intelligent Segmentation

Smaller Chunks + Overlap

Smaller chunks (150-400 words) produce more training examples and better style transfer than larger chunks (250-650).

def segment(text, min_words=150, max_words=400):
    paragraphs = text.split('\n\n')
    chunks, buffer, buffer_words = [], [], 0
    
    for para in paragraphs:
        words = len(para.split())
        if buffer_words + words > max_words and buffer_words >= min_words:
            chunks.append('\n\n'.join(buffer))
            # Keep last paragraph for overlap
            buffer = [buffer[-1], para] if buffer else [para]
            buffer_words = sum(len(p.split()) for p in buffer)
        else:
            buffer.append(para)
            buffer_words += words
    
    if buffer:
        chunks.append('\n\n'.join(buffer))
    return chunks

Expected Results

For an 86,000-word book:

Old method (250-650 words): ~150 chunks
New method (150-400 + overlap): ~300 chunks
With 2 variants per chunk: 600+ training examples

Phase 3: Diverse Instruction Generation

The Key Insight

Using a single prompt template causes memorization. Diverse templates teach the underlying style.

SYSTEM_PROMPTS = [
    "You are an expert creative writer capable of emulating specific literary styles.",
    "You are a literary writer with deep knowledge of classic prose styles.",
    "You are a creative writer skilled at emulating distinctive authorial voices.",
    "You write prose that captures the essence of modernist literature.",
    "You are a talented writer who can channel classic American authors.",
]

PROMPT_TEMPLATES = [
    "Write a passage in the style of {author}: {desc}",
    "Channel {author}'s voice to write about: {desc}",
    "In {author}'s distinctive prose style, describe: {desc}",
    "Write this scene as {author} would have: {desc}",
    "Using {author}'s repetitive technique, describe: {desc}",
    "Capture the rhythm of {author} in this passage: {desc}",
    "Write like {author}: {desc}",
    "In the voice of {author}, write: {desc}",
    "This is a literary exercise. Write like {author}: {desc}",
    "Can you write in {author}'s style? {desc}",
]

Instruction Generation

INSTRUCTION_PROMPT = """Describe what is happening in this excerpt in 2-3 sentences.
Focus on: characters present, actions, emotions, setting.
Do NOT quote the text directly.

Excerpt:
{text}
"""

# Use a fast, cheap LLM (e.g., Gemini Flash)
instruction = llm_call(INSTRUCTION_PROMPT.format(text=chunk))

Phase 4: Dataset Construction

Message Format

{
    "messages": [
        {"role": "system", "content": "You are an expert creative writer..."},
        {"role": "user", "content": "Write in the style of Author: Scene description..."},
        {"role": "assistant", "content": "The actual book text from chunk..."}
    ]
}

Multiple Variants Per Chunk

def build_examples(chunk, instruction, author, variants=2):
    examples = []
    for i in range(variants):
        system = SYSTEM_PROMPTS[i % len(SYSTEM_PROMPTS)]
        template = PROMPT_TEMPLATES[(chunk.id + i) % len(PROMPT_TEMPLATES)]
        user = template.format(author=author, desc=instruction)
        examples.append({"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": chunk.text}
        ]})
    return examples

Phase 5: LoRA Training on Tinker

Configuration

CONFIG = {
    "model_name": "Qwen/Qwen3-8B-Base",  # Base, not instruct
    "lora_rank": 32,                      # 352MB adapter
    "learning_rate": 5e-4,                # Higher for LoRA
    "batch_size": 4,
    "epochs": 3,
}

Why Base Model?

Use base (pretrained) models, not instruction-tuned versions:

Base models are more malleable for new styles
Instruct models have patterns that resist overwriting
Style is a low-level pattern that base models capture better

Training Loop

import tinker
from tinker import types

training_client = await service_client.create_lora_training_client_async(
    base_model="Qwen/Qwen3-8B-Base",
    rank=32
)

for epoch in range(3):
    for batch in batches:
        await training_client.forward_backward_async(batch, loss_fn="cross_entropy")
        await training_client.optim_step_async(types.AdamParams(learning_rate=5e-4))

result = await training_client.save_weights_for_sampler_async(name="final")

Phase 6: Validation

Modern Scenario Test

Test with scenarios that couldn't exist in the original book:

TEST_PROMPTS = [
    "Write about a barista making lattes",
    "Describe lovers communicating through text messages",
    "Write about someone anxious about climate change",
]

If the model applies style markers to modern scenarios, it learned style, not content.

Originality Verification

# Search training data for output phrases
grep "specific phrase from output" dataset.jsonl
# Should return: No matches

AI Detector Testing

Test outputs with GPTZero, Pangram, or ZeroGPT.

Known Issues and Solutions

Character Name Leakage

Symptom: Model uses original character names in new scenarios. Cause: Limited name diversity from one book. Solution: Train on multiple books or add synthetic examples.

Model Parrots Exact Phrases

Symptom: Outputs contain exact sentences from training data. Cause: Too few prompt variations or too many epochs. Solution: Use 15+ templates, limit to 3 epochs.

Fragmented Outputs

Symptom: Sentences feel incomplete. Cause: Poor segmentation breaking mid-thought. Solution: Always break at paragraph boundaries.

Guidelines

Always source ePub over PDF - OCR errors become learned patterns
Never break mid-sentence - Boundaries must be grammatically complete
Use diverse prompts - 15+ templates, 5+ system prompts
Use base models - Not instruct versions
Use smaller chunks - 150-400 words for more examples
Reserve test set - 50 examples minimum
Test on modern scenarios - Proves style transfer vs memorization
Verify originality - Grep training data for output phrases

Expected Results

Metric	Value
Training examples	500-1000 per book
Model	Qwen/Qwen3-8B-Base
LoRA rank	32
Adapter size	~350 MB
Training time	~15 min
Loss reduction	90%+
Style transfer success	~50% perfect

Cost Estimate

Component	Cost
LLM (instruction generation)	~$0.50
Tinker training (15 min)	~$1.50
Total	~$2.00

References

Segmentation Strategies - Text chunking patterns
Tinker Format Specification - Datum structure
Tinker API Documentation - Full API reference

Created: 2025-12-26 Version: 2.0.0

book-sft-pipeline

Book SFT Pipeline

When to Activate

Core Concepts

The Three Pillars of Book SFT

Pipeline Architecture

Phase 1: Text Extraction

Critical Rules

Phase 2: Intelligent Segmentation

Smaller Chunks + Overlap

Expected Results

Phase 3: Diverse Instruction Generation

The Key Insight

Instruction Generation

Phase 4: Dataset Construction

Message Format

Multiple Variants Per Chunk

Phase 5: LoRA Training on Tinker

Configuration

Why Base Model?

Training Loop

Phase 6: Validation

Modern Scenario Test

Originality Verification

AI Detector Testing

Known Issues and Solutions

Character Name Leakage

Model Parrots Exact Phrases

Fragmented Outputs

Guidelines

Expected Results

Cost Estimate

References

Similar Skills