Vision Framework Computer Vision

Guides you through implementing computer vision: subject segmentation, hand/body pose detection, person detection, text recognition, barcode detection, document scanning, and combining Vision APIs to solve complex problems.

When to Use This Skill

Use when you need to:

☑ Isolate subjects from backgrounds (subject lifting)
☑ Detect and track hand poses for gestures
☑ Detect and track body poses for fitness/action classification
☑ Segment multiple people separately
☑ Exclude hands from object bounding boxes (combining APIs)
☑ Choose between VisionKit and Vision framework
☑ Combine Vision with CoreImage for compositing
☑ Decide which Vision API solves your problem
☑ Recognize text in images (OCR)
☑ Detect barcodes and QR codes
☑ Scan documents with perspective correction
☑ Extract structured data from documents (iOS 26+)
☑ Build live scanning experiences (DataScannerViewController)

Example Prompts

"How do I isolate a subject from the background?" "I need to detect hand gestures like pinch" "How can I get a bounding box around an object without including the hand holding it?" "Should I use VisionKit or Vision framework for subject lifting?" "How do I segment multiple people separately?" "I need to detect body poses for a fitness app" "How do I preserve HDR when compositing subjects on new backgrounds?" "How do I recognize text in an image?" "I need to scan QR codes from camera" "How do I extract data from a receipt?" "Should I use DataScannerViewController or Vision directly?" "How do I scan documents and correct perspective?" "I need to extract table data from a document"

Red Flags

Signs you're making this harder than it needs to be:

❌ Manually implementing subject segmentation with CoreML models
❌ Using ARKit just for body pose (Vision works offline)
❌ Writing gesture recognition from scratch (use hand pose + simple distance checks)
❌ Processing on main thread (blocks UI - Vision is resource intensive)
❌ Training custom models when Vision APIs already exist
❌ Not checking confidence scores (low confidence = unreliable landmarks)
❌ Forgetting to convert coordinates (lower-left origin vs UIKit top-left)
❌ Building custom text recognizer when VNRecognizeTextRequest exists
❌ Using AVFoundation + Vision when DataScannerViewController suffices
❌ Processing every camera frame for scanning (skip frames, use region of interest)
❌ Enabling all barcode symbologies when you only need one (performance hit)
❌ Ignoring RecognizeDocumentsRequest when you need table/list structure (iOS 26+)

Mandatory First Steps

Before implementing any Vision feature:

1. Choose the Right API (Decision Tree)

What do you need to do?

┌─ Isolate subject(s) from background?
│  ├─ Need system UI + out-of-process → VisionKit
│  │  └─ ImageAnalysisInteraction (iOS/iPadOS)
│  │  └─ ImageAnalysisOverlayView (macOS)
│  ├─ Need custom pipeline / HDR / large images → Vision
│  │  └─ VNGenerateForegroundInstanceMaskRequest
│  └─ Need to EXCLUDE hands from object → Combine APIs
│     └─ Subject mask + Hand pose + custom masking (see Pattern 1)
│
├─ Segment people?
│  ├─ All people in one mask → VNGeneratePersonSegmentationRequest
│  └─ Separate mask per person (up to 4) → VNGeneratePersonInstanceMaskRequest
│
├─ Detect hand pose/gestures?
│  ├─ Just hand location → VNDetectHumanRectanglesRequest
│  └─ 21 hand landmarks → VNDetectHumanHandPoseRequest
│     └─ Gesture recognition → Hand pose + distance checks
│
├─ Detect body pose?
│  ├─ 2D normalized landmarks → VNDetectHumanBodyPoseRequest
│  ├─ 3D real-world coordinates → VNDetectHumanBodyPose3DRequest
│  └─ Action classification → Body pose + CreateML model
│
├─ Face detection?
│  ├─ Just bounding boxes → VNDetectFaceRectanglesRequest
│  └─ Detailed landmarks → VNDetectFaceLandmarksRequest
│
├─ Person detection (location only)?
│  └─ VNDetectHumanRectanglesRequest
│
├─ Recognize text in images?
│  ├─ Real-time from camera + need UI → DataScannerViewController (iOS 16+)
│  ├─ Processing captured image → VNRecognizeTextRequest
│  │  ├─ Need speed (real-time camera) → recognitionLevel = .fast
│  │  └─ Need accuracy (documents) → recognitionLevel = .accurate
│  └─ Need structured documents (iOS 26+) → RecognizeDocumentsRequest
│
├─ Detect barcodes/QR codes?
│  ├─ Real-time camera + need UI → DataScannerViewController (iOS 16+)
│  └─ Processing image → VNDetectBarcodesRequest
│
└─ Scan documents?
   ├─ Need built-in UI + perspective correction → VNDocumentCameraViewController
   ├─ Need structured data (tables, lists) → RecognizeDocumentsRequest (iOS 26+)
   └─ Custom pipeline → VNDetectDocumentSegmentationRequest + perspective correction

2. Set Up Background Processing

NEVER run Vision on main thread:

let processingQueue = DispatchQueue(label: "com.yourapp.vision", qos: .userInitiated)

processingQueue.async {
    do {
        let request = VNGenerateForegroundInstanceMaskRequest()
        let handler = VNImageRequestHandler(cgImage: image)
        try handler.perform([request])

        // Process observations...

        DispatchQueue.main.async {
            // Update UI
        }
    } catch {
        // Handle error
    }
}

3. Verify Platform Availability

API	Minimum Version
Subject segmentation (instance masks)	iOS 17+
VisionKit subject lifting	iOS 16+
Hand pose	iOS 14+
Body pose (2D)	iOS 14+
Body pose (3D)	iOS 17+
Person instance segmentation	iOS 17+
VNRecognizeTextRequest (basic)	iOS 13+
VNRecognizeTextRequest (accurate, multi-lang)	iOS 14+
VNDetectBarcodesRequest	iOS 11+
VNDetectBarcodesRequest (revision 2: Codabar, MicroQR)	iOS 15+
VNDetectBarcodesRequest (revision 3: ML-based)	iOS 16+
DataScannerViewController	iOS 16+
VNDocumentCameraViewController	iOS 13+
VNDetectDocumentSegmentationRequest	iOS 15+
RecognizeDocumentsRequest	iOS 26+

Common Patterns

Pattern 1: Isolate Object While Excluding Hand

User's original problem: Getting a bounding box around an object held in hand, without including the hand.

Root cause: VNGenerateForegroundInstanceMaskRequest is class-agnostic and treats hand+object as one subject.

Solution: Combine subject mask with hand pose to create exclusion mask.

// 1. Get subject instance mask
let subjectRequest = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: sourceImage)
try handler.perform([subjectRequest])

guard let subjectObservation = subjectRequest.results?.first as? VNInstanceMaskObservation else {
    fatalError("No subject detected")
}

// 2. Get hand pose landmarks
let handRequest = VNDetectHumanHandPoseRequest()
handRequest.maximumHandCount = 2
try handler.perform([handRequest])

guard let handObservation = handRequest.results?.first as? VNHumanHandPoseObservation else {
    // No hand detected - use full subject mask
    let mask = try subjectObservation.createScaledMask(
        for: subjectObservation.allInstances,
        croppedToInstancesContent: false
    )
    return mask
}

// 3. Create hand exclusion region from landmarks
let handPoints = try handObservation.recognizedPoints(.all)
let handBounds = calculateConvexHull(from: handPoints)  // Your implementation

// 4. Subtract hand region from subject mask using CoreImage
let subjectMask = try subjectObservation.createScaledMask(
    for: subjectObservation.allInstances,
    croppedToInstancesContent: false
)

let subjectCIMask = CIImage(cvPixelBuffer: subjectMask)
let handMask = createMaskFromRegion(handBounds, size: sourceImage.size)
let finalMask = subtractMasks(handMask: handMask, from: subjectCIMask)

// 5. Calculate bounding box from final mask
let objectBounds = calculateBoundingBox(from: finalMask)

Helper: Convex Hull

func calculateConvexHull(from points: [VNRecognizedPointKey: VNRecognizedPoint]) -> CGRect {
    // Get high-confidence points
    let validPoints = points.values.filter { $0.confidence > 0.5 }

    guard !validPoints.isEmpty else { return .zero }

    // Simple bounding rect (for more accuracy, use actual convex hull algorithm)
    let xs = validPoints.map { $0.location.x }
    let ys = validPoints.map { $0.location.y }

    let minX = xs.min()!
    let maxX = xs.max()!
    let minY = ys.min()!
    let maxY = ys.max()!

    return CGRect(
        x: minX,
        y: minY,
        width: maxX - minX,
        height: maxY - minY
    )
}

Cost: 2-5 hours initial implementation, 30 min ongoing maintenance

Pattern 2: VisionKit Simple Subject Lifting

Use case: Add system-like subject lifting UI with minimal code.

// iOS
let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject
imageView.addInteraction(interaction)

// macOS
let overlayView = ImageAnalysisOverlayView()
overlayView.preferredInteractionTypes = .imageSubject
nsView.addSubview(overlayView)

When to use:

✓ Want system behavior (long-press to select, drag to share)
✓ Don't need custom processing pipeline
✓ Image size within VisionKit limits (out-of-process)

Cost: 15 min implementation, 5 min ongoing

Pattern 3: Programmatic Subject Access (VisionKit)

Use case: Need subject images/bounds without UI interaction.

let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])

let analysis = try await analyzer.analyze(sourceImage, configuration: configuration)

// Get all subjects
for subject in analysis.subjects {
    let subjectImage = subject.image
    let subjectBounds = subject.bounds

    // Process subject...
}

// Tap-based lookup
if let subject = try await analysis.subject(at: tapPoint) {
    let compositeImage = try await analysis.image(for: [subject])
}

Cost: 30 min implementation, 10 min ongoing

Pattern 4: Vision Instance Mask for Custom Pipeline

Use case: HDR preservation, large images, custom compositing.

let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: sourceImage)
try handler.perform([request])

guard let observation = request.results?.first as? VNInstanceMaskObservation else {
    return
}

// Get soft segmentation mask
let mask = try observation.createScaledMask(
    for: observation.allInstances,
    croppedToInstancesContent: false  // Full resolution for compositing
)

// Use with CoreImage for HDR preservation
let filter = CIFilter(name: "CIBlendWithMask")!
filter.setValue(CIImage(cgImage: sourceImage), forKey: kCIInputImageKey)
filter.setValue(CIImage(cvPixelBuffer: mask), forKey: kCIInputMaskImageKey)
filter.setValue(newBackground, forKey: kCIInputBackgroundImageKey)

let compositedImage = filter.outputImage

Cost: 1 hour implementation, 15 min ongoing

Pattern 5: Tap-to-Select Instance

Use case: User taps to select which subject/person to lift.

// Get instance at tap point
let instance = observation.instanceAtPoint(tapPoint)

if instance == 0 {
    // Background tapped - select all instances
    let mask = try observation.createScaledMask(
        for: observation.allInstances,
        croppedToInstancesContent: false
    )
} else {
    // Specific instance tapped
    let mask = try observation.createScaledMask(
        for: IndexSet(integer: instance),
        croppedToInstancesContent: true
    )
}

Alternative: Raw pixel buffer access

let instanceMask = observation.instanceMask

CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }

let baseAddress = CVPixelBufferGetBaseAddress(instanceMask)
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)

// Convert normalized tap to pixel coordinates
let pixelPoint = VNImagePointForNormalizedPoint(
    tapPoint,
    width: imageWidth,
    height: imageHeight
)

let offset = Int(pixelPoint.y) * bytesPerRow + Int(pixelPoint.x)
let label = UnsafeRawPointer(baseAddress!).load(
    fromByteOffset: offset,
    as: UInt8.self
)

Cost: 45 min implementation, 10 min ongoing

Pattern 6: Hand Gesture Recognition (Pinch)

Use case: Detect pinch gesture for custom camera trigger or UI control.

let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 1

try handler.perform([request])

guard let observation = request.results?.first as? VNHumanHandPoseObservation else {
    return
}

let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)

// Check confidence
guard thumbTip.confidence > 0.5, indexTip.confidence > 0.5 else {
    return
}

// Calculate distance (normalized coordinates)
let dx = thumbTip.location.x - indexTip.location.x
let dy = thumbTip.location.y - indexTip.location.y
let distance = sqrt(dx * dx + dy * dy)

let isPinching = distance < 0.05  // Adjust threshold

// State machine for evidence accumulation
if isPinching {
    pinchFrameCount += 1
    if pinchFrameCount >= 3 {
        state = .pinched
    }
} else {
    pinchFrameCount = max(0, pinchFrameCount - 1)
    if pinchFrameCount == 0 {
        state = .apart
    }
}

Cost: 2 hours implementation, 20 min ongoing

Pattern 7: Separate Multiple People

Use case: Apply different effects to each person or count people.

let request = VNGeneratePersonInstanceMaskRequest()
try handler.perform([request])

guard let observation = request.results?.first as? VNInstanceMaskObservation else {
    return
}

let peopleCount = observation.allInstances.count  // Up to 4

for personIndex in observation.allInstances {
    let personMask = try observation.createScaledMask(
        for: IndexSet(integer: personIndex),
        croppedToInstancesContent: false
    )

    // Apply effect to this person only
    applyEffect(to: personMask, personIndex: personIndex)
}

Crowded scenes (>4 people):

// Count faces to detect crowding
let faceRequest = VNDetectFaceRectanglesRequest()
try handler.perform([faceRequest])

let faceCount = faceRequest.results?.count ?? 0

if faceCount > 4 {
    // Fallback: Use single mask for all people
    let singleMaskRequest = VNGeneratePersonSegmentationRequest()
    try handler.perform([singleMaskRequest])
}

Cost: 1.5 hours implementation, 15 min ongoing

Pattern 8: Body Pose for Action Classification

Use case: Fitness app that recognizes exercises (jumping jacks, squats, etc.)

// 1. Collect body pose observations
var poseObservations: [VNHumanBodyPoseObservation] = []

let request = VNDetectHumanBodyPoseRequest()
try handler.perform([request])

if let observation = request.results?.first as? VNHumanBodyPoseObservation {
    poseObservations.append(observation)
}

// 2. When you have 60 frames of poses, prepare for CreateML model
if poseObservations.count == 60 {
    var multiArray = try MLMultiArray(
        shape: [60, 18, 3],  // 60 frames, 18 joints, (x, y, confidence)
        dataType: .double
    )

    for (frameIndex, observation) in poseObservations.enumerated() {
        let allPoints = try observation.recognizedPoints(.all)

        for (jointIndex, (_, point)) in allPoints.enumerated() {
            multiArray[[frameIndex, jointIndex, 0] as [NSNumber]] = NSNumber(value: point.location.x)
            multiArray[[frameIndex, jointIndex, 1] as [NSNumber]] = NSNumber(value: point.location.y)
            multiArray[[frameIndex, jointIndex, 2] as [NSNumber]] = NSNumber(value: point.confidence)
        }
    }

    // 3. Run inference with CreateML model
    let input = YourActionClassifierInput(poses: multiArray)
    let output = try actionClassifier.prediction(input: input)

    let action = output.label  // "jumping_jacks", "squats", etc.
}

Cost: 3-4 hours implementation, 1 hour ongoing

Pattern 9: Text Recognition (OCR)

Use case: Extract text from images, receipts, signs, documents.

let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate  // Or .fast for real-time
request.recognitionLanguages = ["en-US"]  // Specify known languages
request.usesLanguageCorrection = true  // Helps accuracy

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

guard let observations = request.results as? [VNRecognizedTextObservation] else {
    return
}

for observation in observations {
    // Get top candidate (most likely)
    guard let candidate = observation.topCandidates(1).first else { continue }

    let text = candidate.string
    let confidence = candidate.confidence

    // Get bounding box for specific substring
    if let range = text.range(of: searchTerm) {
        if let boundingBox = try? candidate.boundingBox(for: range) {
            // Use for highlighting
        }
    }
}

Fast vs Accurate:

Fast: Real-time camera, large legible text (signs, billboards), character-by-character
Accurate: Documents, receipts, small text, handwriting, ML-based word/line recognition

Language tips:

Order matters: first language determines ML model for accurate path
Use automaticallyDetectsLanguage = true only when language unknown
Query supportedRecognitionLanguages for current revision

Cost: 30 min basic implementation, 2 hours with language handling

Pattern 10: Barcode/QR Code Detection

Use case: Scan product barcodes, QR codes, healthcare codes.

let request = VNDetectBarcodesRequest()
request.revision = VNDetectBarcodesRequestRevision3  // ML-based, iOS 16+
request.symbologies = [.qr, .ean13]  // Specify only what you need!

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

guard let observations = request.results as? [VNBarcodeObservation] else {
    return
}

for barcode in observations {
    let payload = barcode.payloadStringValue  // Decoded content
    let symbology = barcode.symbology  // Type of barcode
    let bounds = barcode.boundingBox  // Location (normalized)

    print("Found \(symbology): \(payload ?? "no string")")
}

Performance tip: Specifying fewer symbologies = faster scanning

Revision differences:

Revision 1: One code at a time, 1D codes return lines
Revision 2: Codabar, GS1Databar, MicroPDF, MicroQR, better with ROI
Revision 3: ML-based, multiple codes at once, better bounding boxes, fewer duplicates

Cost: 15 min implementation

Pattern 11: DataScannerViewController (Live Scanning)

Use case: Camera-based text/barcode scanning with built-in UI (iOS 16+).

import VisionKit

// Check support
guard DataScannerViewController.isSupported,
      DataScannerViewController.isAvailable else {
    // Not supported or camera access denied
    return
}

// Configure what to scan
let recognizedDataTypes: Set<DataScannerViewController.RecognizedDataType> = [
    .barcode(symbologies: [.qr]),
    .text(textContentType: .URL)  // Or nil for all text
]

// Create and present
let scanner = DataScannerViewController(
    recognizedDataTypes: recognizedDataTypes,
    qualityLevel: .balanced,  // Or .fast, .accurate
    recognizesMultipleItems: false,  // Center-most if false
    isHighFrameRateTrackingEnabled: true,  // For smooth highlights
    isPinchToZoomEnabled: true,
    isGuidanceEnabled: true,
    isHighlightingEnabled: true
)

scanner.delegate = self
present(scanner, animated: true) {
    try? scanner.startScanning()
}

Delegate methods:

func dataScanner(_ scanner: DataScannerViewController,
                 didTapOn item: RecognizedItem) {
    switch item {
    case .text(let text):
        print("Tapped text: \(text.transcript)")
    case .barcode(let barcode):
        print("Tapped barcode: \(barcode.payloadStringValue ?? "")")
    @unknown default: break
    }
}

// For custom highlights
func dataScanner(_ scanner: DataScannerViewController,
                 didAdd addedItems: [RecognizedItem],
                 allItems: [RecognizedItem]) {
    for item in addedItems {
        let highlight = createHighlight(for: item)
        scanner.overlayContainerView.addSubview(highlight)
    }
}

Async stream alternative:

for await items in scanner.recognizedItems {
    // Process current items
}

Cost: 45 min implementation with custom highlights

Pattern 12: Document Scanning with VNDocumentCameraViewController

Use case: Scan paper documents with automatic edge detection and perspective correction.

import VisionKit

let documentCamera = VNDocumentCameraViewController()
documentCamera.delegate = self
present(documentCamera, animated: true)

// In delegate
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                   didFinishWith scan: VNDocumentCameraScan) {
    controller.dismiss(animated: true)

    // Process each page
    for pageIndex in 0..<scan.pageCount {
        let image = scan.imageOfPage(at: pageIndex)

        // Now run text recognition on the corrected image
        let handler = VNImageRequestHandler(cgImage: image.cgImage!)
        let textRequest = VNRecognizeTextRequest()
        try? handler.perform([textRequest])
    }
}

Cost: 30 min implementation

Pattern 13: Document Segmentation (Custom Pipeline)

Use case: Detect document edges programmatically for custom camera UI.

let request = VNDetectDocumentSegmentationRequest()
let handler = VNImageRequestHandler(ciImage: inputImage)
try handler.perform([request])

guard let observation = request.results?.first,
      let document = observation as? VNRectangleObservation else {
    return
}

// Get corner points (normalized coordinates)
let topLeft = document.topLeft
let topRight = document.topRight
let bottomLeft = document.bottomLeft
let bottomRight = document.bottomRight

// Apply perspective correction with CoreImage
let correctedImage = inputImage
    .cropped(to: document.boundingBox.scaled(to: imageSize))
    .applyingFilter("CIPerspectiveCorrection", parameters: [
        "inputTopLeft": CIVector(cgPoint: topLeft.scaled(to: imageSize)),
        "inputTopRight": CIVector(cgPoint: topRight.scaled(to: imageSize)),
        "inputBottomLeft": CIVector(cgPoint: bottomLeft.scaled(to: imageSize)),
        "inputBottomRight": CIVector(cgPoint: bottomRight.scaled(to: imageSize))
    ])

VNDetectDocumentSegmentationRequest vs VNDetectRectanglesRequest:

Document: ML-based, trained on documents, handles non-rectangles, returns one document
Rectangle: Edge-based, finds any quadrilateral, returns multiple, CPU-only

Cost: 1-2 hours implementation

Pattern 14: Structured Document Extraction (iOS 26+)

Use case: Extract tables, lists, paragraphs with semantic understanding.

// iOS 26+
let request = RecognizeDocumentsRequest()
let observations = try await request.perform(on: imageData)

guard let document = observations.first?.document else {
    return
}

// Extract tables
for table in document.tables {
    for row in table.rows {
        for cell in row {
            let text = cell.content.text.transcript
            print("Cell: \(text)")
        }
    }
}

// Get detected data (emails, phones, URLs, dates)
let allDetectedData = document.text.detectedData
for data in allDetectedData {
    switch data.match.details {
    case .emailAddress(let email):
        print("Email: \(email.emailAddress)")
    case .phoneNumber(let phone):
        print("Phone: \(phone.phoneNumber)")
    case .link(let url):
        print("URL: \(url)")
    default: break
    }
}

Document hierarchy:

Document → containers (text, tables, lists, barcodes)
Table → rows → cells → content
Content → text (transcript, lines, paragraphs, words, detectedData)

Cost: 1 hour implementation

Pattern 15: Real-time Phone Number Scanner

Use case: Scan phone numbers from camera like barcode scanner (from WWDC 2019).

// 1. Use region of interest to guide user
let textRequest = VNRecognizeTextRequest { request, error in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }

    for observation in observations {
        guard let candidate = observation.topCandidates(1).first else { continue }

        // Use domain knowledge to filter
        if let phoneNumber = self.extractPhoneNumber(from: candidate.string) {
            self.stringTracker.add(phoneNumber)
        }
    }

    // Build evidence over frames
    if let stableNumber = self.stringTracker.getStableString(threshold: 10) {
        self.foundPhoneNumber(stableNumber)
    }
}

textRequest.recognitionLevel = .fast  // Real-time
textRequest.usesLanguageCorrection = false  // Codes, not natural text
textRequest.regionOfInterest = guidanceBox  // Crop to user's focus area

// 2. String tracker for stability
class StringTracker {
    private var seenStrings: [String: Int] = [:]

    func add(_ string: String) {
        seenStrings[string, default: 0] += 1
    }

    func getStableString(threshold: Int) -> String? {
        seenStrings.first { $0.value >= threshold }?.key
    }
}

Key techniques from WWDC 2019:

Use .fast recognition level for real-time
Disable language correction for codes/numbers
Use region of interest to improve speed and focus
Build evidence over multiple frames (string tracker)
Apply domain knowledge (phone number regex)

Cost: 2 hours implementation

Anti-Patterns

Anti-Pattern 1: Processing on Main Thread

Wrong:

let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])  // Blocks UI!

Right:

DispatchQueue.global(qos: .userInitiated).async {
    let request = VNGenerateForegroundInstanceMaskRequest()
    let handler = VNImageRequestHandler(cgImage: image)
    try handler.perform([request])

    DispatchQueue.main.async {
        // Update UI
    }
}

Why it matters: Vision is resource-intensive. Blocking main thread freezes UI.

Anti-Pattern 2: Ignoring Confidence Scores

Wrong:

let thumbTip = try observation.recognizedPoint(.thumbTip)
let location = thumbTip.location  // May be unreliable!

Right:

let thumbTip = try observation.recognizedPoint(.thumbTip)
guard thumbTip.confidence > 0.5 else {
    // Low confidence - landmark unreliable
    return
}
let location = thumbTip.location

Why it matters: Low confidence points are inaccurate (occlusion, blur, edge of frame).

Anti-Pattern 3: Forgetting Coordinate Conversion

Wrong (mixing coordinate systems):

// Vision uses lower-left origin
let visionPoint = recognizedPoint.location  // (0, 0) = bottom-left

// UIKit uses top-left origin
let uiPoint = CGPoint(x: axiom-visionPoint.x, y: axiom-visionPoint.y)  // WRONG!

Right:

let visionPoint = recognizedPoint.location

// Convert to UIKit coordinates
let uiPoint = CGPoint(
    x: axiom-visionPoint.x * imageWidth,
    y: (1 - visionPoint.y) * imageHeight  // Flip Y axis
)

Why it matters: Mismatched origins cause UI overlays to appear in wrong positions.

Anti-Pattern 4: Setting maximumHandCount Too High

Wrong:

let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 10  // "Just in case"

Right:

let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 2  // Only compute what you need

Why it matters: Performance scales with maximumHandCount. Pose computed for all detected hands ≤ max.

Anti-Pattern 5: Using ARKit When Vision Suffices

Wrong (if you don't need AR):

// Requires AR session just for body pose
let arSession = ARBodyTrackingConfiguration()

Right:

// Vision works offline on still images
let request = VNDetectHumanBodyPoseRequest()

Why it matters: ARKit body pose requires rear camera, AR session, supported devices. Vision works everywhere (even offline).

Pressure Scenarios

Scenario 1: "Just Ship the Feature"

Context: Product manager wants subject lifting "like in Photos app" by Friday. You're considering skipping background processing.

Pressure: "It's working on my iPhone 15 Pro, let's ship it."

Reality: Vision blocks UI on older devices. Users on iPhone 12 will experience frozen app.

Correct action:

Implement background queue (15 min)
Add loading indicator (10 min)
Test on iPhone 12 or earlier (5 min)

Push-back template: "Subject lifting works, but it freezes the UI on older devices. I need 30 minutes to add background processing and prevent 1-star reviews."

Scenario 2: "Training Our Own Model"

Context: Designer wants to exclude hands from subject bounding box. Engineer suggests training custom CoreML model for specific object detection.

Pressure: "We need perfect bounds, let's train a model."

Reality: Training requires labeled dataset (weeks), ongoing maintenance, and still won't generalize to new objects. Built-in Vision APIs + hand pose solve it in 2-5 hours.

Correct action:

Explain Pattern 1 (combine subject mask + hand pose)
Prototype in 1 hour to demonstrate
Compare against training timeline (weeks vs hours)

Push-back template: "Training a model takes weeks and only works for specific objects. I can combine Vision APIs to solve this in a few hours and it'll work for any object."

Scenario 3: "We Can't Wait for iOS 17"

Context: You need instance masks but app supports iOS 15+.

Pressure: "Just use iOS 15 person segmentation and ship it."

Reality: VNGeneratePersonSegmentationRequest (iOS 15) returns single mask for all people. Doesn't solve multi-person use case.

Correct action:

Raise minimum deployment target to iOS 17 (best UX)
OR implement fallback: use iOS 15 API but disable multi-person features
OR use @available to conditionally enable features

Push-back template: "Person segmentation on iOS 15 combines all people into one mask. We can either require iOS 17 for the best experience, or disable multi-person features on older OS versions. Which do you prefer?"

Checklist

Before shipping Vision features:

Performance:

☑ All Vision requests run on background queue
☑ UI shows loading indicator during processing
☑ Tested on iPhone 12 or earlier (not just latest devices)
☑ maximumHandCount set to minimum needed value

Accuracy:

☑ Confidence scores checked before using landmarks
☑ Fallback behavior for low confidence observations
☑ Handles case where no subjects/hands/people detected

Coordinates:

☑ Vision coordinates (lower-left origin) converted to UIKit (top-left)
☑ Normalized coordinates scaled to pixel dimensions
☑ UI overlays aligned correctly with image

Platform Support:

☑ @available checks for iOS 17+ APIs (instance masks)
☑ Fallback for iOS 14-16 (or raised deployment target)
☑ Tested on actual devices, not just simulator

Edge Cases:

☑ Handles images with no detectable subjects
☑ Handles partially occluded hands/bodies
☑ Handles hands/bodies near image edges
☑ Handles >4 people for person instance segmentation

CoreImage Integration (if applicable):

☑ HDR preservation verified with high dynamic range images
☑ Mask resolution matches source image
☑ croppedToInstancesContent set appropriately (false for compositing)

Text/Barcode Recognition (if applicable):

☑ Recognition level matches use case (fast for real-time, accurate for documents)
☑ Language correction disabled for codes/serial numbers
☑ Barcode symbologies limited to actual needs (performance)
☑ Region of interest used to focus scanning area
☑ Multiple candidates checked (not just top candidate)
☑ Evidence accumulated over frames for real-time (string tracker)
☑ DataScannerViewController availability checked before presenting

Resources

WWDC: 2019-234, 2021-10041, 2022-10024, 2022-10025, 2025-272, 2023-10176, 2023-111241, 2020-10653

Docs: /vision, /visionkit, /vision/vnrecognizetextrequest, /vision/vndetectbarcodesrequest

Skills: axiom-vision-ref, axiom-vision-diag

axiom-vision

Vision Framework Computer Vision

When to Use This Skill

Example Prompts

Red Flags

Mandatory First Steps

1. Choose the Right API (Decision Tree)

2. Set Up Background Processing

3. Verify Platform Availability

Common Patterns

Pattern 1: Isolate Object While Excluding Hand

Pattern 2: VisionKit Simple Subject Lifting

Pattern 3: Programmatic Subject Access (VisionKit)

Pattern 4: Vision Instance Mask for Custom Pipeline

Pattern 5: Tap-to-Select Instance

Pattern 6: Hand Gesture Recognition (Pinch)

Pattern 7: Separate Multiple People

Pattern 8: Body Pose for Action Classification

Pattern 9: Text Recognition (OCR)

Pattern 10: Barcode/QR Code Detection

Pattern 11: DataScannerViewController (Live Scanning)

Pattern 12: Document Scanning with VNDocumentCameraViewController

Pattern 13: Document Segmentation (Custom Pipeline)

Pattern 14: Structured Document Extraction (iOS 26+)

Pattern 15: Real-time Phone Number Scanner

Anti-Patterns

Anti-Pattern 1: Processing on Main Thread

Anti-Pattern 2: Ignoring Confidence Scores

Anti-Pattern 3: Forgetting Coordinate Conversion

Anti-Pattern 4: Setting maximumHandCount Too High

Anti-Pattern 5: Using ARKit When Vision Suffices

Pressure Scenarios

Scenario 1: "Just Ship the Feature"

Scenario 2: "Training Our Own Model"

Scenario 3: "We Can't Wait for iOS 17"

Checklist

Resources

Similar Skills