metodologia-ai-testing-strategy
Comprehensive testing strategy for AI systems — testing scope matrix (6 types x 6 layers), model prediction testing, data quality testing, compliance and fairness testing, integration approaches, and CI/CD test automation. This skill should be used when the user asks to "define AI testing strategy", "test ML models", "design data quality tests", "plan fairness testing", "test AI pipelines", "design integration tests for ML", or mentions adversarial testing, drift simulation, model regression testing, bias testing, explainability testing, or AI test automation.
AI Testing Strategy: Comprehensive Verification for AI-Enabled Systems
AI testing strategy defines how to verify that an AI system behaves correctly, fairly, securely, and reliably across all layers — from data ingestion through model inference to production monitoring. This skill produces a testing strategy document covering the testing scope matrix, model and prediction tests, data quality tests, compliance and fairness tests, integration approaches, and CI/CD test automation for AI pipelines.
Guiding Principle
If you can't test it, don't deploy it. In AI systems, "it works on my notebook" is not evidence of quality. The testing strategy must cover all 6 layers of the system and all 6 test types, with automation as a requirement, not an aspiration.
Testing Philosophy for AI
The complete matrix or nothing. Partial testing of AI systems is worse than no testing at all because it creates false confidence. A model with 95% accuracy but no fairness testing can still be discriminatory. A pipeline with integration tests but no data quality tests can silently process garbage.
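As an illustration of a fairness gate, a minimal demographic-parity check could look like the sketch below; the group labels, example predictions, and the idea of comparing the result to a CI threshold are assumptions, not part of this skill's outputs.

```python
# Minimal demographic-parity check (illustrative sketch).
# Group labels and example predictions are assumptions.

def demographic_parity_difference(preds, groups):
    """Max gap in positive-prediction rate across protected groups."""
    counts = {}  # group -> (positives, total)
    for p, g in zip(preds, groups):
        pos, total = counts.get(g, (0, 0))
        counts[g] = (pos + (1 if p == 1 else 0), total + 1)
    rates = [pos / total for pos, total in counts.values()]
    return max(rates) - min(rates)

preds  = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
dpd = demographic_parity_difference(preds, groups)  # 0.75 - 0.25 = 0.5
# A CI gate would compare dpd against a domain-specific threshold.
```

In a real pipeline, the threshold and the choice of fairness metric (demographic parity, equalized odds, etc.) come from the domain analysis this skill calls for, not from the code.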
Data quality testing IS the most important test. In traditional systems, the bugs are in the code. In AI systems, the bugs are in the data. Schema validation, distribution testing, lineage tracking, and training-serving skew detection are the first line of defense.
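A first-line schema check can be sketched in a few lines; the column names, types, and bounds below are illustrative assumptions:

```python
# Schema + range validation sketch for incoming training records.
# Column names, types, and bounds are illustrative assumptions.

EXPECTED_SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1e7),
}

def validate_row(row):
    """Return a list of violations for one record; empty list means valid."""
    violations = []
    for col, (typ, lo, hi) in EXPECTED_SCHEMA.items():
        if col not in row:
            violations.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            violations.append(f"bad type for {col}: {type(row[col]).__name__}")
        elif not (lo <= row[col] <= hi):
            violations.append(f"out-of-range {col}: {row[col]}")
    return violations

assert validate_row({"age": 34, "income": 52000.0}) == []
assert validate_row({"age": -3, "income": 52000.0}) == ["out-of-range age: -3"]
```

Production systems typically delegate this to a dedicated library (e.g. Great Expectations or TFDV), but the gate logic is the same: reject or quarantine data before it reaches training or serving.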
Continuous testing, not one-off testing. Models degrade over time (drift). Data changes. Features evolve. The testing strategy must include continuous monitoring in production, not just gates in the deployment pipeline.
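For continuous drift monitoring, one common statistic is the Population Stability Index (PSI). This is a pure-stdlib sketch; the bin count and the rule-of-thumb thresholds in the docstring are assumptions to be calibrated per feature:

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a live sample.
    Common rule of thumb (assumption): < 0.1 stable, 0.1-0.25 moderate
    drift, > 0.25 major drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against constant samples

    def fracs(sample):
        counts = [0] * bins
        for x in sample:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [max(c / len(sample), eps) for c in counts]

    fe, fa = fracs(expected), fracs(actual)
    return sum((e - a) * math.log(e / a) for e, a in zip(fe, fa))

baseline = [float(i) for i in range(100)]
shifted  = [x + 50.0 for x in baseline]  # simulated drift
```

A monitoring job would compute `psi` per feature on a rolling window and page when a threshold is crossed; those thresholds need the regular recalibration mentioned below for continuous learning systems.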
Inputs
The user provides a system or project name as $ARGUMENTS. Parse $1 as the system/project name used throughout all output artifacts.
Does not cover infrastructure testing (see metodologia-infrastructure-architecture)
Edge Cases
No Ground Truth Available:
Some AI systems (unsupervised, generative) lack clear ground truth. Use proxy metrics (human evaluation, downstream task performance), A/B testing against baselines, and consistency testing (similar inputs should produce similar outputs).
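Consistency testing can be sketched as a perturbation check; the `model` stand-in, perturbation scale, and trial count below are assumptions:

```python
import random

# Consistency test sketch for a system without clear ground truth:
# small input perturbations should not flip the output.
# `model` is a stand-in scorer (an assumption), not a real API.

def model(features):
    return 1 if sum(features) > 0 else 0

def consistency_rate(inputs, perturb_scale=1e-3, trials=20, seed=0):
    """Fraction of inputs whose prediction survives all perturbations."""
    rng = random.Random(seed)
    stable = 0
    for x in inputs:
        base = model(x)
        if all(model([v + rng.uniform(-perturb_scale, perturb_scale) for v in x]) == base
               for _ in range(trials)):
            stable += 1
    return stable / len(inputs)

rate = consistency_rate([[1.0, 2.0], [-3.0, 1.0], [0.5, 0.5]])
```

For generative systems the equality check would be replaced by a similarity measure (e.g. embedding distance) with an agreed tolerance.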
Regulated Environment with Audit Requirements:
Every test execution must produce evidence artifacts. Test reports must be immutable and timestamped. Treat the Integration Harness as mandatory for reproducible, audit-ready testing. A bottom-up integration approach ensures data compliance is validated first.
Continuous Learning System:
Model updates frequently with new data. Testing strategy must handle continuous model versioning. Regression testing must compare against stable baseline, not just previous version. Drift detection thresholds need regular recalibration.
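A regression gate against a pinned stable baseline might be sketched as follows; the metric names and the 0.02 tolerance are illustrative assumptions:

```python
# Regression gate sketch: compare a candidate's metrics against a pinned
# stable baseline, not merely the previous version.
# Metric names and tolerance are illustrative assumptions.

BASELINE = {"accuracy": 0.91, "f1": 0.88}  # pinned stable baseline
TOLERANCE = 0.02                           # allowed absolute regression

def regression_failures(candidate_metrics):
    """Return the metrics where the candidate regresses beyond tolerance."""
    return [m for m, base in BASELINE.items()
            if candidate_metrics.get(m, 0.0) < base - TOLERANCE]

assert regression_failures({"accuracy": 0.92, "f1": 0.90}) == []
assert regression_failures({"accuracy": 0.85, "f1": 0.90}) == ["accuracy"]
```

Pinning the baseline (rather than always comparing to the previous version) prevents slow degradation from slipping through a chain of individually tolerable regressions.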
Multi-Model Ensemble:
Testing individual models is necessary but insufficient. Ensemble behavior must be tested as a unit. Disagreement patterns between models should be analyzed. Voting/aggregation logic needs dedicated tests.
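Pairwise disagreement between ensemble members can be measured with a small helper; the example predictions are assumptions:

```python
from itertools import combinations

def disagreement_rate(predictions_by_model):
    """Mean pairwise disagreement across equal-length prediction lists.
    A spike in this rate often precedes a drop in ensemble accuracy."""
    pairs = list(combinations(predictions_by_model, 2))
    n = len(predictions_by_model[0])
    disagreements = sum(sum(1 for x, y in zip(a, b) if x != y)
                        for a, b in pairs)
    return disagreements / (len(pairs) * n)

# Three models, four inputs (illustrative values): 4 disagreements
# across 3 pairs x 4 inputs = 1/3.
rate = disagreement_rate([[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1]])
```

Tracking this rate alongside the dedicated tests for the voting/aggregation logic gives a unit-level and an ensemble-level signal.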
Privacy-Constrained Testing:
Production data cannot be used for testing (GDPR, HIPAA). Synthetic data generation must match production distributions without exposing real data. Apply differential privacy techniques to test data, and verify anonymization before any test dataset is created.
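One way to check that synthetic test data matches a production distribution is a two-sample Kolmogorov-Smirnov statistic; this pure-stdlib sketch and any acceptance threshold are assumptions (in practice, `scipy.stats.ks_2samp` would also give a p-value):

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: max absolute difference between the
    empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    values = sorted(set(a) | set(b))

    def ecdf(sample, v):
        return sum(1 for x in sample if x <= v) / len(sample)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in values)
```

Run this per feature against a held-out production summary (never raw records), and reject synthetic datasets whose statistic exceeds the agreed threshold.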
Validation Gate
Before finalizing delivery, verify:
Testing scope matrix covers all 6 types x 6 layers (cells prioritized, not necessarily all filled)
Model testing includes accuracy, adversarial, drift, counterfactual, and regression tests
Data quality testing covers schema, distribution, lineage, and training-serving skew
Compliance testing addresses governance, audit trails, privacy, and encryption requirements
Fairness testing uses appropriate metrics for the domain with defined thresholds
Integration approach selected and justified (top-down, bottom-up, parallel, harness)
CI/CD automation tiers defined with clear gates and triggers
Primary: A-04_AI_Testing_Strategy_Deep.html — Testing scope matrix (6x6), model test specifications, data quality test plan, compliance and fairness test design, integration approach diagram, CI/CD automation pipeline with gates.
Secondary: Test case templates (.md), fairness test specification, integration harness design, CI/CD gate configuration, test data strategy document.
Source: Avila, R.D. & Ahmad, I. (2025). Architecting AI Software Systems. Packt.