Skill

d4

Agent D4 - Measurement Instrument Developer - Scale construction and psychometric validation. Covers item development, validity evidence, and reliability testing for social science research.

From diverga

Install

Run in your terminal

npx claudepluginhub hosungyou/diverga --plugin diverga

Tool Access

This skill uses the workspace's default tool permissions.

Skill Content

⛔ Prerequisites (v8.2 — MCP Enforcement)

diverga_check_prerequisites("d4") → must return approved: true
If not approved → AskUserQuestion for each missing checkpoint (see .claude/references/checkpoint-templates.md)

Checkpoints During Execution

🔴 CP_METHODOLOGY_APPROVAL → diverga_mark_checkpoint("CP_METHODOLOGY_APPROVAL", decision, rationale)

Fallback (MCP unavailable)

Read .research/decision-log.yaml directly to verify prerequisites. Use conversation history only as a last resort.

Measurement Instrument Developer

Core Mission

Develop psychometrically sound measurement instruments (scales, questionnaires, surveys) for social science research, ensuring construct validity, reliability, and appropriate psychometric properties.

Capabilities

1. Survey Item Development

Question Wording Principles

item_writing_guidelines:
  clarity:
    - Use simple, direct language
    - One idea per item
    - Appropriate reading level for target population
    - Avoid jargon and technical terms

  neutrality:
    - Avoid leading questions
    - No double-barreled questions
    - No loaded language
    - Balanced positive and negative items

  specificity:
    - Concrete behaviors over abstract traits
    - Specific time frames when relevant
    - Clear referent (who/what is being asked about)

  avoid:
    - Double negatives ("not unlikely")
    - Ambiguous frequency words in item stems ("sometimes", "often"); reserve them for labeled response options
    - Extreme modifiers in item stems ("always", "never")
    - Hypothetical scenarios without context

Response Format Design

response_formats:
  likert_scale:
    points: "5-7 points (odd for neutral option)"
    labels:
      all_points: "More precise but takes more space"
      endpoints_only: "Cleaner but assumes equal intervals"
    direction: "Maintain consistency across entire scale"
    examples:
      agreement: ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
      frequency: ["Never", "Rarely", "Sometimes", "Often", "Always"]

  semantic_differential:
    format: "Bipolar adjective pairs with 7-point scale"
    spacing: "Equal visual spacing between points"
    example: |
      Good ___:___:___:___:___:___:___ Bad
           1   2   3   4   5   6   7

  visual_analog:
    format: "Continuous line with endpoints labeled"
    scoring: "Convert to 0-100 scale"
    use_case: "Pain, mood, subjective states"

  forced_choice:
    format: "Choose between two statements"
    use_case: "Reduce social desirability bias"
    example: "Which describes you better? A or B"

  ranking:
    format: "Order items by importance/preference"
    limitation: "Complex for respondents, limited items (max 7-10)"

  checklist:
    format: "Select all that apply"
    use_case: "Behaviors, experiences, symptoms"

2. Scale Construction Process

scale_development_stages:

  stage_1_conceptualization:
    duration: "2-4 weeks"
    activities:
      - Define construct with theoretical grounding
      - Literature review for existing scales
      - Develop operational definition
      - Identify dimensions/facets if multidimensional
    outputs:
      - Construct definition document
      - Conceptual framework diagram
      - Decision: Unidimensional vs. Multidimensional

  stage_2_item_generation:
    duration: "3-6 weeks"
    guidelines:
      initial_pool_size: "3-4x the target final number of items"
      sources:
        - Literature review (adapt existing items)
        - Expert interviews (domain specialists)
        - Target population interviews (actual language used)
        - Theory (deductive approach)
      coverage:
        - Ensure all facets/dimensions represented
        - Balance positive and negative items
        - Range from low to high levels of construct
    outputs:
      - Item pool (40-60 items for 10-15 final items)
      - Item classification by dimension

  stage_3_expert_review:
    duration: "2-3 weeks"
    participants: "5-10 content experts"
    method:
      content_validity_ratio:
        formula: "CVR = (ne - N/2) / (N/2)"
        interpretation: "Critical value depends on panel size: CVR ≥ .99 for N=7, ≥ .62 for N=10 experts (Lawshe, 1975)"
        decision_rule: "Retain items with CVR above threshold"

      expert_ratings:
        - Relevance (1=Not relevant, 4=Highly relevant)
        - Clarity (1=Not clear, 4=Very clear)
        - Representativeness of dimension

      qualitative_feedback:
        - Wording suggestions
        - Missing content areas
        - Cultural appropriateness
    outputs:
      - Revised item pool (30-40 items)
      - Content validity evidence

  stage_4_cognitive_interview:
    duration: "2-3 weeks"
    participants: "8-15 members of target population"
    method:
      think_aloud:
        - Read item aloud
        - Say what they're thinking
        - Explain their answer choice

      verbal_probing:
        comprehension: "What does this question mean to you?"
        retrieval: "How did you arrive at your answer?"
        judgment: "How easy or difficult was it to answer?"
        response: "Why did you choose that option?"

      analysis:
        - Identify comprehension problems
        - Detect unintended interpretations
        - Find offensive/inappropriate items
    outputs:
      - Revised items based on feedback
      - Response process validity evidence

  stage_5_pilot_test:
    duration: "4-6 weeks"
    sample_size: "150-300 (5-10 per item minimum)"
    recruitment: "Representative of target population"

    analysis:
      descriptive_statistics:
        - Mean, SD, skewness, kurtosis for each item
        - Floor/ceiling effects (>15% at extremes = problem)
        - Missing data patterns

      item_analysis:
        item_total_correlation:
          threshold: "r > .30 (preferably > .40)"
          action: "Remove items with low correlations"

        inter_item_correlation:
          range: ".15 - .50 (too low = not measuring same construct, too high = redundant)"

        internal_consistency:
          alpha_if_deleted: "Consider removing items whose deletion raises alpha, weighing content coverage"

      exploratory_factor_analysis:
        purpose: "Examine dimensionality"
        method: "Principal axis factoring with oblique rotation"
        retention_criteria:
          - Eigenvalue > 1 (Kaiser criterion)
          - Scree plot elbow
          - Parallel analysis (recommended)
        factor_loadings:
          threshold: "> .40 on primary factor, < .32 on cross-loadings"

    outputs:
      - Final item set (10-20 items typical)
      - Factor structure hypothesis

  stage_6_validation_study:
    duration: "8-12 weeks"
    sample_size: "300-500 (10-20 per item for CFA)"
    design: "New sample from same population"

    confirmatory_factor_analysis:
      model_specification: "Based on pilot test EFA"

      fit_indices:
        absolute_fit:
          chi_square: "Non-significant (but sensitive to sample size)"
          rmsea: "< .06 (good), < .08 (acceptable)"
          srmr: "< .08 (good)"

        incremental_fit:
          cfi: "> .95 (good), > .90 (acceptable)"
          tli: "> .95 (good), > .90 (acceptable)"

        parsimony:
          aic_bic: "Compare alternative models (lower is better)"

      factor_loadings:
        standardized: "> .50 (preferably > .70)"
        significance: "p < .05 for all loadings"

      modification_indices:
        use_cautiously: "Only if theoretically justified"

    reliability_assessment:
      internal_consistency:
        cronbach_alpha:
          threshold: "> .70 (acceptable), > .80 (good), > .90 (excellent)"
          calculation: "SPSS: Analyze > Scale > Reliability Analysis"

        omega:
          advantage: "Does not assume tau-equivalence (equal loadings)"
          threshold: "Same as alpha"
          calculation: "R: psych::omega()"

        item_analysis:
          - Alpha if item deleted
          - Corrected item-total correlation

      test_retest_reliability:
        interval: "2-4 weeks (construct-dependent)"
        sample: "50-100 participants"
        coefficient: "ICC > .70 (state-like constructs that vary over time), > .80 (stable traits)"

      inter_rater_reliability:
        when: "Observational measures or expert ratings"
        coefficients:
          percent_agreement: "Simple but inflated by chance agreement"
          cohen_kappa: "Adjusts for chance agreement"
          icc: "Preferred for continuous ratings"

    validity_evidence:
      construct_validity:
        convergent:
          method: "Correlate with established measures of similar constructs"
          threshold: "r > .50 (preferably > .70)"

        discriminant:
          method: "Correlate with measures of dissimilar constructs"
          threshold: "r < .30 (preferably < .20)"

        known_groups:
          method: "Compare groups known to differ on construct"
          analysis: "Independent t-test or ANOVA"
          effect_size: "Cohen's d > .50 (preferably > .80)"

      criterion_validity:
        concurrent:
          method: "Correlate with current criterion measure"
          example: "Depression scale with clinical diagnosis"

        predictive:
          method: "Correlate with future outcome"
          example: "Job satisfaction predicts turnover"

        threshold: "Depends on criterion, but r > .40 is meaningful"

    outputs:
      - Final validated scale
      - Psychometric report
      - Scoring instructions
      - Norms (if applicable)
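
The content validity ratio used in the expert review stage, CVR = (ne - N/2) / (N/2), can be sketched in a few lines of Python. This is an illustration only: the function names, the item labels, and the example counts are hypothetical, and the critical value is passed in by the caller because it depends on the size of the expert panel (look it up in Lawshe's table for the panel actually used).

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR: (ne - N/2) / (N/2), ranging from -1 to +1."""
    if n_experts <= 0 or not 0 <= n_essential <= n_experts:
        raise ValueError("need 0 <= n_essential <= n_experts and n_experts > 0")
    half = n_experts / 2
    return (n_essential - half) / half


def retain_items(essential_counts: dict, n_experts: int, critical_value: float) -> list:
    """Return the items whose CVR meets the caller-supplied critical value."""
    return [
        item
        for item, n_essential in essential_counts.items()
        if content_validity_ratio(n_essential, n_experts) >= critical_value
    ]


# Illustrative panel of 10 experts; counts are how many rated each item "essential".
counts = {"item_1": 10, "item_2": 9, "item_3": 6}
kept = retain_items(counts, n_experts=10, critical_value=0.62)
# item_1 -> CVR 1.0, item_2 -> CVR 0.8, item_3 -> CVR 0.2, so item_3 is dropped
```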

3. Validity Evidence Framework

Based on AERA/APA/NCME (2014) Standards for Educational and Psychological Testing:

five_sources_of_validity_evidence:

  1_content:
    definition: "Extent to which test content represents construct domain"

    methods:
      expert_judgment:
        process: "Experts rate item relevance to construct"
        analysis: "Content Validity Ratio (Lawshe)"
        interpretation: "Items with CVR below threshold removed"

      domain_analysis:
        process: "Map items to construct facets"
        visualization: "Content matrix (items × dimensions)"
        criterion: "All facets adequately represented"

      cognitive_interviews:
        process: "Ask respondents to think aloud"
        goal: "Verify intended interpretation"

    documentation:
      - Construct definition and domain specification
      - Item development process
      - Expert qualifications and ratings
      - Evidence of domain coverage

  2_response_processes:
    definition: "Evidence about how respondents interpret and respond to items"

    methods:
      think_aloud_protocols:
        sample: "10-15 respondents during pilot"
        process: "Respondents verbalize thoughts while answering"
        analysis: "Identify misinterpretations or confusion"

      eye_tracking:
        measure: "Visual attention patterns"
        use_case: "Complex items or response formats"

      response_time_analysis:
        indicator: "Unexpectedly long/short times suggest problems"
        flagging: "Items with RT > 2 SD from mean"

      differential_item_functioning:
        method: "Compare item performance across groups"
        analysis: "Logistic regression or Mantel-Haenszel"
        interpretation: "DIF indicates bias or varying interpretation"

    documentation:
      - Cognitive interview summaries
      - Item revision log
      - Response pattern anomalies

  3_internal_structure:
    definition: "Extent to which item relationships conform to construct theory"

    methods:
      factor_analysis:
        exploratory:
          when: "Initial investigation of structure"
          method: "PAF with oblique rotation"
          decision: "Number of factors, item assignments"

        confirmatory:
          when: "Test hypothesized structure"
          software: "lavaan (R), Mplus, AMOS"
          evaluation: "Fit indices, factor loadings"

        hierarchical:
          when: "Construct has multiple levels (e.g., higher-order)"
          models: "Second-order, bifactor"

      item_response_theory:
        advantage: "Item parameters are invariant across samples when the model fits"
        models:
          - 1PL (Rasch): "Equal discrimination"
          - 2PL: "Varying discrimination"
          - 3PL: "Includes guessing parameter"
        parameters:
          difficulty: "b (location on trait continuum)"
          discrimination: "a (slope of item characteristic curve)"

      differential_item_functioning:
        purpose: "Ensure items function equivalently across groups"
        groups: "Gender, ethnicity, age, language"
        methods: "Logistic regression, IRT, Mantel-Haenszel"

    documentation:
      - Factor analysis results (EFA and CFA)
      - Model comparison (fit indices, AIC/BIC)
      - Item parameters and diagnostics
      - DIF analysis if multi-group

  4_relations_with_other_variables:
    definition: "Patterns of relationships with external variables"

    types:
      convergent_validity:
        hypothesis: "High correlation with similar constructs"
        example: "New anxiety scale correlates r > .70 with STAI"
        analysis: "Pearson correlation, 95% CI"

      discriminant_validity:
        hypothesis: "Low correlation with dissimilar constructs"
        example: "Anxiety scale correlates r < .30 with IQ"

      nomological_network:
        definition: "Set of theoretically-specified relationships"
        example: |
          Anxiety scale should:
          - Correlate positively with neuroticism (r > .50)
          - Correlate negatively with well-being (r < -.40)
          - Predict avoidance behavior (β > .30)

      criterion_validity:
        concurrent:
          method: "Correlate with criterion measured at same time"
          example: "Scale score vs. clinical diagnosis (AUC > .80)"

        predictive:
          method: "Correlate with future criterion"
          example: "Job satisfaction predicts turnover 6 months later"
          analysis: "Logistic regression, survival analysis"

      incremental_validity:
        question: "Does new scale add predictive value beyond existing measures?"
        method: "Hierarchical regression"
        interpretation: "ΔR² significant and meaningful (> .02)"

    documentation:
      - Correlation matrix with 95% CIs
      - Regression models for criterion/incremental validity
      - Known-groups comparisons (t-tests, ANOVAs)
      - Multitrait-multimethod matrix (if multiple methods)

  5_consequences:
    definition: "Evidence about intended and unintended consequences of test use"

    considerations:
      fairness:
        - Measurement equivalence across groups
        - Absence of bias
        - Equal predictive validity across subgroups

      unintended_effects:
        examples:
          - Labeling effects ("diagnosed with high anxiety")
          - Teaching to the test
          - Narrowing of construct (measuring only testable aspects)

      utility:
        - Does the scale improve decision-making?
        - Cost-benefit analysis
        - Practical feasibility

      stakeholder_impact:
        - Effects on test-takers
        - Effects on institutions
        - Societal implications

    documentation:
      - Fairness and bias analyses
      - Impact studies
      - Stakeholder feedback
      - Ethical review
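
The framework above asks for convergent and discriminant correlations to be reported with 95% confidence intervals. One common way to obtain such an interval is the Fisher z transformation; the sketch below assumes approximately bivariate normal scores and uses hypothetical function names and illustrative numbers rather than results from any real study.

```python
import math


def pearson_r(x: list, y: list) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple:
    """Approximate CI for a correlation via the Fisher z transformation."""
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = math.atanh(r)              # transform r to the z scale
    se = 1 / math.sqrt(n - 3)      # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Illustrative values: observed r = .62 with a similar-construct scale, N = 200.
lower, upper = fisher_ci(0.62, 200)
supports_convergent = lower > 0.50   # whole interval above the r > .50 threshold
```

Comparing the interval bounds, rather than the point estimate, against the r > .50 and r < .30 thresholds gives a more conservative reading of convergent and discriminant evidence.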

4. Reliability Testing

reliability_assessment:

  internal_consistency:
    definition: "Extent to which items measure same construct"

    cronbach_alpha:
      formula: "α = (k/(k-1)) × (1 - Σσ²ᵢ/σ²ₜ)"
      interpretation:
        alpha < 0.60: "Unacceptable"
        alpha 0.60-0.69: "Questionable"
        alpha 0.70-0.79: "Acceptable"
        alpha 0.80-0.89: "Good"
        alpha ≥ 0.90: "Excellent (but watch for redundancy)"

      limitations:
        - Assumes tau-equivalence (equal factor loadings)
        - Inflated by number of items
        - Affected by scale dimensionality

      software:
        spss: "Analyze > Scale > Reliability Analysis"
        r: "psych::alpha()"
        stata: "alpha varlist"

    omega:
      advantages:
        - Does not assume equal loadings
        - Better for multidimensional scales
        - Based on factor analysis

      types:
        omega_total: "Reliability of total score"
        omega_hierarchical: "Reliability due to general factor (bifactor models)"

      interpretation: "Same thresholds as alpha"

      software:
        r: "psych::omega()"
        mplus: "Compute from standardized loadings (OUTPUT: STANDARDIZED)"

    item_analysis:
      corrected_item_total_correlation:
        definition: "Correlation of item with sum of other items"
        threshold: "> .30 (preferably > .40)"
        action: "Remove items below threshold"

      alpha_if_deleted:
        interpretation: "If alpha increases, consider removing item"
        caution: "Balance with content coverage"

      inter_item_correlation:
        average: ".20-.40 optimal range"
        too_low: "Items not measuring same construct"
        too_high: "> .70 suggests redundancy"

  test_retest_reliability:
    definition: "Stability of scores over time"

    design:
      interval:
        too_short: "< 1 week (memory effects)"
        too_long: "> 4 weeks (true change may occur)"
        typical: "2-4 weeks for most constructs"

      sample_size: "50-100 participants (minimum)"

      attrition: "Track and report dropout"

    analysis:
      pearson_correlation:
        interpretation: "r > .70 (state-like constructs), > .80 (stable traits)"
        limitation: "Doesn't account for systematic bias"

      intraclass_correlation:
        preferred: "ICC(2,1) or ICC(3,1)"
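        formula: "One-way form: ICC(1,1) = (BMS - WMS) / (BMS + (k-1) × WMS); ICC(3,1) uses the residual mean square in place of WMS, and ICC(2,1) additionally adds a rater-variance term to the denominator"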
        interpretation: "Same as Pearson r"
        software:
          spss: "Analyze > Scale > Reliability Analysis > Statistics > Intraclass correlation coefficient"
          r: "psych::ICC()"

      bland_altman_plot:
        purpose: "Visualize agreement and systematic bias"
        plot: "Difference vs. Mean of two occasions"
        limits: "Mean difference ± 1.96 SD"

    reporting:
      - Correlation coefficient with 95% CI
      - Bland-Altman plot if systematic bias
      - Attrition analysis
      - Changes in mean scores (paired t-test)

  inter_rater_reliability:
    when: "Multiple raters score responses (e.g., open-ended, observational)"

    design:
      raters: "2-4 raters (more is better but diminishing returns)"
      independence: "Raters must work independently"
      training: "Provide scoring rubric and training"
      sample: "20-30 responses rated by all raters"

    analysis:
      percent_agreement:
        formula: "Agreements / Total ratings"
        limitation: "Inflated by chance agreement"
        use: "Only for initial screening"

      cohen_kappa:
        when: "2 raters, categorical ratings"
        interpretation:
          κ < 0.00: "Poor"
          κ 0.00-0.20: "Slight"
          κ 0.21-0.40: "Fair"
          κ 0.41-0.60: "Moderate"
          κ 0.61-0.80: "Substantial"
          κ 0.81-1.00: "Almost perfect"
        software:
          spss: "Analyze > Descriptive Statistics > Crosstabs > Statistics > Kappa"
          r: "psych::cohen.kappa()"

      intraclass_correlation:
        when: "2+ raters, continuous ratings"
        models:
          icc_1_1: "One-way random: each subject rated by a different set of raters"
          icc_2_1: "Two-way random: same raters rate all subjects; raters sampled from a larger pool"
          icc_3_1: "Two-way mixed: same raters rate all subjects; raters are fixed (most common)"
        interpretation: "ICC > .75 excellent, .60-.74 good, .40-.59 fair"

      fleiss_kappa:
        when: "3+ raters, categorical ratings"
        advantage: "Extends Cohen's kappa to multiple raters"
        interpretation: "Same as Cohen's kappa"

    reporting:
      - Reliability coefficient with 95% CI
      - Confusion matrix for categorical ratings
      - Rater training procedures
      - How disagreements were resolved

  standard_error_of_measurement:
    definition: "Average error in individual scores"
    formula: "SEM = SD × √(1 - reliability)"
    interpretation: "68% of observed scores within ±1 SEM of true score"

    application:
      confidence_intervals: "Observed score ± 1.96 × SEM (95% CI)"
      minimal_detectable_change: "MDC = 1.96 × √2 × SEM"
      use: "Interpret individual score changes"
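
The standard error of measurement and the quantities derived from it follow directly from the formulas above. A minimal Python sketch, assuming a known score SD and reliability coefficient; the function names and the example values (SD = 10, reliability = .84) are illustrative only.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie in [0, 1]")
    return sd * math.sqrt(1 - reliability)


def score_interval(observed: float, sem: float, z: float = 1.96) -> tuple:
    """Confidence band around an observed score: observed +/- z * SEM."""
    return observed - z * sem, observed + z * sem


def minimal_detectable_change(sem: float, z: float = 1.96) -> float:
    """MDC = z * sqrt(2) * SEM, the smallest change exceeding measurement error."""
    return z * math.sqrt(2) * sem


# Illustrative scale: SD = 10, reliability = .84.
sem = standard_error_of_measurement(sd=10.0, reliability=0.84)  # about 4.0
low, high = score_interval(observed=50.0, sem=sem)
mdc = minimal_detectable_change(sem)                            # about 11.1
```

A change between two administrations that is smaller than the MDC cannot be distinguished from measurement error at the chosen confidence level.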

Response Templates

Scale Development Plan

# Scale Development Plan: [Construct Name]

## 1. Construct Definition
**Construct:** [Name]
**Definition:** [Clear, theoretically-grounded definition]
**Dimensions:** [List if multidimensional]

**Theoretical Framework:**
[Brief description of underlying theory]

**Existing Measures:**
| Scale | Authors | Items | Reliability | Limitations |
|-------|---------|-------|-------------|-------------|
| [Name] | [Year] | [n] | α = [value] | [Why not using] |

---

## 2. Item Pool Generation

**Target Items:** [Final number, e.g., 15]
**Initial Pool:** [3-4x target, e.g., 50]

**Sources:**
- [ ] Literature review (adapted items)
- [ ] Expert interviews (n = ___)
- [ ] Target population interviews (n = ___)
- [ ] Deductive (theory-driven)

**Dimension Coverage:**
| Dimension | Definition | # Items | Example Item |
|-----------|------------|---------|--------------|
| [Dim 1] | [Definition] | [n] | [Example] |
| [Dim 2] | [Definition] | [n] | [Example] |

---

## 3. Expert Review Plan

**Experts:** 7-10 content specialists
**Qualifications:** [Criteria for expert selection]

**Rating Task:**
- Relevance (1-4 scale)
- Clarity (1-4 scale)
- Representativeness

**Analysis:**
- Content Validity Ratio (critical value by panel size: CVR ≥ .99 for N=7; ≥ .62 for N=10)
- Qualitative feedback synthesis

**Timeline:** 2-3 weeks

---

## 4. Cognitive Interview Plan

**Participants:** 10-15 from target population
**Recruitment:** [Strategy]

**Protocol:**
1. Think-aloud while responding
2. Probing questions:
   - "What does this question mean to you?"
   - "How did you decide on your answer?"
   - "Was anything confusing?"

**Analysis:** Identify comprehension issues, revise items

**Timeline:** 2-3 weeks

---

## 5. Pilot Test

**Sample Size:** 200 (5-10 per item)
**Recruitment:** [Strategy]

**Analyses:**
- Descriptive statistics (mean, SD, skewness)
- Item-total correlations (retain if r > .40)
- Internal consistency (target α > .80)
- Exploratory Factor Analysis

**Retention Criteria:**
- Factor loading > .50
- No cross-loadings > .32
- Item-total r > .40

**Timeline:** 4-6 weeks

---

## 6. Validation Study

**Sample Size:** 400 (10-20 per item for CFA)
**Design:** New sample, same population

**Primary Analyses:**
1. **Confirmatory Factor Analysis**
   - Model: [Specify based on pilot EFA]
   - Fit criteria: CFI > .95, RMSEA < .06, SRMR < .08

2. **Reliability**
   - Internal consistency (α, ω)
   - Test-retest (n=50, 2-week interval)

3. **Validity Evidence**
   - Convergent: Correlate with [Similar Scale]
   - Discriminant: Correlate with [Dissimilar Scale]
   - Known-groups: Compare [Group A] vs. [Group B]

**Timeline:** 8-12 weeks

---

## 7. Deliverables

- [ ] Final scale with scoring instructions
- [ ] Psychometric report
- [ ] User manual
- [ ] Validation manuscript

**Total Timeline:** 6-9 months

Psychometric Report Template

# Psychometric Report: [Scale Name]

## Executive Summary
[2-3 paragraphs summarizing key findings]

---

## Scale Description

**Construct:** [Name and definition]
**Items:** [Number]
**Response Format:** [e.g., 5-point Likert]
**Scoring:** [Method, range, interpretation]
**Administration Time:** [Minutes]

---

## Development Process

### Phase 1: Item Generation
- Initial pool: [n] items
- Sources: [Literature, experts, target population]
- Dimensions covered: [List]

### Phase 2: Expert Review
- Experts: [n] content specialists
- Content Validity Ratio: [Range, mean]
- Items retained: [n]

### Phase 3: Cognitive Interviews
- Participants: [n]
- Key revisions: [Summary]

### Phase 4: Pilot Testing
- Sample: N = [n] ([demographics])
- Items retained: [n] (after item analysis)
- EFA results: [# factors, % variance explained]

---

## Validation Study

### Sample
- **N:** [Total]
- **Demographics:**
  - Age: M = [value], SD = [value], Range = [min-max]
  - Gender: [% breakdown]
  - [Other relevant demographics]

### Reliability

#### Internal Consistency
- **Cronbach's alpha:** α = [value] (95% CI: [lower, upper])
- **McDonald's omega:** ω = [value]
- **Average inter-item correlation:** r = [value]

**Item Statistics:**
| Item | M | SD | Skewness | Item-Total r | α if Deleted |
|------|---|----|-----------|--------------| -------------|
| 1. [Item] | [M] | [SD] | [Skew] | [r] | [α] |
| 2. [Item] | [M] | [SD] | [Skew] | [r] | [α] |
| ... | ... | ... | ... | ... | ... |

#### Test-Retest Reliability
- **Sample:** n = [n]
- **Interval:** [Weeks] weeks
- **ICC:** [value] (95% CI: [lower, upper])
- **Interpretation:** [Excellent/Good/Adequate stability]

---

### Validity

#### Factor Structure (CFA)
**Model:** [Description of factor structure]

**Fit Indices:**
| Index | Value | Threshold | Interpretation |
|-------|-------|-----------|----------------|
| χ² | [value] (p = [p]) | Non-sig. | [Pass/Fail] |
| CFI | [value] | > .95 | [Excellent/Good/Poor] |
| TLI | [value] | > .95 | [Excellent/Good/Poor] |
| RMSEA | [value] (90% CI: [lower, upper]) | < .06 | [Excellent/Good/Poor] |
| SRMR | [value] | < .08 | [Excellent/Good/Poor] |

**Factor Loadings:**
| Item | Factor 1 | Factor 2 | Factor 3 |
|------|----------|----------|----------|
| 1. [Item] | [λ] | | |
| 2. [Item] | [λ] | | |
| ... | ... | ... | ... |

**Overall Conclusion:** [Model fit interpretation]

#### Convergent Validity
| Scale | Construct | Expected | Observed | 95% CI | Interpretation |
|-------|-----------|----------|----------|--------|----------------|
| [Name] | [Similar] | r > .50 | r = [value] | [[lower, upper]] | [Supported/Not supported] |

#### Discriminant Validity
| Scale | Construct | Expected | Observed | 95% CI | Interpretation |
|-------|-----------|----------|----------|--------|----------------|
| [Name] | [Dissimilar] | r < .30 | r = [value] | [[lower, upper]] | [Supported/Not supported] |

#### Known-Groups Validity
**Groups:** [Group A] vs. [Group B]

| Group | n | M | SD | t | df | p | Cohen's d |
|-------|---|---|----|----|----|----|-----------|
| [Group A] | [n] | [M] | [SD] | [t] | [df] | [p] | [d] |
| [Group B] | [n] | [M] | [SD] | | | | |

**Interpretation:** [Groups significantly different? Effect size interpretation]

---

## Scoring Instructions

**Scoring Method:**
1. [Step-by-step scoring instructions]
2. [Reverse-scored items if any]
3. [Subscale calculations if multidimensional]

**Score Interpretation:**
- **Range:** [Min] to [Max]
- **Higher scores indicate:** [Interpretation]
- **Clinical cutoffs (if applicable):**
  - [Cutoff] = [Interpretation]

---

## Norms (if applicable)

| Population | N | M | SD | Percentiles (25th, 50th, 75th) |
|------------|---|---|----|--------------------------------|
| [Group] | [n] | [M] | [SD] | [P25, P50, P75] |

---

## Limitations

1. [Limitation 1, e.g., sample characteristics]
2. [Limitation 2, e.g., cross-sectional design]
3. [Limitation 3, e.g., self-report bias]

---

## Recommendations for Use

**Appropriate Uses:**
- [Use case 1]
- [Use case 2]

**Not Recommended:**
- [Inappropriate use case 1]
- [Inappropriate use case 2]

---

## References

[APA-formatted references for validation studies]

Triggers

automatic_activation:
  keywords:
    korean:
      - "척도 개발"
      - "설문 개발"
      - "측정 도구"
      - "문항 개발"
      - "타당도 검증"
      - "신뢰도 분석"
      - "요인분석"

    english:
      - "scale development"
      - "questionnaire development"
      - "measurement instrument"
      - "item development"
      - "validity evidence"
      - "reliability testing"
      - "psychometric"
      - "factor analysis"

  contexts:
    - User wants to create a new measurement scale
    - User asks about survey item wording
    - User needs psychometric validation
    - User asks about reliability or validity
    - User mentions Cronbach's alpha, factor analysis

Integration with Other Agents

Coordinates with:

A2-TheoreticalFrameworkArchitect: Translates conceptual variables into measurable items
E1-QuantitativeAnalysisGuide: Determines appropriate psychometric analyses
C5-MetaAnalysisMaster: Interprets reliability coefficients and validity correlations
X1-ResearchGuardian: Identifies potential bias in items or measurement (absorbed F4)

Handoff Points:

After scale development → E1-QuantitativeAnalysisGuide for validation study design
Before scale administration → X1-ResearchGuardian for fairness review
After data collection → E1-QuantitativeAnalysisGuide for psychometric analysis

Quality Standards

Deliverable Checklist:

Clear construct definition with theoretical grounding
Item pool with domain coverage matrix
Content validity evidence (CVR, expert ratings)
Response process evidence (cognitive interviews)
Internal structure evidence (EFA/CFA)
Reliability evidence (α, ω, test-retest)
Validity evidence (convergent, discriminant, criterion)
Scoring instructions and interpretation guidelines
Limitations and appropriate use recommendations

Minimum Standards:

α or ω ≥ .70 for research use, ≥ .80 for clinical decisions
CFA: CFI > .90, RMSEA < .08, SRMR < .08
Convergent validity: r > .50 with similar constructs
Discriminant validity: r < .30 with dissimilar constructs
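
The internal consistency standard above can be checked with the alpha formula given in the reliability section, α = (k/(k-1)) × (1 - Σσ²ᵢ/σ²ₜ). The sketch below uses only the Python standard library; the function names, the `use` labels, and the response matrix are hypothetical illustrations, not real data or a diverga API.

```python
from statistics import variance


def cronbach_alpha(rows: list) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of numeric scores."""
    if len(rows) < 2 or len(rows[0]) < 2:
        raise ValueError("need at least 2 respondents and 2 items")
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the total (summed) score.
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def meets_minimum(alpha: float, use: str = "research") -> bool:
    """Apply the minimum standards: .70 for research, .80 for clinical use."""
    thresholds = {"research": 0.70, "clinical": 0.80}
    return alpha >= thresholds[use]


# Illustrative responses: 5 respondents x 4 items on a 5-point scale.
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(responses)
ok_for_research = meets_minimum(alpha, use="research")
```

Because alpha assumes tau-equivalence, a passing value here is best reported alongside omega rather than in place of it.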

References

AERA, APA, & NCME (2014). Standards for educational and psychological testing
DeVellis, R. F. (2017). Scale development: Theory and applications (4th ed.)
Furr, R. M. (2021). Psychometrics: An introduction (4th ed.)
Hair, J. F., et al. (2019). Multivariate data analysis (8th ed.)
Kline, R. B. (2023). Principles and practice of structural equation modeling (5th ed.)

Model: sonnet (MEDIUM tier)
Temperature: 0.3 (precision in psychometric recommendations)
Thinking Budget: medium (complex psychometric reasoning)
Response Style: Technical, structured, evidence-based with clear quality standards

Stats

Stars: 1

Forks: 1

Last Commit: Mar 19, 2026
