Specialized logic for cleaning and reshaping choice-based conjoint data from Qualtrics exports into analysis-ready long format. Use when (1) preparing conjoint survey data for analysis, (2) reshaping wide Qualtrics exports to long format, (3) mapping conjoint choice and rating variables to profile-level outcomes, (4) translating attribute labels across languages, (5) diagnosing pilot contamination or data quality issues in conjoint data, or (6) setting AMCE reference categories. Covers Qualtrics column conventions, existing R packages, wide-to-long reshaping, choice variable encoding, attribute-level translation, data validation, and analysis-ready output.
Install: `npx claudepluginhub scdenney/open-science-skills --plugin open-science-skills`

This skill uses the workspace's default tool permissions.
Export format: When exporting from Qualtrics, select "Use choice text" (not "Use numeric values") so that attribute levels appear as human-readable labels. If working with non-Latin scripts (Chinese, Korean, Arabic), export as XLSX rather than CSV to avoid UTF-8/ANSI encoding issues.
Metadata rows: Current Qualtrics CSV exports include 3 header rows before respondent data: (1) variable identifiers, (2) question text/descriptions, (3) ImportId JSON. Legacy exports have 2 rows. The cjoint::read.qualtrics() parameter new.format = TRUE (set explicitly; default is FALSE) handles the 3-row format. For manual import via readxl::read_excel() or readr::read_csv(), skip the appropriate number of metadata rows after reading headers.
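For manual import, a minimal sketch assuming a current 3-row export named `export.csv` (an assumed filename; use `skip = 2` for legacy 2-row exports):

```r
library(readr)
# Read only the header row for column names, then skip all metadata rows.
headers <- names(read_csv("export.csv", n_max = 0, show_col_types = FALSE))
data <- read_csv("export.csv", skip = 3, col_names = headers,
                 show_col_types = FALSE)
```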
Randomization order columns: If "Export viewing order data" is enabled, Qualtrics adds _DO_ columns (e.g., Block1_DO) containing pipe-separated integers showing element display order. These are useful for task-order robustness checks but are not needed for the core reshape.
Qualtrics conjoint experiments use one of three implementation methods, each producing different column naming conventions:
Method A — Conjoint Survey Design Tool (Strezhnev): Generates JavaScript that Qualtrics executes to randomize profiles. Column naming follows F-{task}-{profile}-{attribute} for attribute levels and F-{task}-{attribute} for attribute names. The cjoint R package's read.qualtrics() function is purpose-built for this format.
Method B — Custom JavaScript + Embedded Data: Researchers write JavaScript to randomize attributes and store values in Qualtrics embedded data fields. Column naming is researcher-defined, commonly C{x}-F-{task}-{idx} for attribute names and C{x}-F-{task}-{profile}-{idx} for profile values. Requires manual reshaping (Section 4).
Method C — Loop & Merge: Each loop iteration represents one conjoint task. Embedded data fields are referenced via ${e://Field/variable_name} and displayed with ${lm://Field/N}. Column names reflect the embedded data field structure. Requires manual reshaping.
Before writing any cleaning code: Inspect the actual column headers, the QSF survey definition file, or any JavaScript in the survey to determine which method was used. Do not assume a column naming convention.
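A quick header scan can narrow down the method before any cleaning code is written; the patterns below are assumptions based on the conventions above, not guaranteed matches:

```r
cols <- names(data)
head(grep("^F-\\d+-\\d+", cols, value = TRUE))  # Method A: F-{task}-{profile}-{attribute}
head(grep("-F-\\d+", cols, value = TRUE))       # Method B style: C{x}-F-{task}-...
grep("_DO", cols, value = TRUE)                 # viewing-order columns, if exported
```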
Before writing custom reshaping code, check whether an existing package handles the data format:
cjoint::read.qualtrics() — Purpose-built for Conjoint SDT exports (Method A). Reads Qualtrics CSV directly, handles metadata rows, outputs one row per profile with a selected column. Parameters: responses (choice column names), covariates (respondent-level variables), respondentID, new.format (TRUE for 3-row headers). Limited to binary forced-choice outcomes. See the sketch after this list.
cjdata::reshape_conjoint() — Lightweight alternative. Functions: read_Qualtrics() + reshape_conjoint(). Handles basic wide-to-long conversion. Requires binary outcome variables (1/2 or "A"/"B"). Respondent covariates merged separately.
projoint::reshape_projoint() — For measurement-error-corrected analysis. Built-in support for repeated tasks (IRR estimation), missing data imputation (.fill = TRUE), and bias-corrected AMCEs. Outcome column names must contain task ID digits.
cregg::cj_tidy() — Reshapes wide-format data with automatic constraint detection via formula notation.
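For Method A data, a sketch of the cjoint import (the choice-question names and covariates are hypothetical):

```r
library(cjoint)
cj_data <- read.qualtrics("export.csv",
  responses    = c("Q17", "Q19", "Q21"),  # hypothetical choice-question columns
  covariates   = c("age", "gender"),      # hypothetical respondent covariates
  respondentID = "ResponseId",
  new.format   = TRUE)                    # current 3-row header format
```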
When to use manual reshaping: When the Qualtrics implementation uses custom embedded data fields (Method B) with non-standard column naming, or when the data requires language translation, attribute name merges, or pilot data exclusion that existing packages cannot handle.
When existing packages cannot handle the data format, reshape manually. The goal: one row per respondent x task x profile, one column per attribute.
Step 1: Build a long table of (ResponseId, task, profile, attribute_name, attribute_value)
Iterate over tasks, profiles, and attribute positions. For each combination, read the attribute name from the name column and the corresponding value from the value column. This naturally handles randomized attribute order.
```r
rows <- list()                       # one data.frame per (task, profile, attribute)
for (task in 1:T) {                  # T tasks, P profiles, K attribute positions
  name_cols <- paste0(prefix, "-F-", task, "-", 1:K)
  for (profile in 1:P) {
    val_cols <- paste0(prefix, "-F-", task, "-", profile, "-", 1:K)
    for (idx in 1:K)
      rows[[length(rows) + 1L]] <- data.frame(
        ResponseId = data$ResponseId, task = task, profile = profile,
        attribute_name = data[[name_cols[idx]]],
        attribute_value = data[[val_cols[idx]]])
  }
}
```
Bind the accumulated pieces once with long <- data.table::rbindlist(rows); binding once is far faster than growing a data frame inside the loop for large datasets.
Step 2: Filter missing data. Remove rows where attribute_name or attribute_value is NA — these indicate respondents who skipped the conjoint section.
Step 3: Apply attribute name and level merges before pivoting. Fix typos, encoding variants, or synonymous levels.
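In data.table syntax, Steps 2 and 3 might look like this (the "Eduction" typo fix is a hypothetical example of a name merge):

```r
library(data.table)
long <- long[!is.na(attribute_name) & !is.na(attribute_value)]     # Step 2
long[attribute_name == "Eduction", attribute_name := "Education"]  # Step 3
```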
Step 4: Pivot to wide-by-attribute using data.table::dcast() or tidyr::pivot_wider():
```r
wide <- data.table::dcast(long, ResponseId + task + profile ~ attribute_name,
                          value.var = "attribute_value")
```
If dcast warns about duplicate row/column combinations, two positions in the same task share an attribute name for some respondents — investigate the name merge.
Choice variables are separate Qualtrics questions (MC type), one per task, placed after each conjoint display.
Identify choice columns: These are NOT part of the embedded data fields. Map each choice question to its task number (Q17 -> task 1, Q19 -> task 2, etc.). Inspect the QSF or survey flow to confirm the mapping.
Text vs. numeric encoding: Text exports produce labels like "Person A"/"Person B" or "Profile 1"/"Profile 2". Numeric exports produce 1/2. Always verify from the actual data — do not assume.
Create the binary outcome:
```r
chosen_profile <- ifelse(raw_choice == "Person A", 1L,
                  ifelse(raw_choice == "Person B", 2L, NA_integer_))
# Merge by (ResponseId, task), then:
chosen <- as.integer(profile == chosen_profile)
```
Handle missing choices: Drop rows where chosen is NA. Some tasks may have higher dropout rates than others (especially the last task).
Ratings may be per-profile (one per profile per task — usable as a continuous conjoint DV) or per-task (one rating of the chosen profile — descriptive only, not a standard conjoint DV).
Endpoint label recoding: Qualtrics text exports encode scale endpoints as text labels (e.g., "Strongly disagree" = 1, "Strongly agree" = 7). Recode these before converting to numeric. Intermediate scale points are already numeric.
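A minimal sketch, assuming a 7-point scale whose endpoints export as text (the column name and labels are placeholders):

```r
# Endpoints arrive as labels; intermediate points are already numeric strings.
rating_chr <- dplyr::recode(wide$rating_raw,
  "Strongly disagree" = "1",
  "Strongly agree"    = "7")
wide$rating <- as.numeric(rating_chr)
```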
Use recode_factor() with deliberate level ordering. The order of arguments sets the factor level order, which determines both the AMCE reference category (the first level serves as the baseline) and the order in which levels appear in plots and tables.
Reference category principles: AMCEs are estimated relative to the reference level, so choose a substantively meaningful baseline (for example, a status-quo or modal level) and keep it consistent across specifications; marginal means do not depend on the reference choice.
Catching unexpected values: Set .default = NA_character_ in recode_factor(). Without this, unrecognized values silently become new factor levels, masking pilot data contamination. This is not default behavior — it must be set explicitly.
Drop unused levels: After filtering, call droplevels() to remove factor levels with zero observations. cregg::cj() requires 2+ realized levels per factor.
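Putting the last three points together, a sketch with a hypothetical Education attribute:

```r
wide$Education <- dplyr::recode_factor(wide$Education,
  "No formal education" = "No formal education",  # first argument = reference level
  "Primary"             = "Primary",
  "Secondary"           = "Secondary",
  "University"          = "University",
  .default = NA_character_)            # unexpected values become NA, not new levels
wide <- wide[!is.na(wide$Education), ]
wide$Education <- droplevels(wide$Education)  # remove zero-observation levels
```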
Pilot detection: Compare unique attribute levels in the data against the final design document. Extra levels (e.g., a country not in the final design) indicate pilot/pre-test respondents. Exclude these as respondents (all their rows), not just the anomalous rows — their entire randomization was generated by a different design.
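A sketch of the comparison and the respondent-level exclusion (`design_levels` and the Country attribute are hypothetical):

```r
setdiff(unique(wide$Country), design_levels$Country)  # extra levels suggest pilot data
pilot_ids <- unique(wide$ResponseId[!wide$Country %in% design_levels$Country])
wide <- wide[!wide$ResponseId %in% pilot_ids, ]       # drop all rows for those respondents
```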
Missing conjoint data: Respondents who skipped the conjoint section produce all-NA attribute columns. The NA filter in Step 2 removes them. Respondents who dropped out mid-conjoint will have fewer than T x P rows — this is acceptable for cregg::cj().
Duplicate attribute names: Each task should have exactly K unique attribute names. If a merge creates duplicates within a task, the merge is incorrect.
Choice completeness: chosen should sum to T per respondent (one chosen profile per task). Fewer indicates missing choices for some tasks.
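These checks are easy to script; a sketch assuming the reshaped data is in `final`:

```r
# Chosen profiles per respondent should equal the number of tasks (T).
per_resp <- tapply(final$chosen, final$ResponseId, sum, na.rm = TRUE)
table(per_resp)  # values below T indicate missing choices for some tasks
```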
Include respondent-level covariates in the analysis-ready dataset even for main-effects-only analysis. Future subgroup and interaction analyses should not require re-running data prep.
Merge demographics (age, gender, education, urban/rural, ethnicity, party membership), treatment assignments, randomization indicators, and open-text responses by ResponseId after reshaping.
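A merge sketch (the covariate names are placeholders for whatever the survey collected):

```r
covars <- c("ResponseId", "age", "gender", "education", "urban_rural")
respondents <- data[!duplicated(data$ResponseId), covars]  # one row per respondent
final <- merge(final, respondents, by = "ResponseId", all.x = TRUE)
```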
Save as .rds files (one per conjoint). The output should have:
- ResponseId: respondent identifier (character or numeric)
- task: task number (integer)
- profile: profile number (integer)
- chosen: binary outcome (numeric 0/1, not logical or factor)
- rating: rating outcome, if collected (numeric)
- one column per attribute, stored as factors (cregg::cj() will error on character columns)

cregg compatibility: cj(data, chosen ~ Attr1 + Attr2 + ..., id = ~ResponseId). The id parameter requires a tilde formula (~ResponseId), not a bare name. The estimate parameter accepts "amce", "mm", "mm_differences", "amce_differences", and "frequencies".
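End to end, the estimation call and save might look like this (attribute names are hypothetical):

```r
library(cregg)
final$chosen <- as.numeric(final$chosen)  # ensure numeric 0/1, not logical/factor
amces <- cj(final, chosen ~ Country + Education + Gender,
            id = ~ResponseId, estimate = "amce")
plot(amces)
saveRDS(final, "conjoint_main.rds")
```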
cjoint compatibility: Expects a selected column (logical) and attributes named with the F-based convention. Use cjoint::amce() for estimation.
projoint compatibility: Requires a projoint_data object created via reshape_projoint(). Supports repeated-task IRR estimation and bias-corrected AMCEs.
Final checklist:
- Checked whether cjoint, cjdata, or projoint can handle the data format before writing custom reshaping code
- chosen column is numeric, not logical or factor
- recode_factor(.default = NA_character_) used to surface unexpected values