Search everything...

Skill

maxquant-proteomics

Processes MaxQuant proteomics outputs in Python: parses proteinGroups.txt, filters contaminants/decoys, log2 median-normalizes, imputes MNAR, t-tests with FDR, volcano plots, GO enrichment.

Python

data-engineering

npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills

Tool Access

This skill uses the workspace's default tool permissions.

Preview

MaxQuant is the community-standard software for label-free quantification (LFQ) and SILAC proteomics. It performs database search, protein grouping, and intensity-based quantification from raw LC-MS/MS files, producing `proteinGroups.txt` as the primary output. Downstream statistical analysis — filtering, normalization, imputation, differential abundance testing, and visualization — is performe...

SKILL.md

Similar Skills

proteomics-de

778

Performs differential expression analysis on LFQ proteomics data from MaxQuant proteinGroups.txt and DIA-NN outputs, including filtering, log2 transformation, Gaussian imputation, t-tests, s0-FDR correction, PCA, and volcano plots.

6 files

clawbio

tooluniverse-proteomics-analysis

1.3k

Analyzes mass spectrometry proteomics data for protein quantification, differential expression, PTMs, and PPIs. Processes MaxQuant, Spectronaut, DIA-NN outputs; performs normalization, stats, enrichment, transcriptomics integration.

3 files

mims-harvard-tooluniverse

pyopenms-mass-spectrometry

135

Processes LC-MS/MS data using PyOpenMS for proteomics/metabolomics: mzML/mzXML I/O, signal processing (smoothing/peak picking/centroiding), feature detection/linking, peptide/protein ID with FDR.

3 files

sciagent-skills

Stats

Stars135

Forks16

Last CommitApr 28, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

maxquant-proteomics | sciagent-skills | ClaudePluginHub

Back to Skills

Skill

maxquant-proteomics

From sciagent-skills

Processes MaxQuant proteomics outputs in Python: parses proteinGroups.txt, filters contaminants/decoys, log2 median-normalizes, imputes MNAR, t-tests with FDR, volcano plots, GO enrichment.

Python

data-engineering

npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills

Tool Access

This skill uses the workspace's default tool permissions.

Preview

SKILL.md

MaxQuant + Perseus — Proteomics Analysis Pipeline

Overview

MaxQuant is the community-standard software for label-free quantification (LFQ) and SILAC proteomics. It performs database search, protein grouping, and intensity-based quantification from raw LC-MS/MS files, producing proteinGroups.txt as the primary output. Downstream statistical analysis — filtering, normalization, imputation, differential abundance testing, and visualization — is performed in Python using pandas, scipy, and matplotlib/seaborn, mirroring the Perseus workflow in a reproducible scripting environment.

When to Use

Performing label-free quantification (LFQ) of proteins across multiple biological conditions — MaxQuant's MaxLFQ algorithm is the community benchmark
Running SILAC (stable isotope labeling) experiments with light/heavy or triple-label designs
Processing iTRAQ or TMT isobaric labeling experiments via MaxQuant's reporter ion quantification
Identifying and quantifying proteins when you need the widely-cited MaxQuant output format (proteinGroups.txt) for comparison with published datasets
Performing statistical differential abundance analysis on MaxQuant outputs without installing Perseus (GUI-only, Windows)
Generating publication-quality volcano plots and GO enrichment from proteomics data in a reproducible Python workflow
Use Proteome Discoverer instead when working with Thermo raw files requiring instrument-native processing or Sequest HT
Use FragPipe/MSFragger instead for GPU-accelerated database search (3–10× faster) or when processing DIA (data-independent acquisition) data

Prerequisites

MaxQuant: Windows software; download from https://maxquant.org/ (v2.4+); requires .NET 6 runtime
Python packages: pandas, numpy, scipy, matplotlib, seaborn, statsmodels, gseapy
Data requirements: Thermo .raw files or mzML-converted files; FASTA protein database (UniProt reviewed + contaminant database)
Environment: MaxQuant runs on Windows (GUI or CLI); Python analysis runs cross-platform

pip install pandas numpy scipy matplotlib seaborn statsmodels gseapy

# Install pyMaxQuant for programmatic mqpar.xml configuration
pip install pymaxquant

Quick Start

import pandas as pd
import numpy as np

# Load MaxQuant output
df = pd.read_csv("combined/txt/proteinGroups.txt", sep="\t", low_memory=False)
print(f"Raw protein groups: {len(df)}")

# Filter contaminants, reverse decoys, only-by-site
mask = (
    (df["Potential contaminant"] != "+") &
    (df["Reverse"] != "+") &
    (df["Only identified by site"] != "+")
)
df = df[mask].copy()
print(f"After filtering: {len(df)} protein groups")

# Extract LFQ intensity columns
lfq_cols = [c for c in df.columns if c.startswith("LFQ intensity ")]
print(f"LFQ columns: {lfq_cols}")

# Log2-transform (0 → NaN)
lfq = df[lfq_cols].replace(0, np.nan)
lfq = np.log2(lfq)
print(f"Valid values per sample:\n{lfq.notna().sum()}")

Workflow

Step 1: Configure MaxQuant Parameters via mqpar.xml

MaxQuant is controlled by an XML parameter file (mqpar.xml). Edit it programmatically to set file paths, enzyme, modifications, and quantification type before running the search.

import xml.etree.ElementTree as ET

def update_mqpar(template_path: str, output_path: str,
                 raw_files: list[str], fasta_path: str,
                 experiment_names: list[str]) -> None:
    """Update mqpar.xml with sample-specific file paths."""
    tree = ET.parse(template_path)
    root = tree.getroot()

    # Set raw file paths
    file_paths_node = root.find(".//filePaths")
    file_paths_node.clear()
    for rf in raw_files:
        elem = ET.SubElement(file_paths_node, "string")
        elem.text = rf

    # Set experiment names (maps files to conditions)
    experiments_node = root.find(".//experiments")
    experiments_node.clear()
    for name in experiment_names:
        elem = ET.SubElement(experiments_node, "string")
        elem.text = name

    # Set FASTA database
    fasta_node = root.find(".//fastaFiles/FastaFileInfo/fastaFilePath")
    fasta_node.text = fasta_path

    tree.write(output_path, xml_declaration=True, encoding="utf-8")
    print(f"Written: {output_path}")

# Example usage
raw_files = [
    r"C:\Data\ctrl_rep1.raw",
    r"C:\Data\ctrl_rep2.raw",
    r"C:\Data\treat_rep1.raw",
    r"C:\Data\treat_rep2.raw",
]
update_mqpar(
    template_path="mqpar_template.xml",
    output_path="mqpar.xml",
    raw_files=raw_files,
    fasta_path=r"C:\Databases\human_uniprot_contaminants.fasta",
    experiment_names=["ctrl", "ctrl", "treat", "treat"],
)

Key mqpar.xml parameters (set in template or edit directly):

<!-- Enzyme and search settings -->
<enzymes>
  <string>Trypsin/P</string>
</enzymes>
<maxMissedCleavages>2</maxMissedCleavages>
<variableModifications>
  <string>Oxidation (M)</string>
  <string>Acetyl (Protein N-term)</string>
</variableModifications>
<fixedModifications>
  <string>Carbamidomethyl (C)</string>
</fixedModifications>

<!-- LFQ settings -->
<lfqMode>1</lfqMode>                     <!-- 1 = LFQ enabled -->
<lfqMinRatioCount>2</lfqMinRatioCount>   <!-- minimum peptides for LFQ -->
<matchBetweenRuns>True</matchBetweenRuns>

<!-- FDR thresholds -->
<peptideFdr>0.01</peptideFdr>
<proteinFdr>0.01</proteinFdr>

Step 2: Run MaxQuant from Command Line (Windows)

MaxQuant can be run headlessly from the Windows command prompt using the bundled MaxQuantCmd.exe.

REM Windows Command Prompt — run MaxQuant with configured mqpar.xml
REM Adjust path to match your MaxQuant installation directory

set MQ_PATH=C:\Program Files\MaxQuant\bin\MaxQuantCmd.exe
set MQPAR=C:\Projects\proteomics\mqpar.xml

"%MQ_PATH%" "%MQPAR%"

REM For specific workflow steps only (useful for reruns):
REM Step IDs: 0=write tables, 1=feature detection, 7=peptide identification
"%MQ_PATH%" "%MQPAR%" --steps 1,7,11

# Cross-platform: run MaxQuant under Wine on Linux/macOS (CI/server use)
wine MaxQuantCmd.exe mqpar.xml

# Monitor progress log
tail -f combined/proc/#runningTimes.txt

Step 3: Load and Filter proteinGroups.txt

Filter out reverse decoys, potential contaminants, and proteins only identified by modification site.

import pandas as pd
import numpy as np

def load_protein_groups(path: str) -> pd.DataFrame:
    """Load MaxQuant proteinGroups.txt with quality filters applied."""
    df = pd.read_csv(path, sep="\t", low_memory=False)
    print(f"Total protein groups: {len(df)}")

    # Remove reverse decoys, contaminants, and only-by-site hits
    n_before = len(df)
    df = df[
        (df.get("Reverse", pd.Series("")) != "+") &
        (df.get("Potential contaminant", pd.Series("")) != "+") &
        (df.get("Only identified by site", pd.Series("")) != "+")
    ].copy()
    print(f"After quality filter: {len(df)} ({n_before - len(df)} removed)")

    # Parse gene names (take first entry for multi-gene groups)
    df["Gene names"] = df["Gene names"].fillna("Unknown").str.split(";").str[0]

    # Set unique index on majority protein ID
    df = df.set_index("Majority protein IDs")
    return df

# Load output
pg = load_protein_groups("combined/txt/proteinGroups.txt")

# Identify LFQ intensity columns
lfq_cols = [c for c in pg.columns if c.startswith("LFQ intensity ")]
print(f"LFQ samples ({len(lfq_cols)}): {lfq_cols}")
# Output: LFQ samples (6): ['LFQ intensity ctrl_1', 'LFQ intensity ctrl_2', ...]

Step 4: Log2 Transform and Median Normalize LFQ Intensities

Replace zero intensities with NaN (missing values in MaxQuant are exported as 0), log2-transform, then apply per-sample median centering.

def prepare_lfq_matrix(df: pd.DataFrame, lfq_cols: list[str]) -> pd.DataFrame:
    """Extract, transform, and normalize LFQ intensity matrix."""
    # Extract and rename columns (strip 'LFQ intensity ' prefix)
    lfq = df[lfq_cols].copy()
    lfq.columns = [c.replace("LFQ intensity ", "") for c in lfq_cols]

    # Replace 0 with NaN (MaxQuant encodes missing as 0)
    lfq = lfq.replace(0, np.nan)

    # Log2 transform
    lfq = np.log2(lfq)

    # Median centering per sample (subtract per-column median of valid values)
    col_medians = lfq.median(axis=0)
    global_median = col_medians.median()
    lfq = lfq.subtract(col_medians, axis=1).add(global_median)

    print(f"Matrix shape: {lfq.shape}")
    print(f"Missing values per sample:\n{lfq.isna().sum()}")
    print(f"Valid values per sample:\n{lfq.notna().sum()}")
    return lfq

lfq_matrix = prepare_lfq_matrix(pg, lfq_cols)
# Matrix shape: (3241, 6)
# Missing values per sample: ctrl_1: 421, ctrl_2: 389, ...

Step 5: Impute Missing Values (MNAR Strategy)

Missing-not-at-random (MNAR) values arise from proteins below the detection limit. Impute from the low end of the observed intensity distribution — the standard Perseus approach.

def impute_mnar(lfq: pd.DataFrame,
                width: float = 0.3,
                downshift: float = 1.8,
                random_state: int = 42) -> pd.DataFrame:
    """
    Impute MNAR missing values from a downshifted Gaussian.

    Parameters
    ----------
    width      : std of imputation distribution (fraction of sample std)
    downshift  : downshift in units of sample std below mean
    random_state : for reproducibility
    """
    rng = np.random.default_rng(random_state)
    lfq_imp = lfq.copy()

    for col in lfq_imp.columns:
        col_data = lfq_imp[col].dropna()
        col_mean = col_data.mean()
        col_std = col_data.std()

        n_missing = lfq_imp[col].isna().sum()
        if n_missing > 0:
            imputed = rng.normal(
                loc=col_mean - downshift * col_std,
                scale=width * col_std,
                size=n_missing,
            )
            lfq_imp.loc[lfq_imp[col].isna(), col] = imputed

    print(f"Imputed {lfq.isna().sum().sum()} missing values")
    return lfq_imp

lfq_imputed = impute_mnar(lfq_matrix)
# Imputed 2847 missing values

Step 6: Statistical Testing — t-test with FDR Correction

Perform two-sample t-tests for each protein between conditions, then apply Benjamini-Hochberg FDR correction.

from scipy import stats
from statsmodels.stats.multitest import multipletests

def differential_abundance(lfq: pd.DataFrame,
                            group_a: list[str],
                            group_b: list[str],
                            alpha: float = 0.05) -> pd.DataFrame:
    """
    Two-sample t-test + BH FDR correction for all proteins.

    Parameters
    ----------
    group_a, group_b : sample name lists for each condition
    alpha : FDR threshold
    """
    results = []
    for protein_id, row in lfq.iterrows():
        a_vals = row[group_a].dropna().values
        b_vals = row[group_b].dropna().values

        if len(a_vals) >= 2 and len(b_vals) >= 2:
            t_stat, p_val = stats.ttest_ind(a_vals, b_vals, equal_var=False)
            log2fc = b_vals.mean() - a_vals.mean()
        else:
            t_stat, p_val, log2fc = np.nan, np.nan, np.nan

        results.append({
            "protein_id": protein_id,
            "log2FC": log2fc,
            "pvalue": p_val,
            "t_stat": t_stat,
        })

    res_df = pd.DataFrame(results).set_index("protein_id")

    # BH FDR correction on valid p-values
    valid = res_df["pvalue"].notna()
    _, padj, _, _ = multipletests(res_df.loc[valid, "pvalue"], method="fdr_bh")
    res_df.loc[valid, "padj"] = padj

    # Add significance flag
    res_df["significant"] = (res_df["padj"] < alpha) & (res_df["pvalue"].notna())
    sig_count = res_df["significant"].sum()
    print(f"Significant proteins (FDR < {alpha}): {sig_count}")

    return res_df.sort_values("padj")

# Define sample groups
group_ctrl  = ["ctrl_1",  "ctrl_2",  "ctrl_3"]
group_treat = ["treat_1", "treat_2", "treat_3"]

results = differential_abundance(lfq_imputed, group_ctrl, group_treat)
print(results[results["significant"]].head(10))
# Significant proteins (FDR < 0.05): 312

Step 7: Volcano Plot Visualization

Generate a publication-quality volcano plot showing log2 fold change vs. -log10(p-value) with significance thresholds highlighted.

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

def plot_volcano(results: pd.DataFrame,
                 gene_names: pd.Series,
                 fc_threshold: float = 1.0,
                 pval_threshold: float = 0.05,
                 top_n_labels: int = 10,
                 save_path: str = "volcano_plot.pdf") -> None:
    """Volcano plot: log2FC vs -log10(adjusted p-value)."""
    df = results.copy()
    df["gene"] = gene_names.reindex(df.index).fillna("Unknown")
    df["-log10p"] = -np.log10(df["padj"].clip(lower=1e-300))

    # Classify regulation
    df["regulation"] = "ns"
    df.loc[(df["log2FC"] >  fc_threshold) & (df["padj"] < pval_threshold), "regulation"] = "up"
    df.loc[(df["log2FC"] < -fc_threshold) & (df["padj"] < pval_threshold), "regulation"] = "down"

    color_map = {"up": "#D62728", "down": "#1F77B4", "ns": "#AAAAAA"}

    fig, ax = plt.subplots(figsize=(7, 6))
    for reg, grp in df.groupby("regulation"):
        ax.scatter(grp["log2FC"], grp["-log10p"],
                   c=color_map[reg], s=12, alpha=0.7, linewidths=0, label=reg)

    # Threshold lines
    ax.axhline(-np.log10(pval_threshold), color="k", lw=0.8, ls="--", alpha=0.5)
    ax.axvline( fc_threshold,  color="k", lw=0.8, ls="--", alpha=0.5)
    ax.axvline(-fc_threshold, color="k", lw=0.8, ls="--", alpha=0.5)

    # Label top significant proteins by -log10p
    top = df[df["regulation"] != "ns"].nlargest(top_n_labels, "-log10p")
    for _, row in top.iterrows():
        ax.text(row["log2FC"], row["-log10p"] + 0.1, row["gene"],
                fontsize=6, ha="center", va="bottom")

    # Counts in legend
    up_n   = (df["regulation"] == "up").sum()
    down_n = (df["regulation"] == "down").sum()
    patches = [
        mpatches.Patch(color="#D62728", label=f"Up ({up_n})"),
        mpatches.Patch(color="#1F77B4", label=f"Down ({down_n})"),
        mpatches.Patch(color="#AAAAAA", label="ns"),
    ]
    ax.legend(handles=patches, fontsize=8, frameon=False)
    ax.set_xlabel("log₂ Fold Change (treat / ctrl)", fontsize=11)
    ax.set_ylabel("-log₁₀(adjusted p-value)", fontsize=11)
    ax.set_title("Differential Protein Abundance", fontsize=12)
    plt.tight_layout()
    plt.savefig(save_path, dpi=300, bbox_inches="tight")
    plt.show()
    print(f"Saved: {save_path}")

plot_volcano(results, pg["Gene names"])

Step 8: GO/Pathway Enrichment of Significant Proteins

Run over-representation analysis (ORA) on significantly up- and down-regulated proteins using gseapy's Enrichr API.

import gseapy as gp

def run_enrichment(results: pd.DataFrame,
                   gene_names: pd.Series,
                   gene_sets: list[str] | None = None,
                   top_n: int = 20) -> dict[str, pd.DataFrame]:
    """
    ORA enrichment for up- and down-regulated proteins via Enrichr.

    Parameters
    ----------
    gene_sets : Enrichr gene set libraries (default: GO BP + KEGG)
    top_n     : top results to display per direction
    """
    if gene_sets is None:
        gene_sets = ["GO_Biological_Process_2023", "KEGG_2021_Human"]

    results_out = {}
    gene_map = gene_names.reindex(results.index).fillna("Unknown")

    for direction in ("up", "down"):
        sig = results[
            (results["significant"]) &
            (results["log2FC"] > 0 if direction == "up" else results["log2FC"] < 0)
        ]
        gene_list = gene_map.reindex(sig.index).tolist()
        print(f"{direction.capitalize()}-regulated: {len(gene_list)} proteins")

        if len(gene_list) < 5:
            print(f"  Too few proteins for enrichment (n={len(gene_list)}), skipping")
            continue

        enr = gp.enrichr(
            gene_list=gene_list,
            gene_sets=gene_sets,
            organism="Human",
            outdir=f"enrichment_{direction}",
            cutoff=0.05,
        )
        top_results = enr.results.sort_values("Adjusted P-value").head(top_n)
        results_out[direction] = top_results
        print(top_results[["Term", "Adjusted P-value", "Overlap"]].to_string(index=False))

    return results_out

enr_results = run_enrichment(results, pg["Gene names"])

Key Parameters

Parameter	Default	Range / Options	Effect
`matchBetweenRuns`	`False`	`True` / `False`	Transfers identifications across runs by retention time matching; increases quantified protein count 10–30%
`lfqMinRatioCount`	`2`	`1`–`5`	Minimum peptide pairs required for LFQ normalization; lower values increase coverage but reduce accuracy
`maxMissedCleavages`	`2`	`0`–`4`	Tryptic missed cleavages allowed; increase for samples with poor digestion
`peptideFdr` / `proteinFdr`	`0.01`	`0.001`–`0.05`	FDR thresholds for peptide and protein identifications
MNAR `downshift`	`1.8`	`1.5`–`2.5`	Shifts imputation distribution below detection limit in units of column std; larger = more conservative imputation
MNAR `width`	`0.3`	`0.1`–`0.5`	Width of imputed distribution relative to column std
t-test `alpha`	`0.05`	`0.01`–`0.1`	FDR significance threshold for differential abundance
`fc_threshold` (volcano)	`1.0`	`0.5`–`2.0`	log2 fold-change cutoff for "significant" label in volcano plot

Key Concepts

MaxQuant Output Files

File	Content
`proteinGroups.txt`	Primary output: one row per protein group with LFQ/SILAC intensities, peptide counts, sequence coverage
`peptides.txt`	Peptide-level quantification with charge states and modifications
`evidence.txt`	Individual MS/MS identifications (one row per peptide-spectrum match)
`msms.txt`	Full MS/MS scan data including fragment ions and scores
`summary.txt`	Per-raw-file statistics: identifications, MS/MS counts, calibration

LFQ Intensity vs. iBAQ

LFQ (Label-Free Quantification): MaxLFQ algorithm normalizes intensities across samples based on razor+unique peptide ratios. Use for cross-sample comparisons (fold changes). Stored in LFQ intensity <sample> columns.
iBAQ (intensity-Based Absolute Quantification): Divides summed peptide intensities by the number of theoretically observable peptides. Use for estimating copy numbers and comparing absolute abundance between proteins within a sample. Stored in iBAQ column.
SILAC ratio: Direct H/L ratio from isotope-labeled pairs. More accurate than LFQ for small fold changes.

Perseus Equivalent Operations in Python

Perseus step	Python equivalent
Filter rows by categorical column	`df[df["Reverse"] != "+"]`
Replace 0 with NaN	`df.replace(0, np.nan)`
Log2 transform	`np.log2(df)`
Median normalization	`df.subtract(df.median()).add(global_median)`
MNAR imputation (normal distribution)	`impute_mnar()` function above
Two-sample t-test	`scipy.stats.ttest_ind()` + `multipletests()`
Volcano plot	`matplotlib.pyplot` scatter + threshold lines
Hierarchical clustering	`seaborn.clustermap()`

Common Recipes

Recipe: SILAC Ratio Analysis

When to use: SILAC experiments with H/L or H/M/L labeling instead of LFQ.

import pandas as pd
import numpy as np

# Load proteinGroups.txt for SILAC experiment
df = pd.read_csv("combined/txt/proteinGroups.txt", sep="\t", low_memory=False)

# Filter contaminants and decoys
df = df[(df["Reverse"] != "+") & (df["Potential contaminant"] != "+")].copy()

# Extract H/L ratio columns (log2-transformed)
ratio_cols = [c for c in df.columns if c.startswith("Ratio H/L ") and "normalized" in c.lower()]
if not ratio_cols:
    # Fall back to non-normalized
    ratio_cols = [c for c in df.columns if c.startswith("Ratio H/L")]

print(f"SILAC ratio columns: {ratio_cols}")
ratios = df[ratio_cols].copy().replace(0, np.nan)

# Log2 transform ratios
log2_ratios = np.log2(ratios)
log2_ratios.columns = [c.replace("Ratio H/L normalized ", "") for c in ratio_cols]

# Summary statistics per sample
print(log2_ratios.describe().round(3))

Recipe: Hierarchical Clustering Heatmap

When to use: visualizing patterns across all significant proteins simultaneously.

import seaborn as sns
import matplotlib.pyplot as plt

def plot_heatmap(lfq_imputed: pd.DataFrame,
                 results: pd.DataFrame,
                 gene_names: pd.Series,
                 top_n: int = 50,
                 save_path: str = "heatmap.pdf") -> None:
    """Hierarchical clustering heatmap of top significant proteins."""
    sig_proteins = results[results["significant"]].nlargest(top_n, "-log10p" if "-log10p" in results else "padj").index
    # Recalculate if needed
    sig_proteins = results[results["significant"]].nsmallest(top_n, "padj").index

    heatmap_data = lfq_imputed.loc[sig_proteins].copy()
    heatmap_data.index = gene_names.reindex(sig_proteins).fillna(sig_proteins)

    # Z-score per row for visualization
    heatmap_z = heatmap_data.subtract(heatmap_data.mean(axis=1), axis=0).divide(
        heatmap_data.std(axis=1).replace(0, 1), axis=0
    )

    g = sns.clustermap(
        heatmap_z,
        cmap="RdBu_r",
        center=0,
        vmin=-2.5, vmax=2.5,
        figsize=(8, 10),
        yticklabels=True,
        xticklabels=True,
        dendrogram_ratio=(0.15, 0.1),
        cbar_kws={"label": "Z-score (log₂ LFQ)"},
    )
    g.ax_heatmap.set_yticklabels(g.ax_heatmap.get_yticklabels(), fontsize=6)
    plt.savefig(save_path, dpi=300, bbox_inches="tight")
    print(f"Saved: {save_path}")

plot_heatmap(lfq_imputed, results, pg["Gene names"])

Recipe: STRING-db Network Enrichment for Significant Proteins

When to use: protein-protein interaction network analysis and enrichment without downloading gene sets locally.

import requests
import pandas as pd

def string_enrichment(gene_list: list[str],
                      species: int = 9606,
                      fdr_threshold: float = 0.05) -> pd.DataFrame:
    """Query STRING /enrichment endpoint for GO/KEGG enrichment."""
    url = "https://string-db.org/api/json/enrichment"
    params = {
        "identifiers": "\r".join(gene_list),
        "species": species,
        "caller_identity": "maxquant_proteomics_skill",
    }
    response = requests.post(url, data=params)
    response.raise_for_status()

    enr_df = pd.DataFrame(response.json())
    if enr_df.empty:
        print("No enrichment results returned")
        return enr_df

    enr_df = enr_df[enr_df["fdr"].astype(float) < fdr_threshold]
    enr_df = enr_df.sort_values("fdr")
    print(f"Enriched terms (FDR < {fdr_threshold}): {len(enr_df)}")
    print(enr_df[["category", "term", "description", "fdr", "number_of_genes"]].head(15).to_string(index=False))
    return enr_df

# Significant up-regulated gene names
up_genes = pg.loc[
    results[(results["significant"]) & (results["log2FC"] > 1)].index, "Gene names"
].tolist()

string_enr = string_enrichment(up_genes)

Recipe: Parse MaxQuant Output with pyMaxQuant

When to use: reading and filtering MaxQuant text files with a higher-level API.

# pyMaxQuant provides typed accessors for MaxQuant output files
# Install: pip install pymaxquant
from maxquant.io import read_protein_groups

# Load with built-in contaminant filtering
pg_clean = read_protein_groups(
    "combined/txt/proteinGroups.txt",
    filter_invalid=True,      # removes reverse, contaminant, only-by-site
)
print(f"Loaded {len(pg_clean)} filtered protein groups")

# Access LFQ columns via helper
lfq_df = pg_clean.filter(like="LFQ intensity")
print(f"LFQ matrix: {lfq_df.shape}")

Expected Outputs

File	Description
`combined/txt/proteinGroups.txt`	Main MaxQuant output: protein groups with LFQ intensities, peptide counts, unique peptides, iBAQ
`combined/txt/peptides.txt`	Peptide-level quantification with modifications and charge states
`combined/txt/summary.txt`	Per-raw-file QC statistics: identification rates, MS/MS counts
`results_differential.csv`	Differential abundance table: `log2FC`, `pvalue`, `padj`, `significant` per protein
`volcano_plot.pdf`	Volcano plot with up/down-regulated proteins colored and top proteins labeled
`heatmap.pdf`	Hierarchical clustering heatmap of top significant proteins (Z-score normalized)
`enrichment_up/`	gseapy output directory: GO/KEGG enrichment for up-regulated proteins
`enrichment_down/`	gseapy output directory: GO/KEGG enrichment for down-regulated proteins

Troubleshooting

Problem	Cause	Solution
MaxQuant produces 0 protein identifications	Wrong FASTA database or enzyme settings; raw file path not found	Verify `.raw` file paths in `mqpar.xml` are absolute Windows paths; confirm enzyme matches experiment (Trypsin/P vs Trypsin); check `summary.txt` for identification rate
All LFQ intensities are 0 after filtering	`matchBetweenRuns` off + sparse data, or wrong column selection	Check `combined/txt/proteinGroups.txt` directly; use `pg.filter(like="LFQ intensity")` to confirm column names; lower `lfqMinRatioCount` to 1
Too many missing values after log2 transform	Insufficient replicates, inconsistent sample loading, or undetected peptides	Enable `matchBetweenRuns`; verify equal protein loading (Bradford/BCA); consider stricter valid-value filter (require 3/3 per group) before imputation
Memory error loading `proteinGroups.txt`	File is large (>500 MB for DDA with many samples)	Use `pd.read_csv(..., low_memory=False, usecols=[...])` to select only needed columns; or use `pd.read_csv(..., chunksize=...)`
gseapy Enrichr returns empty results	Gene symbols unrecognized or network timeout	Ensure gene list uses HGNC symbols (not UniProt IDs); check internet connectivity; use `gp.enrichr(..., timeout=60)`
Volcano plot: all proteins in "ns"	FDR threshold too stringent or `padj` not calculated	Verify `multipletests` returned valid FDR values; try relaxing `alpha` to 0.1; check sample group assignments are correct
MaxQuant run hangs at "Feature detection"	Low memory (MaxQuant needs 4–8 GB RAM per 3–4 raw files)	Process files in smaller batches; increase system RAM; close other applications
Imputation inflates false positives	Imputing too aggressively (low `downshift`)	Increase `downshift` to 2.0–2.5; alternatively, filter to proteins with ≥ 2 valid values per group before testing

References

MaxQuant software and documentation — official MaxQuant download, manual, and parameter guide
Perseus computational platform — MaxQuant companion statistical analysis tool; basis for the Python workflow above
Cox et al. (2008) Nature Biotechnology — MaxQuant paper — original MaxQuant publication describing the MaxLFQ algorithm
Tyanova et al. (2016) Nature Methods — Perseus paper — Perseus statistical framework reference; all workflow steps above mirror Perseus operations
pyMaxQuant GitHub — Python library for parsing MaxQuant output files
gseapy documentation — gene set enrichment analysis library used in Step 8
STRING API enrichment — protein network enrichment endpoint used in Common Recipes

Similar Skills

proteomics-de

778

6 files

clawbio

tooluniverse-proteomics-analysis

1.3k

3 files

mims-harvard-tooluniverse

pyopenms-mass-spectrometry

135

Processes LC-MS/MS data using PyOpenMS for proteomics/metabolomics: mzML/mzXML I/O, signal processing (smoothing/peak picking/centroiding), feature detection/linking, peptide/protein ID with FDR.

3 files

sciagent-skills

Stats

Stars135

Forks16

Last CommitApr 28, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

MaxQuant + Perseus — Proteomics Analysis Pipeline

Overview

MaxQuant is the community-standard software for label-free quantification (LFQ) and SILAC proteomics. It performs database search, protein grouping, and intensity-based quantification from raw LC-MS/MS files, producing proteinGroups.txt as the primary output. Downstream statistical analysis — filtering, normalization, imputation, differential abundance testing, and visualization — is performed in Python using pandas, scipy, and matplotlib/seaborn, mirroring the Perseus workflow in a reproducible scripting environment.

When to Use

Performing label-free quantification (LFQ) of proteins across multiple biological conditions — MaxQuant's MaxLFQ algorithm is the community benchmark
Running SILAC (stable isotope labeling) experiments with light/heavy or triple-label designs
Processing iTRAQ or TMT isobaric labeling experiments via MaxQuant's reporter ion quantification
Identifying and quantifying proteins when you need the widely-cited MaxQuant output format (proteinGroups.txt) for comparison with published datasets
Performing statistical differential abundance analysis on MaxQuant outputs without installing Perseus (GUI-only, Windows)
Generating publication-quality volcano plots and GO enrichment from proteomics data in a reproducible Python workflow
Use Proteome Discoverer instead when working with Thermo raw files requiring instrument-native processing or Sequest HT
Use FragPipe/MSFragger instead for GPU-accelerated database search (3–10× faster) or when processing DIA (data-independent acquisition) data

Prerequisites

MaxQuant: Windows software; download from https://maxquant.org/ (v2.4+); requires .NET 6 runtime
Python packages: pandas, numpy, scipy, matplotlib, seaborn, statsmodels, gseapy
Data requirements: Thermo .raw files or mzML-converted files; FASTA protein database (UniProt reviewed + contaminant database)
Environment: MaxQuant runs on Windows (GUI or CLI); Python analysis runs cross-platform

pip install pandas numpy scipy matplotlib seaborn statsmodels gseapy

# Install pyMaxQuant for programmatic mqpar.xml configuration
pip install pymaxquant

Quick Start

import pandas as pd
import numpy as np

# Load MaxQuant output
df = pd.read_csv("combined/txt/proteinGroups.txt", sep="\t", low_memory=False)
print(f"Raw protein groups: {len(df)}")

# Filter contaminants, reverse decoys, only-by-site
mask = (
    (df["Potential contaminant"] != "+") &
    (df["Reverse"] != "+") &
    (df["Only identified by site"] != "+")
)
df = df[mask].copy()
print(f"After filtering: {len(df)} protein groups")

# Extract LFQ intensity columns
lfq_cols = [c for c in df.columns if c.startswith("LFQ intensity ")]
print(f"LFQ columns: {lfq_cols}")

# Log2-transform (0 → NaN)
lfq = df[lfq_cols].replace(0, np.nan)
lfq = np.log2(lfq)
print(f"Valid values per sample:\n{lfq.notna().sum()}")

Workflow

Step 1: Configure MaxQuant Parameters via mqpar.xml

MaxQuant is controlled by an XML parameter file (mqpar.xml). Edit it programmatically to set file paths, enzyme, modifications, and quantification type before running the search.

import xml.etree.ElementTree as ET

def update_mqpar(template_path: str, output_path: str,
                 raw_files: list[str], fasta_path: str,
                 experiment_names: list[str]) -> None:
    """Update mqpar.xml with sample-specific file paths."""
    tree = ET.parse(template_path)
    root = tree.getroot()

    # Set raw file paths
    file_paths_node = root.find(".//filePaths")
    file_paths_node.clear()
    for rf in raw_files:
        elem = ET.SubElement(file_paths_node, "string")
        elem.text = rf

    # Set experiment names (maps files to conditions)
    experiments_node = root.find(".//experiments")
    experiments_node.clear()
    for name in experiment_names:
        elem = ET.SubElement(experiments_node, "string")
        elem.text = name

    # Set FASTA database
    fasta_node = root.find(".//fastaFiles/FastaFileInfo/fastaFilePath")
    fasta_node.text = fasta_path

    tree.write(output_path, xml_declaration=True, encoding="utf-8")
    print(f"Written: {output_path}")

# Example usage
raw_files = [
    r"C:\Data\ctrl_rep1.raw",
    r"C:\Data\ctrl_rep2.raw",
    r"C:\Data\treat_rep1.raw",
    r"C:\Data\treat_rep2.raw",
]
update_mqpar(
    template_path="mqpar_template.xml",
    output_path="mqpar.xml",
    raw_files=raw_files,
    fasta_path=r"C:\Databases\human_uniprot_contaminants.fasta",
    experiment_names=["ctrl", "ctrl", "treat", "treat"],
)

Key mqpar.xml parameters (set in template or edit directly):

<!-- Enzyme and search settings -->
<enzymes>
  <string>Trypsin/P</string>
</enzymes>
<maxMissedCleavages>2</maxMissedCleavages>
<variableModifications>
  <string>Oxidation (M)</string>
  <string>Acetyl (Protein N-term)</string>
</variableModifications>
<fixedModifications>
  <string>Carbamidomethyl (C)</string>
</fixedModifications>

<!-- LFQ settings -->
<lfqMode>1</lfqMode>                     <!-- 1 = LFQ enabled -->
<lfqMinRatioCount>2</lfqMinRatioCount>   <!-- minimum peptides for LFQ -->
<matchBetweenRuns>True</matchBetweenRuns>

<!-- FDR thresholds -->
<peptideFdr>0.01</peptideFdr>
<proteinFdr>0.01</proteinFdr>

Step 2: Run MaxQuant from Command Line (Windows)

MaxQuant can be run headlessly from the Windows command prompt using the bundled MaxQuantCmd.exe.

REM Windows Command Prompt — run MaxQuant with configured mqpar.xml
REM Adjust path to match your MaxQuant installation directory

set MQ_PATH=C:\Program Files\MaxQuant\bin\MaxQuantCmd.exe
set MQPAR=C:\Projects\proteomics\mqpar.xml

"%MQ_PATH%" "%MQPAR%"

REM For specific workflow steps only (useful for reruns):
REM Step IDs: 0=write tables, 1=feature detection, 7=peptide identification
"%MQ_PATH%" "%MQPAR%" --steps 1,7,11

# Cross-platform: run MaxQuant under Wine on Linux/macOS (CI/server use)
wine MaxQuantCmd.exe mqpar.xml

# Monitor progress log
tail -f combined/proc/#runningTimes.txt

Step 3: Load and Filter proteinGroups.txt

Filter out reverse decoys, potential contaminants, and proteins only identified by modification site.

import pandas as pd
import numpy as np

def load_protein_groups(path: str) -> pd.DataFrame:
    """Load MaxQuant proteinGroups.txt with quality filters applied."""
    df = pd.read_csv(path, sep="\t", low_memory=False)
    print(f"Total protein groups: {len(df)}")

    # Remove reverse decoys, contaminants, and only-by-site hits
    n_before = len(df)
    df = df[
        (df.get("Reverse", pd.Series("")) != "+") &
        (df.get("Potential contaminant", pd.Series("")) != "+") &
        (df.get("Only identified by site", pd.Series("")) != "+")
    ].copy()
    print(f"After quality filter: {len(df)} ({n_before - len(df)} removed)")

    # Parse gene names (take first entry for multi-gene groups)
    df["Gene names"] = df["Gene names"].fillna("Unknown").str.split(";").str[0]

    # Set unique index on majority protein ID
    df = df.set_index("Majority protein IDs")
    return df

# Load output
pg = load_protein_groups("combined/txt/proteinGroups.txt")

# Identify LFQ intensity columns
lfq_cols = [c for c in pg.columns if c.startswith("LFQ intensity ")]
print(f"LFQ samples ({len(lfq_cols)}): {lfq_cols}")
# Output: LFQ samples (6): ['LFQ intensity ctrl_1', 'LFQ intensity ctrl_2', ...]

Step 4: Log2 Transform and Median Normalize LFQ Intensities

Replace zero intensities with NaN (missing values in MaxQuant are exported as 0), log2-transform, then apply per-sample median centering.

def prepare_lfq_matrix(df: pd.DataFrame, lfq_cols: list[str]) -> pd.DataFrame:
    """Extract, transform, and normalize LFQ intensity matrix."""
    # Extract and rename columns (strip 'LFQ intensity ' prefix)
    lfq = df[lfq_cols].copy()
    lfq.columns = [c.replace("LFQ intensity ", "") for c in lfq_cols]

    # Replace 0 with NaN (MaxQuant encodes missing as 0)
    lfq = lfq.replace(0, np.nan)

    # Log2 transform
    lfq = np.log2(lfq)

    # Median centering per sample (subtract per-column median of valid values)
    col_medians = lfq.median(axis=0)
    global_median = col_medians.median()
    lfq = lfq.subtract(col_medians, axis=1).add(global_median)

    print(f"Matrix shape: {lfq.shape}")
    print(f"Missing values per sample:\n{lfq.isna().sum()}")
    print(f"Valid values per sample:\n{lfq.notna().sum()}")
    return lfq

lfq_matrix = prepare_lfq_matrix(pg, lfq_cols)
# Matrix shape: (3241, 6)
# Missing values per sample: ctrl_1: 421, ctrl_2: 389, ...

Step 5: Impute Missing Values (MNAR Strategy)

Missing-not-at-random (MNAR) values arise from proteins below the detection limit. Impute from the low end of the observed intensity distribution — the standard Perseus approach.

def impute_mnar(lfq: pd.DataFrame,
                width: float = 0.3,
                downshift: float = 1.8,
                random_state: int = 42) -> pd.DataFrame:
    """
    Impute MNAR missing values from a downshifted Gaussian.

    Parameters
    ----------
    width      : std of imputation distribution (fraction of sample std)
    downshift  : downshift in units of sample std below mean
    random_state : for reproducibility
    """
    rng = np.random.default_rng(random_state)
    lfq_imp = lfq.copy()

    for col in lfq_imp.columns:
        col_data = lfq_imp[col].dropna()
        col_mean = col_data.mean()
        col_std = col_data.std()

        n_missing = lfq_imp[col].isna().sum()
        if n_missing > 0:
            imputed = rng.normal(
                loc=col_mean - downshift * col_std,
                scale=width * col_std,
                size=n_missing,
            )
            lfq_imp.loc[lfq_imp[col].isna(), col] = imputed

    print(f"Imputed {lfq.isna().sum().sum()} missing values")
    return lfq_imp

lfq_imputed = impute_mnar(lfq_matrix)
# Imputed 2847 missing values

Step 6: Statistical Testing — t-test with FDR Correction

Perform two-sample t-tests for each protein between conditions, then apply Benjamini-Hochberg FDR correction.

from scipy import stats
from statsmodels.stats.multitest import multipletests

def differential_abundance(lfq: pd.DataFrame,
                            group_a: list[str],
                            group_b: list[str],
                            alpha: float = 0.05) -> pd.DataFrame:
    """
    Two-sample t-test + BH FDR correction for all proteins.

    Parameters
    ----------
    group_a, group_b : sample name lists for each condition
    alpha : FDR threshold
    """
    results = []
    for protein_id, row in lfq.iterrows():
        a_vals = row[group_a].dropna().values
        b_vals = row[group_b].dropna().values

        if len(a_vals) >= 2 and len(b_vals) >= 2:
            t_stat, p_val = stats.ttest_ind(a_vals, b_vals, equal_var=False)
            log2fc = b_vals.mean() - a_vals.mean()
        else:
            t_stat, p_val, log2fc = np.nan, np.nan, np.nan

        results.append({
            "protein_id": protein_id,
            "log2FC": log2fc,
            "pvalue": p_val,
            "t_stat": t_stat,
        })

    res_df = pd.DataFrame(results).set_index("protein_id")

    # BH FDR correction on valid p-values
    valid = res_df["pvalue"].notna()
    _, padj, _, _ = multipletests(res_df.loc[valid, "pvalue"], method="fdr_bh")
    res_df.loc[valid, "padj"] = padj

    # Add significance flag
    res_df["significant"] = (res_df["padj"] < alpha) & (res_df["pvalue"].notna())
    sig_count = res_df["significant"].sum()
    print(f"Significant proteins (FDR < {alpha}): {sig_count}")

    return res_df.sort_values("padj")

# Define sample groups
group_ctrl  = ["ctrl_1",  "ctrl_2",  "ctrl_3"]
group_treat = ["treat_1", "treat_2", "treat_3"]

results = differential_abundance(lfq_imputed, group_ctrl, group_treat)
print(results[results["significant"]].head(10))
# Significant proteins (FDR < 0.05): 312

Step 7: Volcano Plot Visualization

Generate a publication-quality volcano plot showing log2 fold change vs. -log10(p-value) with significance thresholds highlighted.

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

def plot_volcano(results: pd.DataFrame,
                 gene_names: pd.Series,
                 fc_threshold: float = 1.0,
                 pval_threshold: float = 0.05,
                 top_n_labels: int = 10,
                 save_path: str = "volcano_plot.pdf") -> None:
    """Volcano plot: log2FC vs -log10(adjusted p-value)."""
    df = results.copy()
    df["gene"] = gene_names.reindex(df.index).fillna("Unknown")
    df["-log10p"] = -np.log10(df["padj"].clip(lower=1e-300))

    # Classify regulation
    df["regulation"] = "ns"
    df.loc[(df["log2FC"] >  fc_threshold) & (df["padj"] < pval_threshold), "regulation"] = "up"
    df.loc[(df["log2FC"] < -fc_threshold) & (df["padj"] < pval_threshold), "regulation"] = "down"

    color_map = {"up": "#D62728", "down": "#1F77B4", "ns": "#AAAAAA"}

    fig, ax = plt.subplots(figsize=(7, 6))
    for reg, grp in df.groupby("regulation"):
        ax.scatter(grp["log2FC"], grp["-log10p"],
                   c=color_map[reg], s=12, alpha=0.7, linewidths=0, label=reg)

    # Threshold lines
    ax.axhline(-np.log10(pval_threshold), color="k", lw=0.8, ls="--", alpha=0.5)
    ax.axvline( fc_threshold,  color="k", lw=0.8, ls="--", alpha=0.5)
    ax.axvline(-fc_threshold, color="k", lw=0.8, ls="--", alpha=0.5)

    # Label top significant proteins by -log10p
    top = df[df["regulation"] != "ns"].nlargest(top_n_labels, "-log10p")
    for _, row in top.iterrows():
        ax.text(row["log2FC"], row["-log10p"] + 0.1, row["gene"],
                fontsize=6, ha="center", va="bottom")

    # Counts in legend
    up_n   = (df["regulation"] == "up").sum()
    down_n = (df["regulation"] == "down").sum()
    patches = [
        mpatches.Patch(color="#D62728", label=f"Up ({up_n})"),
        mpatches.Patch(color="#1F77B4", label=f"Down ({down_n})"),
        mpatches.Patch(color="#AAAAAA", label="ns"),
    ]
    ax.legend(handles=patches, fontsize=8, frameon=False)
    ax.set_xlabel("log₂ Fold Change (treat / ctrl)", fontsize=11)
    ax.set_ylabel("-log₁₀(adjusted p-value)", fontsize=11)
    ax.set_title("Differential Protein Abundance", fontsize=12)
    plt.tight_layout()
    plt.savefig(save_path, dpi=300, bbox_inches="tight")
    plt.show()
    print(f"Saved: {save_path}")

plot_volcano(results, pg["Gene names"])

Step 8: GO/Pathway Enrichment of Significant Proteins

Run over-representation analysis (ORA) on significantly up- and down-regulated proteins using gseapy's Enrichr API.

import gseapy as gp

def run_enrichment(results: pd.DataFrame,
                   gene_names: pd.Series,
                   gene_sets: list[str] | None = None,
                   top_n: int = 20) -> dict[str, pd.DataFrame]:
    """
    ORA enrichment for up- and down-regulated proteins via Enrichr.

    Parameters
    ----------
    gene_sets : Enrichr gene set libraries (default: GO BP + KEGG)
    top_n     : top results to display per direction
    """
    if gene_sets is None:
        gene_sets = ["GO_Biological_Process_2023", "KEGG_2021_Human"]

    results_out = {}
    gene_map = gene_names.reindex(results.index).fillna("Unknown")

    for direction in ("up", "down"):
        sig = results[
            (results["significant"]) &
            (results["log2FC"] > 0 if direction == "up" else results["log2FC"] < 0)
        ]
        gene_list = gene_map.reindex(sig.index).tolist()
        print(f"{direction.capitalize()}-regulated: {len(gene_list)} proteins")

        if len(gene_list) < 5:
            print(f"  Too few proteins for enrichment (n={len(gene_list)}), skipping")
            continue

        enr = gp.enrichr(
            gene_list=gene_list,
            gene_sets=gene_sets,
            organism="Human",
            outdir=f"enrichment_{direction}",
            cutoff=0.05,
        )
        top_results = enr.results.sort_values("Adjusted P-value").head(top_n)
        results_out[direction] = top_results
        print(top_results[["Term", "Adjusted P-value", "Overlap"]].to_string(index=False))

    return results_out

enr_results = run_enrichment(results, pg["Gene names"])

Key Parameters

Parameter	Default	Range / Options	Effect
`matchBetweenRuns`	`False`	`True` / `False`	Transfers identifications across runs by retention time matching; increases quantified protein count 10–30%
`lfqMinRatioCount`	`2`	`1`–`5`	Minimum peptide pairs required for LFQ normalization; lower values increase coverage but reduce accuracy
`maxMissedCleavages`	`2`	`0`–`4`	Tryptic missed cleavages allowed; increase for samples with poor digestion
`peptideFdr` / `proteinFdr`	`0.01`	`0.001`–`0.05`	FDR thresholds for peptide and protein identifications
MNAR `downshift`	`1.8`	`1.5`–`2.5`	Shifts imputation distribution below detection limit in units of column std; larger = more conservative imputation
MNAR `width`	`0.3`	`0.1`–`0.5`	Width of imputed distribution relative to column std
t-test `alpha`	`0.05`	`0.01`–`0.1`	FDR significance threshold for differential abundance
`fc_threshold` (volcano)	`1.0`	`0.5`–`2.0`	log2 fold-change cutoff for "significant" label in volcano plot

Key Concepts

MaxQuant Output Files

File	Content
`proteinGroups.txt`	Primary output: one row per protein group with LFQ/SILAC intensities, peptide counts, sequence coverage
`peptides.txt`	Peptide-level quantification with charge states and modifications
`evidence.txt`	Individual MS/MS identifications (one row per peptide-spectrum match)
`msms.txt`	Full MS/MS scan data including fragment ions and scores
`summary.txt`	Per-raw-file statistics: identifications, MS/MS counts, calibration

LFQ Intensity vs. iBAQ

LFQ (Label-Free Quantification): MaxLFQ algorithm normalizes intensities across samples based on razor+unique peptide ratios. Use for cross-sample comparisons (fold changes). Stored in LFQ intensity <sample> columns.
iBAQ (intensity-Based Absolute Quantification): Divides summed peptide intensities by the number of theoretically observable peptides. Use for estimating copy numbers and comparing absolute abundance between proteins within a sample. Stored in iBAQ column.
SILAC ratio: Direct H/L ratio from isotope-labeled pairs. More accurate than LFQ for small fold changes.

Perseus Equivalent Operations in Python

Perseus step	Python equivalent
Filter rows by categorical column	`df[df["Reverse"] != "+"]`
Replace 0 with NaN	`df.replace(0, np.nan)`
Log2 transform	`np.log2(df)`
Median normalization	`df.subtract(df.median()).add(global_median)`
MNAR imputation (normal distribution)	`impute_mnar()` function above
Two-sample t-test	`scipy.stats.ttest_ind()` + `multipletests()`
Volcano plot	`matplotlib.pyplot` scatter + threshold lines
Hierarchical clustering	`seaborn.clustermap()`

Common Recipes

Recipe: SILAC Ratio Analysis

When to use: SILAC experiments with H/L or H/M/L labeling instead of LFQ.

import pandas as pd
import numpy as np

# Load proteinGroups.txt for SILAC experiment
df = pd.read_csv("combined/txt/proteinGroups.txt", sep="\t", low_memory=False)

# Filter contaminants and decoys
df = df[(df["Reverse"] != "+") & (df["Potential contaminant"] != "+")].copy()

# Extract H/L ratio columns (log2-transformed)
ratio_cols = [c for c in df.columns if c.startswith("Ratio H/L ") and "normalized" in c.lower()]
if not ratio_cols:
    # Fall back to non-normalized
    ratio_cols = [c for c in df.columns if c.startswith("Ratio H/L")]

print(f"SILAC ratio columns: {ratio_cols}")
ratios = df[ratio_cols].copy().replace(0, np.nan)

# Log2 transform ratios
log2_ratios = np.log2(ratios)
log2_ratios.columns = [c.replace("Ratio H/L normalized ", "") for c in ratio_cols]

# Summary statistics per sample
print(log2_ratios.describe().round(3))

Recipe: Hierarchical Clustering Heatmap

When to use: visualizing patterns across all significant proteins simultaneously.

import seaborn as sns
import matplotlib.pyplot as plt

def plot_heatmap(lfq_imputed: pd.DataFrame,
                 results: pd.DataFrame,
                 gene_names: pd.Series,
                 top_n: int = 50,
                 save_path: str = "heatmap.pdf") -> None:
    """Hierarchical clustering heatmap of top significant proteins."""
    sig_proteins = results[results["significant"]].nlargest(top_n, "-log10p" if "-log10p" in results else "padj").index
    # Recalculate if needed
    sig_proteins = results[results["significant"]].nsmallest(top_n, "padj").index

    heatmap_data = lfq_imputed.loc[sig_proteins].copy()
    heatmap_data.index = gene_names.reindex(sig_proteins).fillna(sig_proteins)

    # Z-score per row for visualization
    heatmap_z = heatmap_data.subtract(heatmap_data.mean(axis=1), axis=0).divide(
        heatmap_data.std(axis=1).replace(0, 1), axis=0
    )

    g = sns.clustermap(
        heatmap_z,
        cmap="RdBu_r",
        center=0,
        vmin=-2.5, vmax=2.5,
        figsize=(8, 10),
        yticklabels=True,
        xticklabels=True,
        dendrogram_ratio=(0.15, 0.1),
        cbar_kws={"label": "Z-score (log₂ LFQ)"},
    )
    g.ax_heatmap.set_yticklabels(g.ax_heatmap.get_yticklabels(), fontsize=6)
    plt.savefig(save_path, dpi=300, bbox_inches="tight")
    print(f"Saved: {save_path}")

plot_heatmap(lfq_imputed, results, pg["Gene names"])

Recipe: STRING-db Network Enrichment for Significant Proteins

When to use: protein-protein interaction network analysis and enrichment without downloading gene sets locally.

import requests
import pandas as pd

def string_enrichment(gene_list: list[str],
                      species: int = 9606,
                      fdr_threshold: float = 0.05) -> pd.DataFrame:
    """Query STRING /enrichment endpoint for GO/KEGG enrichment."""
    url = "https://string-db.org/api/json/enrichment"
    params = {
        "identifiers": "\r".join(gene_list),
        "species": species,
        "caller_identity": "maxquant_proteomics_skill",
    }
    response = requests.post(url, data=params)
    response.raise_for_status()

    enr_df = pd.DataFrame(response.json())
    if enr_df.empty:
        print("No enrichment results returned")
        return enr_df

    enr_df = enr_df[enr_df["fdr"].astype(float) < fdr_threshold]
    enr_df = enr_df.sort_values("fdr")
    print(f"Enriched terms (FDR < {fdr_threshold}): {len(enr_df)}")
    print(enr_df[["category", "term", "description", "fdr", "number_of_genes"]].head(15).to_string(index=False))
    return enr_df

# Significant up-regulated gene names
up_genes = pg.loc[
    results[(results["significant"]) & (results["log2FC"] > 1)].index, "Gene names"
].tolist()

string_enr = string_enrichment(up_genes)

Recipe: Parse MaxQuant Output with pyMaxQuant

When to use: reading and filtering MaxQuant text files with a higher-level API.

# pyMaxQuant provides typed accessors for MaxQuant output files
# Install: pip install pymaxquant
from maxquant.io import read_protein_groups

# Load with built-in contaminant filtering
pg_clean = read_protein_groups(
    "combined/txt/proteinGroups.txt",
    filter_invalid=True,      # removes reverse, contaminant, only-by-site
)
print(f"Loaded {len(pg_clean)} filtered protein groups")

# Access LFQ columns via helper
lfq_df = pg_clean.filter(like="LFQ intensity")
print(f"LFQ matrix: {lfq_df.shape}")

Expected Outputs

File	Description
`combined/txt/proteinGroups.txt`	Main MaxQuant output: protein groups with LFQ intensities, peptide counts, unique peptides, iBAQ
`combined/txt/peptides.txt`	Peptide-level quantification with modifications and charge states
`combined/txt/summary.txt`	Per-raw-file QC statistics: identification rates, MS/MS counts
`results_differential.csv`	Differential abundance table: `log2FC`, `pvalue`, `padj`, `significant` per protein
`volcano_plot.pdf`	Volcano plot with up/down-regulated proteins colored and top proteins labeled
`heatmap.pdf`	Hierarchical clustering heatmap of top significant proteins (Z-score normalized)
`enrichment_up/`	gseapy output directory: GO/KEGG enrichment for up-regulated proteins
`enrichment_down/`	gseapy output directory: GO/KEGG enrichment for down-regulated proteins

Troubleshooting

Problem	Cause	Solution
MaxQuant produces 0 protein identifications	Wrong FASTA database or enzyme settings; raw file path not found	Verify `.raw` file paths in `mqpar.xml` are absolute Windows paths; confirm enzyme matches experiment (Trypsin/P vs Trypsin); check `summary.txt` for identification rate
All LFQ intensities are 0 after filtering	`matchBetweenRuns` off + sparse data, or wrong column selection	Check `combined/txt/proteinGroups.txt` directly; use `pg.filter(like="LFQ intensity")` to confirm column names; lower `lfqMinRatioCount` to 1
Too many missing values after log2 transform	Insufficient replicates, inconsistent sample loading, or undetected peptides	Enable `matchBetweenRuns`; verify equal protein loading (Bradford/BCA); consider stricter valid-value filter (require 3/3 per group) before imputation
Memory error loading `proteinGroups.txt`	File is large (>500 MB for DDA with many samples)	Use `pd.read_csv(..., low_memory=False, usecols=[...])` to select only needed columns; or use `pd.read_csv(..., chunksize=...)`
gseapy Enrichr returns empty results	Gene symbols unrecognized or network timeout	Ensure gene list uses HGNC symbols (not UniProt IDs); check internet connectivity; use `gp.enrichr(..., timeout=60)`
Volcano plot: all proteins in "ns"	FDR threshold too stringent or `padj` not calculated	Verify `multipletests` returned valid FDR values; try relaxing `alpha` to 0.1; check sample group assignments are correct
MaxQuant run hangs at "Feature detection"	Low memory (MaxQuant needs 4–8 GB RAM per 3–4 raw files)	Process files in smaller batches; increase system RAM; close other applications
Imputation inflates false positives	Imputing too aggressively (low `downshift`)	Increase `downshift` to 2.0–2.5; alternatively, filter to proteins with ≥ 2 valid values per group before testing

References

MaxQuant software and documentation — official MaxQuant download, manual, and parameter guide
Perseus computational platform — MaxQuant companion statistical analysis tool; basis for the Python workflow above
Cox et al. (2008) Nature Biotechnology — MaxQuant paper — original MaxQuant publication describing the MaxLFQ algorithm
Tyanova et al. (2016) Nature Methods — Perseus paper — Perseus statistical framework reference; all workflow steps above mirror Perseus operations
pyMaxQuant GitHub — Python library for parsing MaxQuant output files
gseapy documentation — gene set enrichment analysis library used in Step 8
STRING API enrichment — protein network enrichment endpoint used in Common Recipes