Fits hidden Markov models to sequential observations using the Baum-Welch EM algorithm, Viterbi decoding, and model selection via information criteria. Use it for regime detection in sequential data such as markets, speech, or biological signals.
Install: npx claudepluginhub pjt222/agent-almanac

This skill uses the workspace's default tool permissions.
---
Fit a hidden Markov model (HMM) to sequential observation data using the Baum-Welch expectation-maximization algorithm, decode the most likely hidden state sequence via Viterbi, and select the optimal number of hidden states through information criteria.
| Input | Type | Description |
|---|---|---|
| observations | sequence/matrix | Observed data sequence (univariate or multivariate) |
| n_hidden_states | integer | Number of hidden states to fit (or a range for model selection) |
| emission_type | string | Distribution family for emissions: "gaussian", "discrete", "poisson", "multinomial" |
| Input | Type | Default | Description |
|---|---|---|---|
| initial_params | dict | random/heuristic | Initial transition matrix, emission parameters, and start probabilities |
| n_restarts | integer | 10 | Number of random restarts to mitigate local optima |
| max_iterations | integer | 500 | Maximum EM iterations per restart |
| convergence_tol | float | 1e-6 | Log-likelihood convergence threshold for EM |
| state_range | list of ints | [n_hidden_states] | Range of state counts for model selection |
| covariance_type | string | "full" | For Gaussian emissions: "full", "diagonal", "spherical" |
| regularization | float | 1e-6 | Small constant added to the diagonal of covariance matrices to prevent singularity |
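As a worked illustration of how these inputs fit together, the sketch below assembles them into a single configuration dictionary. The `fit_hmm` entry point and the synthetic data are hypothetical; only the key names mirror the tables above.

```python
import numpy as np

# Synthetic univariate sequence standing in for real observations (T x d matrix).
observations = np.random.default_rng(0).normal(size=(500, 1))

# Hypothetical configuration; keys mirror the required and optional inputs above.
config = {
    "observations": observations,
    "n_hidden_states": 3,          # or rely on state_range for model selection
    "emission_type": "gaussian",
    "n_restarts": 10,
    "max_iterations": 500,
    "convergence_tol": 1e-6,
    "state_range": [2, 3, 4, 5],
    "covariance_type": "full",
    "regularization": 1e-6,
}
# fit_hmm(**config)  # hypothetical entry point
```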
1.1. Specify the number of hidden states K (or a candidate range for model selection in Step 5).
1.2. Choose the emission distribution family based on the data type: "gaussian" for continuous measurements, "discrete" or "multinomial" for categorical symbols, "poisson" for count data.
1.3. Define the model components:
- Transition matrix A of size K x K: A[i,j] = P(z_t = j | z_{t-1} = i)
- Emission parameters theta_k for each state k: distribution-specific (e.g., mean and covariance for Gaussian)
- Initial state distribution pi: pi[k] = P(z_1 = k)
1.4. Verify observation data is properly formatted: no missing values in the sequence, consistent dimensionality, and sufficient length relative to the number of parameters.
Expected: A clearly specified HMM architecture with K states, a chosen emission family, and clean observation data of length T >> K^2.
On failure: If data contains missing values, impute or remove affected segments. If T is too small relative to K, reduce K or acquire more data.
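A minimal sketch of this specification step for Gaussian emissions, holding the model in plain numpy arrays; the `HMMSpec` and `validate_spec` names are illustrative, not a fixed API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMMSpec:
    A: np.ndarray      # K x K transition matrix, A[i, j] = P(z_t = j | z_{t-1} = i)
    pi: np.ndarray     # K initial state probabilities, pi[k] = P(z_1 = k)
    means: np.ndarray  # K x d Gaussian emission means
    covs: np.ndarray   # K x d x d Gaussian emission covariances

def validate_spec(spec: HMMSpec, obs: np.ndarray) -> None:
    K = spec.A.shape[0]
    T, d = obs.shape
    assert spec.means.shape == (K, d), "emission means must be K x d"
    assert np.allclose(spec.A.sum(axis=1), 1.0), "rows of A must sum to 1"
    assert np.allclose(spec.pi.sum(), 1.0), "pi must sum to 1"
    assert not np.isnan(obs).any(), "observations contain missing values"
    assert T > K * K, "sequence too short relative to the number of parameters"
```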
2.1. Generate initial parameters (transition matrix, emission parameters, and start probabilities) for each of the n_restarts restarts.
2.2. For the first restart, use the K-means-informed initialization (generally the strongest start). For subsequent restarts, use random perturbations.
2.3. Verify all initial parameters are valid: transition matrix rows and the start distribution sum to 1, and covariance matrices are symmetric positive definite.
Expected: n_restarts sets of valid initial parameters, with at least one data-driven initialization.
On failure: If K-means fails to converge, use purely random initialization with more restarts. If covariance matrices are singular, add the regularization constant to the diagonal.
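A sketch of this initialization scheme for Gaussian emissions, assuming scikit-learn is available for the K-means start; the helper name and the exact perturbation scheme for later restarts are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_params(obs, K, restart, regularization=1e-6, seed=0):
    rng = np.random.default_rng(seed + restart)
    T, d = obs.shape
    # Row-stochastic transition matrix and start distribution from a Dirichlet.
    A = rng.dirichlet(np.ones(K), size=K)
    pi = rng.dirichlet(np.ones(K))
    if restart == 0:
        # Data-driven start: cluster observations and use per-cluster moments.
        labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(obs)
        means = np.stack([obs[labels == k].mean(axis=0) for k in range(K)])
        covs = np.stack([np.cov(obs[labels == k].T).reshape(d, d) for k in range(K)])
    else:
        # Random perturbation of the global mean, global covariance for every state.
        means = obs.mean(axis=0) + rng.normal(scale=obs.std(axis=0), size=(K, d))
        covs = np.stack([np.cov(obs.T).reshape(d, d) for _ in range(K)])
    covs += regularization * np.eye(d)   # guard against singular covariances
    return A, pi, means, covs
```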
3.1. E-step (Forward-Backward algorithm):
- Compute forward variables alpha[t,k] = P(o_1,...,o_t, z_t=k | model) using the recursion:
  - alpha[1,k] = pi[k] * b_k(o_1)
  - alpha[t,k] = sum_j(alpha[t-1,j] * A[j,k]) * b_k(o_t)
- Compute backward variables beta[t,k] = P(o_{t+1},...,o_T | z_t=k, model):
  - beta[T,k] = 1
  - beta[t,k] = sum_j(A[k,j] * b_j(o_{t+1}) * beta[t+1,j])
- Compute state posteriors gamma[t,k] = P(z_t=k | O, model):
  - gamma[t,k] = alpha[t,k] * beta[t,k] / P(O | model)
- Compute pairwise state posteriors xi[t,i,j] = P(z_t=i, z_{t+1}=j | O, model).
3.2. M-step (Parameter re-estimation):
- Transition matrix: A[i,j] = sum_t(xi[t,i,j]) / sum_t(gamma[t,i])
- Gaussian emission means: mu_k = sum_t(gamma[t,k] * o_t) / sum_t(gamma[t,k])
- Discrete emission probabilities: b_k(v) = sum_t(gamma[t,k] * I(o_t=v)) / sum_t(gamma[t,k])
- Start probabilities: pi[k] = gamma[1,k]
3.3. Compute log-likelihood: log P(O | model) = log sum_k(alpha[T,k]). Use the log-sum-exp trick to prevent underflow.
3.4. Scaling: Use scaled forward-backward variables to prevent numerical underflow for long sequences. Normalize alpha at each time step and accumulate log scaling factors.
3.5. Repeat E-step and M-step until log-likelihood change is below convergence_tol or max_iterations is reached.
3.6. Across all restarts, keep the parameter set with the highest final log-likelihood.
Expected: Monotonically non-decreasing log-likelihood across iterations, converging within max_iterations. Final parameters are valid (stochastic matrices, positive-definite covariances).
On failure: If log-likelihood decreases, there is a bug in the E-step or M-step -- verify formulas. If convergence is very slow, try better initialization or increase max_iterations. If covariance becomes singular, increase regularization.
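A compact sketch of the scaled forward pass from 3.1 and 3.4, the piece of Baum-Welch most prone to underflow; `B[t, k] = b_k(o_t)` is assumed to be a precomputed matrix of emission likelihoods.

```python
import numpy as np

def forward_loglik(pi, A, B):
    T, K = B.shape
    alpha = np.zeros((T, K))
    log_lik = 0.0
    alpha[0] = pi * B[0]
    c = alpha[0].sum()                 # scaling factor for t = 1
    alpha[0] /= c
    log_lik += np.log(c)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]   # sum_j alpha[t-1,j] * A[j,k] * b_k(o_t)
        c = alpha[t].sum()
        alpha[t] /= c                  # normalize and accumulate the log scale factor
        log_lik += np.log(c)
    return log_lik, alpha              # scaled alpha rows are P(z_t = k | o_1..o_t)
```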
4.1. Initialize Viterbi variables:
- delta[1,k] = log(pi[k]) + log(b_k(o_1))
- psi[1,k] = 0 (no predecessor)
4.2. Recurse forward for t = 2,...,T:
- delta[t,k] = max_j(delta[t-1,j] + log(A[j,k])) + log(b_k(o_t))
- psi[t,k] = argmax_j(delta[t-1,j] + log(A[j,k]))
4.3. Terminate:
- z*_T = argmax_k(delta[T,k])
- Best-path log-probability: max_k(delta[T,k])
4.4. Backtrace for t = T-1,...,1:
- z*_t = psi[t+1, z*_{t+1}]
4.5. Output the decoded state sequence z* = (z*_1, ..., z*_T) and its log-probability.
4.6. Compare the Viterbi path probability to the total sequence probability from the forward algorithm to assess how dominant the best path is.
Expected: A single most-likely state sequence of length T with each entry in {1,...,K}. The Viterbi log-probability should be less than or equal to the total log-likelihood.
On failure: If the Viterbi path has log-probability of negative infinity, some transition or emission probability is zero where it should not be. Add floor values to prevent log(0).
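A log-space Viterbi sketch following 4.1-4.4; `logB[t, k] = log b_k(o_t)` is assumed precomputed, and a small floor guards against log(0) as the failure note suggests.

```python
import numpy as np

def viterbi(pi, A, logB, floor=1e-300):
    T, K = logB.shape
    log_pi = np.log(np.maximum(pi, floor))
    log_A = np.log(np.maximum(A, floor))
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = log_pi + logB[0]                      # initialization (4.1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A       # scores[j, k] over predecessors j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[t]      # recursion (4.2)
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                    # termination (4.3)
    for t in range(T - 2, -1, -1):                   # backtrace (4.4)
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()                     # states and log P of best path
```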
5.1. For each candidate number of hidden states K in state_range, fit the full HMM (Steps 2-4).
5.2. Compute the number of free parameters p:
- Transition matrix: K * (K - 1) (each row is a simplex)
- Gaussian emissions with full covariance in d dimensions: K * (d + d*(d+1)/2)
- Initial state distribution: K - 1
5.3. Compute information criteria:
- BIC = -2 * log_likelihood + p * log(T)
- AIC = -2 * log_likelihood + 2 * p
- AICc = AIC + 2*p*(p+1) / (T - p - 1) (small-sample correction)
5.4. Select the model with the lowest BIC (preferred for consistency) or AIC (preferred for prediction). Report both.
5.5. Tabulate results: for each K, show log-likelihood, number of parameters, BIC, AIC, and convergence status.
5.6. If the optimal K is at the boundary of state_range, extend the range and re-fit.
Expected: A clear minimum in BIC/AIC identifying the optimal number of hidden states. The selected model should have converged and have interpretable state meanings.
On failure: If no clear minimum exists (monotonically decreasing BIC), the model may be misspecified -- consider a different emission family. If all models have poor log-likelihood, the data may not follow an HMM structure.
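A sketch of the parameter count and information criteria from 5.2-5.3 for Gaussian emissions with full covariances; the function names are illustrative.

```python
import numpy as np

def n_free_params(K, d):
    transitions = K * (K - 1)                 # each row of A is a simplex
    emissions = K * (d + d * (d + 1) // 2)    # mean + full covariance per state
    initial = K - 1                           # start distribution is a simplex
    return transitions + emissions + initial

def information_criteria(log_lik, K, d, T):
    p = n_free_params(K, d)
    bic = -2 * log_lik + p * np.log(T)
    aic = -2 * log_lik + 2 * p
    aicc = aic + 2 * p * (p + 1) / (T - p - 1)  # small-sample correction
    return {"p": p, "BIC": bic, "AIC": aic, "AICc": aicc}
```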
6.1. Split data into training and validation sets (e.g., 80/20 or use multiple sequences if available).
6.2. Fit the model on training data. Compute log-likelihood on held-out data using the forward algorithm (do not re-fit parameters).
6.3. Posterior decoding (alternative to Viterbi):
- z^_t = argmax_k(gamma[t,k])
6.4. Compare Viterbi and posterior decoding:
- Report the fraction of time steps on which the two decoded sequences agree; substantial disagreement indicates uncertain state assignments.
6.5. Assess state interpretability:
- Check that emission parameters differ clearly between states and that state persistence values (the diagonal of A) are reasonable.
6.6. Compute held-out log-likelihood per observation and compare across model orders to confirm the training-set model selection.
Expected: Held-out log-likelihood is reasonably close to training log-likelihood (no severe overfitting). Viterbi and posterior decoding agree on 90%+ of time steps. States have distinct, interpretable emission distributions.
On failure: If held-out likelihood is much worse than training, the model is overfitting -- reduce K or increase regularization. If states are not interpretable, try different initializations or a different emission family.
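A short sketch of posterior decoding and its agreement with the Viterbi path (6.3-6.4), assuming `gamma` is the T x K matrix of smoothed state posteriors from the forward-backward pass.

```python
import numpy as np

def posterior_decode(gamma):
    # gamma[t, k] = P(z_t = k | O, model); pick the most probable state per time step.
    return gamma.argmax(axis=1)

def decoding_agreement(viterbi_path, gamma):
    posterior_path = posterior_decode(gamma)
    return float(np.mean(posterior_path == viterbi_path))  # fraction of matching steps
```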
As a consistency check, the total sequence probability can be computed from either pass: P(O) = sum_k(alpha[T,k]) = sum_k(pi[k] * b_k(o_1) * beta[1,k]). Each additional hidden state adds on the order of O(K + d^2) parameters, so use BIC (not just likelihood) for model selection and validate on held-out data.
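The forward/backward identity above can be verified numerically on a toy example; the probabilities below are made up purely to exercise the check.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],     # B[t, k] = b_k(o_t) for a 3-step sequence
              [0.4, 0.6],
              [0.2, 0.8]])
T, K = B.shape

alpha = np.zeros((T, K))      # unscaled forward pass
alpha[0] = pi * B[0]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[t]

beta = np.ones((T, K))        # backward pass
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[t + 1] * beta[t + 1])

# Both expressions give P(O); they must agree.
assert np.isclose(alpha[-1].sum(), (pi * B[0] * beta[0]).sum())
```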