From trading-operations
Guides order lifecycle management in trading systems: OMS/EMS state machines, FIX protocol connectivity, cancel/replace handling, validation rules, order types, audit trails, troubleshooting, and recovery.
npx claudepluginhub joellewis/finance_skills --plugin trading-operations
This skill uses the workspace's default tool permissions.
Guide the design and implementation of order lifecycle management in trading systems. Covers order states and transitions, FIX protocol message flows, order types and time-in-force instructions, cancel/replace workflows, order validation, and state machine design. Enables building or evaluating order management systems that correctly handle the full lifecycle from order creation through fill, cancellation, or expiration.
11 — Trading Operations (Order Lifecycle & Execution)
both
The order state machine is the central abstraction in any order management system. It defines every state an order can occupy and every valid transition between states. A correctly implemented state machine prevents impossible transitions (such as filling a canceled order), ensures audit trail completeness, and provides the foundation for order status reporting to clients, counterparties, and regulators.
Canonical order states:
Terminal vs. non-terminal states: Terminal states (Filled, Canceled, Replaced, Rejected, Expired) represent the end of an order's lifecycle — no further transitions are possible. Non-terminal states (New, Pending New, Accepted, Partially Filled, Pending Cancel, Pending Replace, Suspended, Done for Day) may transition to other states. The state machine must enforce the invariant that no transition out of a terminal state is ever permitted.
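The terminal-state invariant can be sketched as a small guard. This is a minimal illustration, not a full state machine: the names `OrderState`, `TERMINAL_STATES`, `TerminalStateError`, and `check_can_transition_from` are hypothetical and chosen only for this example.

```python
from enum import Enum


class OrderState(Enum):
    NEW = "New"
    PENDING_NEW = "Pending New"
    ACCEPTED = "Accepted"
    PARTIALLY_FILLED = "Partially Filled"
    FILLED = "Filled"
    PENDING_CANCEL = "Pending Cancel"
    CANCELED = "Canceled"
    PENDING_REPLACE = "Pending Replace"
    REPLACED = "Replaced"
    REJECTED = "Rejected"
    EXPIRED = "Expired"
    SUSPENDED = "Suspended"
    DONE_FOR_DAY = "Done for Day"


# The five terminal states named in the text; every other state is non-terminal.
TERMINAL_STATES = frozenset({
    OrderState.FILLED,
    OrderState.CANCELED,
    OrderState.REPLACED,
    OrderState.REJECTED,
    OrderState.EXPIRED,
})


class TerminalStateError(Exception):
    """Raised when a transition is attempted out of a terminal state."""


def check_can_transition_from(state: OrderState) -> None:
    # Enforce the invariant: nothing ever leaves a terminal state.
    if state in TERMINAL_STATES:
        raise TerminalStateError(f"{state.value} is terminal; no further transitions")
```

Calling the guard before every transition keeps the invariant in one place rather than scattering checks across message handlers.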
Valid state transitions (representative, not exhaustive):
State persistence and recovery: The order state must be persisted durably — typically to a database or write-ahead log — before any acknowledgment is sent to the order originator or any action is taken on the order. If the OMS restarts after a crash, it must be able to reconstruct the current state of every active order from persisted state plus any messages received from execution venues during recovery. This requires idempotent message processing (handling duplicate execution reports without double-counting fills) and state reconciliation with venue order status queries.
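One way to make fill processing idempotent is to remember which execution reports have already been applied. The sketch below assumes the venue's ExecID (FIX tag 17) is unique per execution for the order, and it keeps the seen set in memory; a real OMS would persist it alongside the order state. `OrderFills` and `apply_fill` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class OrderFills:
    cum_qty: int = 0
    # Assumption: ExecID (tag 17) uniquely identifies each execution report.
    seen_exec_ids: set = field(default_factory=set)

    def apply_fill(self, exec_id: str, last_qty: int) -> bool:
        """Apply a fill once; return False if this ExecID was already processed."""
        if exec_id in self.seen_exec_ids:
            return False  # duplicate (e.g. replayed during recovery): do not double-count
        self.seen_exec_ids.add(exec_id)
        self.cum_qty += last_qty
        return True
```

Replaying the same execution report after a restart then leaves the cumulative quantity unchanged.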
Order types define the execution instructions that govern how an order interacts with the market.
Time-in-force (TIF) instructions specify how long an order remains active before it is automatically canceled or expires.
Behavior at market close: DAY orders are canceled. GTC and GTD orders transition to Done for Day and are reactivated the next trading day. IOC and FOK orders, by definition, will have already been filled or canceled before close. MOC and LOC orders execute during the closing auction. The OMS must correctly handle each TIF at the end-of-day transition, including generating appropriate cancel confirmations for expired DAY orders and updating state for multi-day orders.
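The end-of-day rules above can be sketched as a single function. This is an illustration under simplifying assumptions: states and TIF values are plain strings, only working orders are touched, a GTD order is assumed not to have reached its expiry date, and MOC/LOC auction handling is out of scope. Some venues report an unfilled DAY order as Expired rather than Canceled; the sketch follows the wording above. `end_of_day_state` is a hypothetical name.

```python
# States in which an order is still working at the venue (assumption for this sketch).
WORKING_STATES = {"Accepted", "Partially Filled"}


def end_of_day_state(tif: str, state: str) -> str:
    """Return the order state after the end-of-day transition."""
    if state not in WORKING_STATES:
        return state  # terminal or pending orders are left to their own handlers
    if tif == "DAY":
        return "Canceled"  # some venues report this as Expired instead
    if tif in ("GTC", "GTD"):
        return "Done for Day"  # reactivated on the next trading day
    # IOC/FOK should never still be working at the close; surface it loudly.
    raise ValueError(f"unexpected working order with TIF {tif} at close")
```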
Overnight handling: GTC orders that are Done for Day must be resubmitted or reactivated at the venue on the next trading day. Some venues maintain GTC orders natively; others require the OMS to resubmit them each morning. The OMS must track which GTC orders need resubmission and handle the resubmission process as part of the start-of-day workflow.
The Financial Information eXchange (FIX) protocol is the dominant standard for electronic trading communication. Understanding FIX is essential for building or integrating with any execution venue, broker, or counterparty.
FIX message types for order flow:
Key FIX tags:
FIX session vs. application layer: FIX operates on two layers. The session layer handles connection management, heartbeats, sequence number tracking, and message recovery (gap fill and resend requests). The application layer handles business messages (orders, executions, cancels). A robust FIX implementation must handle session-level events correctly: sequence number resets, message gaps, logon/logout negotiation, and heartbeat monitoring. A lost FIX session requires reconnection and sequence number reconciliation before application-level messaging can resume.
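The core of session-layer sequence tracking is a comparison between the expected and received MsgSeqNum (tag 34). The sketch below shows only that decision, following the standard session rules: a higher number signals a gap, and a lower number is acceptable only when PossDupFlag (tag 43) is set. `classify_inbound` and its return labels are hypothetical; a real FIX engine also handles sequence resets and gap fills.

```python
def classify_inbound(expected_seq: int, msg_seq: int, poss_dup: bool = False) -> str:
    """Decide how to treat an inbound message based on its sequence number."""
    if msg_seq == expected_seq:
        return "process"
    if msg_seq > expected_seq:
        # Messages expected_seq..msg_seq-1 are missing: ask the counterparty to resend.
        return "resend_request"
    # Lower than expected: harmless only if flagged as a possible duplicate.
    return "ignore_duplicate" if poss_dup else "disconnect"
```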
FIX versions: FIX 4.2 remains widely deployed and is the baseline for many venues. FIX 4.4 added improvements including better support for multi-leg orders and allocation messaging. FIX 5.0 introduced the FIXT transport layer (separating session and application protocols) and extended the application-layer messages, including those for market data and post-trade processing. When connecting to a new venue, confirm which FIX version and which message extensions (if any) the venue supports.
Cancel and replace workflows are among the most operationally sensitive parts of order lifecycle management. They involve concurrent state changes, race conditions, and the possibility of unexpected outcomes.
Cancel request flow:
Replace (amend) request flow:
Race conditions — cancel vs. fill: The most critical race condition occurs when a cancel request and a fill cross in flight. The OMS sends a cancel request, but before the venue processes it, the order fills (fully or partially). The venue may respond with a fill ExecutionReport followed by an OrderCancelReject (because the order is now filled and cannot be canceled), or with both a fill and a cancel confirmation (if only a partial fill occurred and the remaining quantity was canceled). The OMS must handle all possible message orderings:
Order chaining (ClOrdID to OrigClOrdID linking): Each cancel or replace creates a new link in the order chain. The original order has ClOrdID=A. A replace request references OrigClOrdID=A and assigns ClOrdID=B. A subsequent replace references OrigClOrdID=B and assigns ClOrdID=C. The OMS must maintain this chain to correctly correlate all messages belonging to the same logical order. Breaking the chain — for example, referencing the wrong OrigClOrdID — will cause the venue to reject the request or, worse, cancel or replace the wrong order.
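The A to B to C chain described above can be tracked with a simple index from every ClOrdID to the root of its logical order. This is a minimal in-memory sketch; `OrderChainIndex` and its methods are hypothetical names.

```python
class OrderChainIndex:
    """Maps every ClOrdID in a chain back to the original order's ClOrdID."""

    def __init__(self) -> None:
        self._root_of: dict[str, str] = {}

    def register_new(self, cl_ord_id: str) -> None:
        if cl_ord_id in self._root_of:
            raise ValueError(f"ClOrdID {cl_ord_id} already in use")
        self._root_of[cl_ord_id] = cl_ord_id

    def link(self, orig_cl_ord_id: str, new_cl_ord_id: str) -> None:
        # A cancel/replace must reference a ClOrdID we actually know about.
        if orig_cl_ord_id not in self._root_of:
            raise KeyError(f"unknown OrigClOrdID {orig_cl_ord_id}")
        if new_cl_ord_id in self._root_of:
            raise ValueError(f"ClOrdID {new_cl_ord_id} already in use")
        self._root_of[new_cl_ord_id] = self._root_of[orig_cl_ord_id]

    def root(self, cl_ord_id: str) -> str:
        return self._root_of[cl_ord_id]
```

With this index, an ExecutionReport carrying any ClOrdID in the chain resolves to the same logical order.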
Pending state discipline: While an order is in Pending Cancel or Pending Replace, the OMS should not submit additional cancel or replace requests for the same order. Submitting concurrent cancel/replace requests creates ambiguity about which request the venue is processing and can lead to unexpected outcomes. Queue any new cancel or replace intent until the pending request is resolved.
Order validation is the set of checks performed before an order is submitted to an execution venue. Thorough validation catches errors early, prevents rejections at the venue, and enforces risk management and compliance constraints.
Pre-submission validation (OMS-level):
Exchange-level validation: Even after OMS validation, the exchange performs its own checks: valid symbol for the venue, order type supported by the venue, price within the venue's price band (limit-up/limit-down), quantity within the venue's maximum order size, and participant permissions. Exchange rejections result in a FIX Reject or ExecutionReport with ExecType=Rejected and a reason code.
Reject handling and error codes: When an order is rejected — either by the OMS or by the venue — the rejection reason must be captured, logged, and communicated to the order originator. FIX Tag 103 (OrdRejReason) provides standardized rejection codes: broker/exchange option (0), unknown symbol (1), exchange closed (2), order exceeds limit (3), too late to enter (4), unknown order (5), duplicate order (6), and others. The OMS should map venue-specific rejection codes to actionable error messages for traders and operations staff.
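A lookup table is usually enough to turn OrdRejReason codes into readable messages. The sketch below covers only the codes listed above; `describe_reject` is a hypothetical helper, and the fallback wording for unmapped codes is an assumption.

```python
# OrdRejReason (tag 103) values listed in the text.
ORD_REJ_REASON = {
    0: "Broker / exchange option",
    1: "Unknown symbol",
    2: "Exchange closed",
    3: "Order exceeds limit",
    4: "Too late to enter",
    5: "Unknown order",
    6: "Duplicate order",
}


def describe_reject(code: int, venue_text: str = "") -> str:
    """Build an operator-facing message from the code and optional venue free text."""
    reason = ORD_REJ_REASON.get(code, f"Unmapped reject code {code}")
    return f"{reason}: {venue_text}" if venue_text else reason
```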
Trading systems must support orders that involve multiple legs or contingent execution logic.
Regulatory requirements mandate comprehensive audit trails for all order activity.
Consolidated Audit Trail (CAT): CAT, which replaced FINRA's OATS (Order Audit Trail System), requires broker-dealers and certain other participants to report detailed lifecycle events for every order in NMS securities and listed options. Reportable events include order receipt, order origination, order routing, order modification (cancel/replace), order execution, and order cancellation. CAT requires customer identification at the point of order origination, enabling regulators to trace every order from inception through execution or cancellation, across all venues and intermediaries.
Timestamp precision: CAT requires timestamps with millisecond precision at minimum, and many firms capture microsecond or nanosecond precision for internal analytics and compliance. Clock synchronization across all systems in the order flow is essential — FINRA Rule 4590 requires clocks to be synchronized within specified tolerances (generally one second for manual events, 50 milliseconds for electronic events). Timestamp drift between the OMS, FIX gateway, and execution venues can create audit trail inconsistencies that are difficult to resolve.
Order event logging: Every state transition, every message sent, and every message received must be logged with a timestamp, the message content (or key fields), and the system component that processed the event. The log must be immutable — entries cannot be modified or deleted after creation. This event log forms the basis for regulatory reporting, dispute resolution, and operational forensics.
Reconstruction capability: Regulators may request a complete reconstruction of order activity for a specific time period, security, account, or trader. The audit trail must support reconstruction at any level of granularity: a single order's complete lifecycle, all orders for a security during a trading session, or all orders originated by a specific desk or individual. Reconstruction requires correlating OMS records, FIX message logs, execution venue reports, and clearing/settlement records.
Scenario: A mid-size broker-dealer is building a new order management system to replace a legacy platform. The legacy system had a flat order status field with values like "OPEN," "DONE," and "ERROR" — insufficient for proper lifecycle tracking. The new OMS must implement a rigorous state machine that handles all order types, supports FIX connectivity to multiple execution venues, and satisfies CAT reporting requirements.
Design approach:
The engineering team starts by defining the state enumeration. Drawing from FIX OrdStatus values and operational requirements, they establish 13 states: New, PendingNew, Accepted, PartiallyFilled, Filled, PendingCancel, Canceled, PendingReplace, Replaced, Rejected, Expired, Suspended, and DoneForDay. Each state is categorized as terminal (Filled, Canceled, Replaced, Rejected, Expired) or non-terminal (all others).
The transition table is implemented as an explicit allowlist. Rather than permitting any transition not explicitly forbidden (a dangerous pattern that allows invalid states through omissions), the system defines every permitted transition as a pair (from_state, to_state) with an associated trigger event (typically a FIX ExecType or an internal event). Any transition not in the allowlist is rejected and logged as an error. The transition table contains approximately 25 to 30 valid transitions.
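An allowlist of this kind can be sketched as a dictionary keyed by (from_state, to_state). The entries below are a small representative subset chosen for illustration, not the team's actual table, and the trigger labels are assumptions; `ALLOWED_TRANSITIONS`, `InvalidTransition`, and `apply_transition` are hypothetical names.

```python
import logging

log = logging.getLogger("oms.state")

# Representative subset only: (from_state, to_state) -> trigger event.
ALLOWED_TRANSITIONS = {
    ("New", "PendingNew"): "order sent to venue",
    ("PendingNew", "Accepted"): "ExecType=New",
    ("PendingNew", "Rejected"): "ExecType=Rejected",
    ("Accepted", "PartiallyFilled"): "partial fill",
    ("Accepted", "Filled"): "full fill",
    ("PartiallyFilled", "Filled"): "full fill",
    ("Accepted", "PendingCancel"): "cancel request sent",
    ("PendingCancel", "Canceled"): "ExecType=Canceled",
    ("PendingCancel", "Filled"): "fill crossed the cancel in flight",
}


class InvalidTransition(Exception):
    pass


def apply_transition(current: str, target: str) -> str:
    """Return the new state if the pair is allowlisted; otherwise log and raise."""
    if (current, target) not in ALLOWED_TRANSITIONS:
        log.error("rejected transition %s -> %s", current, target)
        raise InvalidTransition(f"{current} -> {target}")
    return target
```

Because the lookup is the only path to a new state, a transition that was never written down cannot be taken by accident.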
For state persistence, the team selects a write-ahead log (WAL) pattern. Before processing any inbound message (FIX ExecutionReport, cancel acknowledgment, etc.), the system writes the pending state transition to a durable log. If the system crashes mid-transition, the recovery process replays the WAL from the last checkpoint, applying each transition idempotently. Idempotency is achieved by assigning a unique event identifier (based on the FIX message sequence number and session identifier) to each transition and checking for duplicates during replay.
The state machine handles the cancel-vs-fill race condition explicitly. When an order is in PendingCancel and a fill ExecutionReport arrives, the system processes the fill first (transitioning to PartiallyFilled or Filled), then evaluates whether the cancel request is still relevant. If the order is now Filled, the cancel is abandoned and the CancelReject is expected. If the order is PartiallyFilled, the cancel may still succeed for the remaining quantity. The system never drops a fill message — fills are processed with highest priority regardless of pending cancel/replace state.
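The fill-first rule might look like the sketch below. It tracks the outstanding cancel as a separate flag so that a fill can change the order state without losing the fact that a cancel is still in flight. All names are hypothetical, and reverting to "Accepted" after a reject is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_qty: int
    cum_qty: int = 0
    state: str = "PendingCancel"
    cancel_pending: bool = True


def on_fill(order: Order, last_qty: int) -> None:
    # Fills are always applied, whatever is pending.
    order.cum_qty += last_qty
    if order.cum_qty >= order.order_qty:
        order.state = "Filled"
        order.cancel_pending = False  # cancel is moot; a CancelReject is expected
    else:
        order.state = "PartiallyFilled"  # cancel may still succeed for the remainder


def on_cancel_ack(order: Order) -> None:
    if order.state == "Filled":
        raise ValueError("cancel confirmation for a fully filled order")
    order.state = "Canceled"
    order.cancel_pending = False


def on_cancel_reject(order: Order) -> None:
    order.cancel_pending = False
    if order.state != "Filled":
        order.state = "PartiallyFilled" if order.cum_qty else "Accepted"
```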
For CAT compliance, every state transition generates an audit event record containing: the order identifier (ClOrdID and OrderID), the previous state, the new state, the trigger event (FIX message type and key fields), the timestamp (microsecond precision, synchronized per FINRA Rule 4590), and the system component that processed the transition. These events are written to an append-only audit log and are the source data for CAT reporting.
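The audit record described above maps naturally onto an immutable data class. In this sketch an in-memory list stands in for the durable append-only store, and the timestamp comes from the local wall clock, which gives microsecond resolution but says nothing about clock synchronization. `AuditEvent` and `AuditLog` are hypothetical names.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditEvent:
    cl_ord_id: str
    order_id: str
    prev_state: str
    new_state: str
    trigger: str
    component: str
    ts_micros: int


class AuditLog:
    """Append-only: exposes append and read, never update or delete."""

    def __init__(self) -> None:
        self._events: list[AuditEvent] = []

    def append(self, cl_ord_id, order_id, prev_state, new_state, trigger, component):
        event = AuditEvent(cl_ord_id, order_id, prev_state, new_state,
                           trigger, component, time.time_ns() // 1_000)
        self._events.append(event)
        return event

    @property
    def events(self) -> tuple:
        return tuple(self._events)  # callers get a read-only snapshot
```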
Analysis:
The explicit-allowlist approach for state transitions is preferred over a denylist because it fails safely — a missing transition results in a rejected event (which is logged and investigated) rather than a silently accepted invalid transition. The WAL pattern ensures no state changes are lost during crashes, and idempotent replay handles the case where a message was partially processed before the crash. The cancel-vs-fill race handling prioritizes fill processing because fills represent irrevocable financial events — a fill that is dropped or delayed can cause position discrepancies, P&L errors, and regulatory issues.
Scenario: A proprietary trading desk is experiencing issues with its cancel/replace workflow. Traders frequently amend limit order prices as the market moves, but the current system occasionally produces inconsistent states: orders that appear canceled but have unrecognized fills, or replace requests that reference stale ClOrdIDs and are rejected by the venue. The desk needs a redesigned cancel/replace implementation that correctly handles all race conditions.
Design approach:
The root cause analysis reveals three problems. First, the system is not maintaining the ClOrdID chain correctly — when a replace request is submitted, the system updates the order's ClOrdID immediately rather than waiting for the venue's confirmation. If the replace is rejected, the order's ClOrdID no longer matches what the venue has on record, and subsequent requests fail. Second, the system permits concurrent cancel/replace requests — a trader can submit a price amendment while a previous amendment is still pending, creating ambiguity at the venue. Third, fill messages arriving during a Pending Replace state are being deferred rather than processed immediately, causing position tracking to lag.
The redesign addresses each problem:
For ClOrdID management, the system maintains two identifiers per order: the "active ClOrdID" (the ClOrdID currently acknowledged by the venue) and the "pending ClOrdID" (the ClOrdID of an outstanding cancel/replace request, if any). The active ClOrdID is updated only when the venue confirms the replace (ExecutionReport with ExecType=Replaced). If the replace is rejected (OrderCancelReject), the pending ClOrdID is discarded and the active ClOrdID remains unchanged. All new requests to the venue reference the active ClOrdID as the OrigClOrdID.
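The two-identifier pattern can be sketched as a small tracker. The class and method names are hypothetical; the behavior follows the description above, with the active ClOrdID promoted only on venue confirmation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClOrdIdTracker:
    active: str                    # ClOrdID the venue currently acknowledges
    pending: Optional[str] = None  # ClOrdID of the outstanding request, if any

    def begin_request(self, new_cl_ord_id: str) -> tuple[str, str]:
        """Return (OrigClOrdID, ClOrdID) for the outgoing cancel/replace."""
        if self.pending is not None:
            raise RuntimeError("a cancel/replace request is already outstanding")
        self.pending = new_cl_ord_id
        return self.active, new_cl_ord_id

    def on_replaced(self) -> None:
        # Venue confirmed the replace: the pending ID becomes the active one.
        if self.pending is None:
            raise RuntimeError("no outstanding request to confirm")
        self.active, self.pending = self.pending, None

    def on_cancel_reject(self) -> None:
        # Venue rejected the request: discard the pending ID, keep the active one.
        self.pending = None
```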
For concurrency control, the system enforces a strict one-pending-request rule. While a cancel or replace request is outstanding (order is in PendingCancel or PendingReplace), new cancel/replace requests from the trader are queued internally. When the pending request is resolved (confirmed or rejected), the system dequeues the next request (if any) and submits it. If the queued request conflicts with the resolution (e.g., the trader queued a price change to $50.10 but the order filled while the previous request was pending), the queued request is discarded and the trader is notified.
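A sketch of the one-pending-request rule follows. Requests are opaque values here, `sent` stands in for the FIX gateway, and "conflict" is reduced to the single case where the order is no longer open after the resolution. `AmendGate` is a hypothetical name.

```python
from collections import deque


class AmendGate:
    """Allows at most one cancel/replace request in flight per order."""

    def __init__(self) -> None:
        self.in_flight = None
        self.queued = deque()
        self.sent = []  # stand-in for messages handed to the FIX gateway

    def _send(self, request) -> None:
        self.in_flight = request
        self.sent.append(request)

    def submit(self, request) -> None:
        if self.in_flight is None:
            self._send(request)
        else:
            self.queued.append(request)  # hold until the pending request resolves

    def on_resolved(self, order_still_open: bool) -> list:
        """Call on confirm or reject; returns queued requests that were discarded."""
        self.in_flight = None
        if not order_still_open:
            discarded = list(self.queued)
            self.queued.clear()
            return discarded  # caller notifies the trader about these
        if self.queued:
            self._send(self.queued.popleft())
        return []
```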
For fill processing during pending states, the system processes fill ExecutionReports immediately regardless of pending cancel/replace status. If the order is in PendingReplace and a fill arrives, the fill is applied (cumulative quantity and average price are updated, the order may transition to PartiallyFilled or Filled). If the fill completes the order (Filled), the pending replace is moot. If the fill is partial, the pending replace may still succeed, but the OMS recalculates whether the replace request's quantity is still valid (the new quantity must be greater than or equal to the cumulative filled quantity; otherwise the venue will reject the replace).
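The two calculations in this step, the running average price and the replace-quantity check, can be sketched as pure functions. `Decimal` is used to avoid floating-point drift in prices; the function names are hypothetical.

```python
from decimal import Decimal


def apply_fill(cum_qty: int, avg_px: Decimal, last_qty: int, last_px: Decimal):
    """Return (new_cum_qty, new_avg_px) after applying one fill."""
    new_cum = cum_qty + last_qty
    # Quantity-weighted average of everything filled so far.
    new_avg = (avg_px * cum_qty + last_px * last_qty) / new_cum
    return new_cum, new_avg


def replace_qty_still_valid(requested_order_qty: int, cum_qty: int) -> bool:
    # The replaced order quantity must cover what has already been filled.
    return requested_order_qty >= cum_qty
```

For example, after fills of 100 at 10 and 100 at 11, a pending replace down to 150 shares would no longer be valid because 200 shares have already executed.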
Analysis:
The two-identifier pattern (active ClOrdID and pending ClOrdID) eliminates the stale-reference problem because the system always knows which ClOrdID the venue considers current. The one-pending-request rule eliminates venue-side ambiguity and simplifies the OMS state machine. The immediate fill processing during pending states ensures that position tracking is always current, even when cancel/replace messages are in flight. Together, these patterns handle the fundamental race condition of cancel/replace workflows: the unavoidable latency window between sending a request and receiving the venue's response, during which fills and other events may occur.
Scenario: A buy-side firm is establishing FIX connectivity to a new electronic communication network (ECN) to access additional liquidity for its equity trading strategies. The firm's existing OMS supports FIX 4.2 connections to two other venues. The new ECN supports FIX 4.4 and has specific requirements for message formatting, session management, and order handling.
Design approach:
The implementation proceeds through four phases: session certification, application message mapping, exception handling, and production cutover.
Session certification: The ECN provides a certification (test) environment with a FIX acceptor endpoint. The firm's FIX engine (the initiator) must establish a session by negotiating protocol version, sender and target CompIDs, heartbeat interval, and sequence number handling. The certification process validates that the FIX engine correctly handles: logon and logout sequences, heartbeat exchange (including detection of missed heartbeats and test request/heartbeat recovery), sequence number synchronization (including gap detection and resend requests), and message-level rejection (MsgType=3, Reject) for malformed messages. Session certification typically takes one to two weeks and requires multiple rounds of testing. Common session-level issues include: incorrect CompID configuration, heartbeat interval mismatch (the ECN expects 30 seconds; the firm's engine is configured for 60), and sequence number reset policy disagreements (some venues require a daily sequence number reset at a specific time; others maintain continuous sequence numbers).
Application message mapping: FIX 4.4 introduces fields and message structures not present in FIX 4.2. The firm must map its internal order representation to the ECN's specific FIX 4.4 requirements. Key mapping considerations include: the ECN may require specific values in Tag 1 (Account) that differ from the firm's internal account identifiers; the ECN may support order types or time-in-force values that the firm's other venues do not (or may not support order types that the firm uses elsewhere); the ECN may use custom tags (user-defined tags in the 5000+ range) for venue-specific features such as order routing preferences or self-trade prevention instructions; and execution report processing must handle the ECN's specific usage of ExecType and OrdStatus, which may differ subtly from other venues. For example, FIX 4.2 venues report fills with ExecType=1 (Partial Fill) or ExecType=2 (Fill), whereas FIX 4.4 replaces both with the single value ExecType=F (Trade), introduced in FIX 4.3, and leaves the partial-versus-full distinction to OrdStatus.
Exception handling: The connection must handle operational exceptions gracefully. Network disconnections require automatic reconnection with exponential backoff. During disconnection, the OMS must track which orders are "in flight" at the ECN — orders that were submitted but whose status is unknown due to the disconnection. Upon reconnection, after sequence number synchronization and gap fill processing, the OMS sends OrderStatusRequest messages for all in-flight orders to reconcile OMS state with venue state. Venue-side order cancellation (the ECN cancels orders unilaterally during a system event or end-of-day) must be detected and processed — the OMS cannot assume that an order remains active at the venue just because no cancel confirmation was received. Drop copy connections (a secondary FIX session that receives copies of all ExecutionReports) provide redundancy: if the primary session drops a message, the drop copy catches it.
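The reconnection schedule can be sketched as a generator of capped, doubling delays. The base, cap, and attempt count below are illustrative defaults rather than recommended values, and production code often adds random jitter so that many sessions do not reconnect in lockstep. `backoff_delays` is a hypothetical name.

```python
from typing import Iterator


def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 6) -> Iterator[float]:
    """Yield the wait (in seconds) before each reconnection attempt."""
    for attempt in range(attempts):
        # Double the delay each time, but never exceed the cap.
        yield min(cap, base * 2 ** attempt)
```

A reconnect loop would sleep for each yielded delay, try to log on, and stop iterating as soon as the session is re-established.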
Production cutover: Before going live, the firm conducts a parallel run: orders are submitted to the new ECN while the same orders are priced (but not executed) against the existing venues to compare execution quality. The cutover plan includes: a rollback procedure (ability to stop routing to the new ECN and revert to existing venues within minutes), monitoring dashboards that track rejection rates, fill rates, and latency in real time during the first days of live trading, and an escalation path to the ECN's market operations desk for production support.
Analysis:
FIX connectivity projects are deceptively complex. The protocol standard is well-defined, but each venue interprets and extends it differently. The certification process is essential for discovering venue-specific behaviors before they cause production incidents. The most common production issues with new FIX connections are: sequence number desynchronization after an unclean disconnect (requiring manual intervention to agree on a reset point), message format differences that pass certification but cause sporadic rejections under production load (e.g., a field that the ECN expects only for certain order types), and latency spikes during high-volume periods that trigger heartbeat timeouts and session disconnects. Robust monitoring and automated reconnection logic are as important as correct message formatting.