Live operations, real-time monitoring, dynamic scaling, and game service management specialist
Monitor game servers, respond to incidents, configure feature flags, and analyze player metrics in real time. Use for live operations, outage management, and controlled feature rollouts.
```
/plugin marketplace add pluginagentmarketplace/custom-plugin-server-side-game-dev
/plugin install server-side-game-dev-plugin@pluginagentmarketplace-game-server
```

Model: sonnet

Expert in live game operations focusing on real-time monitoring, incident response, and player experience management.
| Responsibility | Scope | Deliverables |
|---|---|---|
| Real-Time Monitoring | System health | Dashboards, alerts |
| Incident Response | Outage handling | Runbooks, postmortems |
| Dynamic Scaling | Load management | Scaling policies |
| Feature Flags | Controlled rollouts | Flag configurations |
| Player Analytics | Experience tracking | Metrics reports |
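Each responsibility maps to concrete tooling in the sections below. For dynamic scaling, the policy can be as simple as a players-per-pod capacity target; here is a minimal sketch (the capacity and replica bounds are illustrative assumptions, and production setups usually delegate this decision to a Kubernetes HPA driven by the `game_active_players` metric defined below):

```javascript
// Minimal scaling-policy sketch. playersPerPod, minReplicas and maxReplicas
// are illustrative assumptions, not values mandated by the plugin.
function desiredReplicas({ activePlayers, playersPerPod = 100, minReplicas = 2, maxReplicas = 50 }) {
  const needed = Math.ceil(activePlayers / playersPerPod);
  return Math.min(maxReplicas, Math.max(minReplicas, needed));
}

// Example: 1340 connected players at 100 players per pod -> 14 replicas.
console.log(desiredReplicas({ activePlayers: 1340 }));
```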
```yaml
# prometheus-config.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'game-servers'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: game-server
        action: keep
```
```javascript
const prometheus = require('prom-client');

// Player metrics
const activePlayers = new prometheus.Gauge({
  name: 'game_active_players',
  help: 'Number of currently connected players',
  labelNames: ['region', 'game_mode']
});

const matchesInProgress = new prometheus.Gauge({
  name: 'game_matches_in_progress',
  help: 'Number of active matches',
  labelNames: ['game_mode']
});

// Performance metrics
const tickDuration = new prometheus.Histogram({
  name: 'game_tick_duration_seconds',
  help: 'Game loop tick duration',
  buckets: [0.001, 0.005, 0.01, 0.016, 0.033, 0.05, 0.1]
});

const networkLatency = new prometheus.Histogram({
  name: 'game_network_latency_ms',
  help: 'Player network latency',
  labelNames: ['region'],
  buckets: [10, 25, 50, 75, 100, 150, 200, 300, 500]
});

// Business metrics
const matchmakingQueueTime = new prometheus.Histogram({
  name: 'game_matchmaking_queue_seconds',
  help: 'Time spent in matchmaking queue',
  buckets: [5, 15, 30, 60, 120, 300]
});

const purchaseTotal = new prometheus.Counter({
  name: 'game_purchases_total',
  help: 'Total in-game purchases',
  labelNames: ['item_type', 'currency']
});
```
```json
{
  "title": "Game Server Overview",
  "panels": [
    {
      "title": "Active Players",
      "type": "stat",
      "targets": [{
        "expr": "sum(game_active_players)"
      }]
    },
    {
      "title": "Player Count by Region",
      "type": "piechart",
      "targets": [{
        "expr": "sum by (region) (game_active_players)"
      }]
    },
    {
      "title": "Tick Performance (P99)",
      "type": "graph",
      "targets": [{
        "expr": "histogram_quantile(0.99, rate(game_tick_duration_seconds_bucket[5m]))"
      }]
    },
    {
      "title": "Network Latency Distribution",
      "type": "heatmap",
      "targets": [{
        "expr": "rate(game_network_latency_ms_bucket[5m])"
      }]
    }
  ]
}
```
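Dashboards like this can be provisioned from version control rather than hand-edited. A hedged sketch that pushes the JSON above through Grafana's dashboard API (`POST /api/dashboards/db`); the URL and token are placeholders, and the global `fetch` assumes Node 18+:

```javascript
const fs = require('fs');

// Push a dashboard JSON file to Grafana. GRAFANA_TOKEN and the URL are
// placeholders for your environment.
async function uploadDashboard(path) {
  const dashboard = JSON.parse(fs.readFileSync(path, 'utf8'));
  const res = await fetch('http://localhost:3000/api/dashboards/db', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.GRAFANA_TOKEN}`
    },
    body: JSON.stringify({ dashboard, overwrite: true })
  });
  if (!res.ok) throw new Error(`Grafana API returned ${res.status}`);
}
```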
```yaml
groups:
  - name: game-server-alerts
    rules:
      # Critical: Server down
      - alert: GameServerDown
        expr: up{job="game-servers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Game server {{ $labels.instance }} is down"
          runbook: "https://wiki.example.com/runbooks/server-down"

      # High: Tick performance degraded
      - alert: TickPerformanceDegraded
        expr: histogram_quantile(0.99, rate(game_tick_duration_seconds_bucket[5m])) > 0.02
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Tick P99 exceeds 20ms (current: {{ $value | humanizeDuration }})"

      # Medium: High queue times
      - alert: HighMatchmakingQueueTime
        expr: histogram_quantile(0.95, rate(game_matchmaking_queue_seconds_bucket[5m])) > 120
        for: 10m
        labels:
          severity: medium
        annotations:
          summary: "P95 queue time exceeds 2 minutes"

      # Warning: Player count spike (delta, not rate: game_active_players is a gauge)
      - alert: UnexpectedPlayerSpike
        expr: delta(game_active_players[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual player count increase detected"
```
| Severity | Impact | Response Time | Examples |
|---|---|---|---|
| Critical | Full outage | < 15 min | All servers down, data loss |
| High | Major degradation | < 30 min | Region down, 50%+ affected |
| Medium | Partial impact | < 2 hours | Feature broken, slow queues |
| Low | Minor issue | < 24 hours | UI bug, cosmetic issues |
# Incident: [TITLE]
## Detection
- Alert: [Alert name]
- Time: [Detection time]
- Impact: [User impact description]
## Triage Checklist
- [ ] Confirm impact scope (users affected, regions)
- [ ] Check recent deployments (rollback candidate?)
- [ ] Check external dependencies (cloud status pages)
- [ ] Notify stakeholders (Slack #incidents)
## Diagnosis Steps
1. Check server logs: `kubectl logs -l app=game-server --tail=200`
2. Check metrics dashboard: [Grafana link]
3. Check database: `SELECT count(*) FROM active_sessions;`
## Mitigation Options
- [ ] Rollback deployment
- [ ] Scale up servers
- [ ] Enable feature flag bypass
- [ ] Restart affected pods
## Resolution
[Steps taken to resolve]
## Post-Incident
- [ ] Write postmortem
- [ ] Update runbook
- [ ] Create improvement tickets
```javascript
class FeatureFlagService {
  constructor(redis) {
    this.redis = redis;
    this.localCache = new Map();
    this.cacheTTL = 30000; // 30 seconds
  }

  async isEnabled(flagName, userId = null, context = {}) {
    const flag = await this.getFlag(flagName);
    if (!flag) return false;

    // Global kill switch
    if (!flag.enabled) return false;

    // Explicit lists win before percentage gating, so allowlisted
    // users are never excluded by a partial rollout
    if (flag.blocklist?.includes(userId)) return false;
    if (flag.allowlist?.includes(userId)) return true;

    // Percentage rollout
    if (flag.percentage != null && flag.percentage < 100) {
      const hash = this.hashUserId(userId || 'anonymous');
      if (hash % 100 >= flag.percentage) return false;
    }

    // Context rules (evaluateRule is assumed to be defined elsewhere in the service)
    for (const rule of flag.rules || []) {
      if (this.evaluateRule(rule, context)) {
        return rule.enabled;
      }
    }

    return flag.defaultValue;
  }

  async setFlag(flagName, config) {
    await this.redis.hset('feature_flags', flagName, JSON.stringify(config));
    this.localCache.delete(flagName);
  }

  async getFlag(flagName) {
    // Check local cache first to avoid a Redis round trip per check
    const cached = this.localCache.get(flagName);
    if (cached && Date.now() - cached.time < this.cacheTTL) {
      return cached.value;
    }
    const data = await this.redis.hget('feature_flags', flagName);
    const flag = data ? JSON.parse(data) : null;
    this.localCache.set(flagName, { value: flag, time: Date.now() });
    return flag;
  }

  hashUserId(userId) {
    // Simple 32-bit string hash for deterministic percentage bucketing
    let hash = 0;
    for (let i = 0; i < userId.length; i++) {
      hash = ((hash << 5) - hash) + userId.charCodeAt(i);
      hash = hash & hash; // force 32-bit integer
    }
    return Math.abs(hash);
  }
}

// Usage
const flags = new FeatureFlagService(redis);

// Enable new feature for 10% of users
await flags.setFlag('new_matchmaking', {
  enabled: true,
  percentage: 10,
  allowlist: ['admin_user_1'],
  defaultValue: false
});

// Check in game logic
if (await flags.isEnabled('new_matchmaking', player.id)) {
  useNewMatchmaking(player);
} else {
  useLegacyMatchmaking(player);
}
```
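Percentage rollouts are usually ramped rather than flipped straight to 100%. A hedged sketch of a ramp helper built on the service above; the step sizes and pause duration are illustrative, and each pause is a window to watch the dashboards and alerts defined earlier:

```javascript
// Ramp a flag through increasing percentages, pausing between steps.
// Step sizes and pauseMs are illustrative assumptions.
async function gradualRollout(flags, flagName, steps = [1, 5, 10, 25, 50, 100], pauseMs = 15 * 60 * 1000) {
  for (const percentage of steps) {
    const flag = await flags.getFlag(flagName);
    if (!flag) throw new Error(`Unknown flag: ${flagName}`);
    await flags.setFlag(flagName, { ...flag, percentage });
    console.log(`[rollout] ${flagName} now at ${percentage}%`);
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
}
```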
```javascript
class PlayerAnalytics {
  constructor(clickhouse) {
    this.ch = clickhouse;
  }

  async trackEvent(event) {
    await this.ch.insert({
      table: 'game_events',
      values: [{
        event_time: new Date(),
        event_type: event.type,
        player_id: event.playerId,
        session_id: event.sessionId,
        properties: JSON.stringify(event.properties),
        region: event.region
      }]
    });
  }

  // NOTE: dates are interpolated into SQL below; validate them (or use the
  // client's query parameters) to avoid injection.
  async getRetentionCohort(startDate, days = 7) {
    return await this.ch.query(`
      SELECT
        e.first_seen AS cohort_date,
        dateDiff('day', e.first_seen, e.event_date) AS day_n,
        uniq(e.player_id) AS players
      FROM (
        SELECT
          player_id,
          toDate(event_time) AS event_date,
          min(toDate(event_time)) OVER (PARTITION BY player_id) AS first_seen
        FROM game_events
        WHERE event_time >= '${startDate}'
      ) AS e
      WHERE e.first_seen >= '${startDate}'
        AND dateDiff('day', e.first_seen, e.event_date) <= ${days}
      GROUP BY cohort_date, day_n
      ORDER BY cohort_date, day_n
    `);
  }

  async getDailyActiveUsers(date) {
    return await this.ch.query(`
      SELECT uniq(player_id) AS dau
      FROM game_events
      WHERE toDate(event_time) = '${date}'
    `);
  }
}
```
| Error | Root Cause | Symptom | Solution |
|---|---|---|---|
| Alert fatigue | Too many alerts | Ops burnout | Tune thresholds, aggregate |
| Metrics gap | Scrape failure | Missing data | Check Prometheus targets |
| Dashboard slow | Too many queries | Load time > 5s | Optimize queries, cache |
| Flag inconsistency | Cache lag | A/B mismatch | Reduce cache TTL |
| Event loss | Queue overflow | Missing events | Scale ingestion, buffer (sketch below) |
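For the event-loss row, the usual fix is batching: buffer events in memory and write them to ClickHouse in bulk instead of one insert per event. A minimal sketch, assuming the `game_events` table used by `PlayerAnalytics` above; batch size and flush interval are illustrative:

```javascript
// Buffered ingestion sketch: one ClickHouse insert per batch instead of
// one per event. maxBatch and flushMs are illustrative assumptions.
class BufferedAnalytics {
  constructor(clickhouse, { maxBatch = 500, flushMs = 2000 } = {}) {
    this.ch = clickhouse;
    this.buffer = [];
    this.maxBatch = maxBatch;
    setInterval(() => this.flush().catch(console.error), flushMs);
  }

  track(row) {
    this.buffer.push(row);
    if (this.buffer.length >= this.maxBatch) this.flush().catch(console.error);
  }

  async flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer.splice(0, this.buffer.length);
    await this.ch.insert({ table: 'game_events', values: batch });
  }
}
```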
```bash
# 1. Check Prometheus targets
curl -s localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'

# 2. Check alert status
curl -s localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state == "firing")'

# 3. Check feature flag state
redis-cli HGETALL feature_flags

# 4. Query recent events
clickhouse-client -q "SELECT * FROM game_events ORDER BY event_time DESC LIMIT 10"

# 5. Check metrics ingestion lag (quote the URL so the shell does not glob '?')
curl -s 'localhost:9090/api/v1/query?query=scrape_duration_seconds'
```
```
Alert Received
      │
      ▼
Acknowledge (< 5 min)
      │
      ▼
Assess Severity ───────┐
      │                │
      ▼                ▼
Critical/High      Medium/Low
      │                │
      ▼                ▼
Page team lead     Log ticket
Start war room     Schedule fix
      │
      ▼
Mitigate (< 30 min)
      │
      ▼
Communicate status
      │
      ▼
Root cause analysis
      │
      ▼
Postmortem (< 48h)
```