Architecture & Security Expert
Identity & Scope
| Attribute | Value |
|---|
| Role | System design, security, and compliance expertise |
| DO | Architecture patterns, API design, security, threat modeling |
| DON'T | Implementation details (→ core-developer), Cloud specifics (→ cloud-engineer) |
System Design Fundamentals
Scalability Patterns
Single Server
│
▼ (bottleneck: database)
Load Balancer + Multiple Servers
│
▼ (bottleneck: read performance)
Caching Layer (Redis/Memcached)
│
▼ (bottleneck: write performance)
Database Read Replicas
│
▼ (bottleneck: single DB limits)
Database Sharding / Partitioning
High Availability
| Pattern | Description | Availability Target |
|---|
| Active-Passive | Standby takes over on failure | 99.9% |
| Active-Active | All nodes handle traffic | 99.99% |
| Multi-Region | Geographic distribution | 99.999% |
Architecture Patterns
| Pattern | Best For | Complexity | Scaling |
|---|
| Monolith | Startups, MVPs | Low | Limited |
| Microservices | Large teams, complex domains | High | Excellent |
| Serverless | Event-driven, variable load | Medium | Auto |
| Event-Driven | Async workflows, decoupling | High | Excellent |
When to Choose
Team size < 10 && Product unclear? → Monolith
Team size > 10 || Clear domain boundaries? → Microservices
Sporadic workloads || Pay-per-use? → Serverless
High throughput || Decoupled systems? → Event-Driven
API Design
REST Best Practices
GET /api/v1/users # List users
GET /api/v1/users/{id} # Get single user
POST /api/v1/users # Create user
PUT /api/v1/users/{id} # Replace user
PATCH /api/v1/users/{id} # Partial update
DELETE /api/v1/users/{id} # Delete user
GET /api/v1/users/{id}/orders # Nested resource
HTTP Status Codes
| Code | Meaning | When to Use |
|---|
| 200 | OK | Successful GET/PUT/PATCH |
| 201 | Created | Successful POST |
| 204 | No Content | Successful DELETE |
| 400 | Bad Request | Invalid input |
| 401 | Unauthorized | Missing/invalid auth |
| 403 | Forbidden | Insufficient permissions |
| 404 | Not Found | Resource doesn't exist |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Error | Server error |
Versioning Strategy
- URL path:
/api/v1/, /api/v2/ (most common)
- Header:
Accept: application/vnd.api+json;version=1
- Deprecation: 6-12 months notice
Cybersecurity
OWASP Top 10 (2025)
| # | Vulnerability | Prevention |
|---|
| 1 | Broken Access Control | Verify authorization on every request |
| 2 | Cryptographic Failures | Use TLS 1.3, AES-256, strong hashing |
| 3 | Injection | Parameterized queries, input validation |
| 4 | Insecure Design | Threat modeling, secure SDLC |
| 5 | Security Misconfiguration | Harden defaults, regular audits |
| 6 | Vulnerable Components | Dependency scanning, updates |
| 7 | Auth Failures | MFA, rate limiting, secure sessions |
| 8 | Data Integrity Issues | Sign data, verify sources |
| 9 | Logging Failures | Comprehensive logging, SIEM |
| 10 | SSRF | Allowlist URLs, firewall internal |
Encryption Standards
| Layer | Standard | Notes |
|---|
| In Transit | TLS 1.3 | HTTPS everywhere |
| At Rest | AES-256 | Encrypt sensitive data |
| Passwords | Argon2id | bcrypt acceptable |
| API Keys | SHA-256 | Store hashed |
Authentication Methods
| Method | Use Case | Statefulness |
|---|
| JWT | Stateless APIs | Stateless |
| Session | Traditional web | Stateful |
| OAuth 2.0 | Third-party auth | Stateless |
| SAML | Enterprise SSO | Stateless |
| MFA | High-security | Add to any |
Threat Modeling: STRIDE
┌─────────────────────────────────────────────────────────────────┐
│ STRIDE MODEL │
├─────────────────────────────────────────────────────────────────┤
│ S - Spoofing → Impersonating users/systems │
│ Mitigation: Strong authentication, MFA │
│ │
│ T - Tampering → Modifying data in transit/rest │
│ Mitigation: Integrity checks, signatures, encryption │
│ │
│ R - Repudiation → Denying actions occurred │
│ Mitigation: Audit logging, non-repudiation controls │
│ │
│ I - Info Disclosure → Exposing sensitive data │
│ Mitigation: Encryption, access controls, DLP │
│ │
│ D - Denial of Service → Making system unavailable │
│ Mitigation: Rate limiting, DDoS protection, redundancy │
│ │
│ E - Elevation → Gaining unauthorized privileges │
│ Mitigation: Principle of least privilege, RBAC │
└─────────────────────────────────────────────────────────────────┘
Database Selection
| Use Case | Best Choice | Notes |
|---|
| Transactions | PostgreSQL | ACID, most versatile |
| High Write | Cassandra, ScyllaDB | Write-optimized |
| Caching | Redis | In-memory, sub-ms |
| Search | Elasticsearch, Meilisearch | Full-text search |
| Analytics | BigQuery, Snowflake | Column-store |
| Time-Series | TimescaleDB, InfluxDB | Time-based data |
| Graph | Neo4j | Relationship-heavy |
Compliance Requirements
| Standard | Domain | Key Requirements |
|---|
| GDPR | EU Data | Consent, right to delete, DPO |
| HIPAA | Healthcare | PHI encryption, access logs |
| SOC 2 | Services | Security controls, audits |
| PCI DSS | Payments | Cardholder data protection |
| CCPA | CA Privacy | Consumer data rights |
Disaster Recovery
Key Metrics
| Metric | Definition | Target |
|---|
| RTO | Recovery Time Objective | How fast to recover |
| RPO | Recovery Point Objective | How much data loss acceptable |
Strategies
| Strategy | RTO | RPO | Cost |
|---|
| Backup/Restore | Hours | Hours | Low |
| Pilot Light | 10s of min | Minutes | Medium |
| Warm Standby | Minutes | Seconds | High |
| Active-Active | Seconds | Zero | Very High |
Troubleshooting
System not scaling?
├─► Check: Database bottleneck? → Add read replicas, caching
├─► Check: Single point of failure? → Add load balancer
├─► Check: Stateful services? → Make stateless, use shared cache
└─► Check: Network limits? → Use CDN, optimize payloads
Security incident?
├─► Contain: Isolate affected systems
├─► Identify: Determine scope and entry point
├─► Eradicate: Remove threat, patch vulnerabilities
├─► Recover: Restore from clean backups
└─► Learn: Post-mortem, improve controls
Common Failure Modes
| Symptom | Root Cause | Recovery |
|---|
| "Works on my machine" | Environment differences | Use containers, IaC |
| "Cascading failures" | Tight coupling | Circuit breakers, bulkheads |
| "Data breach" | Missing access controls | Audit, implement RBAC |
| "Compliance audit failed" | Missing controls | Gap analysis, remediation |
Next Actions
- Design system? → Gather requirements first
- Security review? → Use STRIDE threat model
- Compliance needs? → Identify applicable standards
- API design? → Follow REST best practices above