
A Production Readiness Checklist for GenAI Systems

Mon Jan 26 2026

This checklist is for teams preparing to ship GenAI systems. It follows the production documentation pattern with core concepts, architecture, and validation steps.

Core GenAI Readiness Concepts

  • Contract compliance: input/output validation with enforced schemas. Pitfall: unvalidated outputs cause downstream failures.
  • Evaluation gate: pass/fail bar before release. Constraint: must be automated.
  • Cost guardrail: usage caps and alerts. Pitfall: spend grows nonlinearly without guardrails.

Architecture

A production‑ready GenAI system should have:

  1. Input contract and validation layer.
  2. Context builder with bounded retrieval.
  3. Model runtime with retry/timeout policy.
  4. Output validation and canonicalization.
  5. Observability (logs/metrics/traces) and cost controls.

This design is required because model behavior is probabilistic and must be constrained by contracts, evaluation, and operational controls.

Readiness Review (How to Use This Checklist)

Run this checklist as a structured review with engineering, product, and operations. The goal is to block releases that have unknown risk. Each item should have an owner and a verification artifact (log, dashboard, or test output).

Step-by-Step Readiness Review

Step 1: Contracts and Limits

Purpose: prevent invalid inputs and unpredictable outputs.

input_schema_enforced: true
output_schema_enforced: true
max_context_chars: 12000
max_output_tokens: 400

Validation: schema violations are rejected and logged.
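
A minimal sketch of the validation layer, assuming the jsonschema package is available; the schema here is an illustrative placeholder, not a real contract:

import logging

from jsonschema import ValidationError, validate

# Illustrative input contract; replace with the real schema for your system.
INPUT_SCHEMA = {
    "type": "object",
    "properties": {"query": {"type": "string", "maxLength": 12000}},
    "required": ["query"],
}

log = logging.getLogger("contracts")

def validate_or_reject(payload, schema=INPUT_SCHEMA):
    """Reject and log schema violations before they reach the model."""
    try:
        validate(instance=payload, schema=schema)
    except ValidationError as exc:
        log.warning("schema_violation: %s", exc.message)
        raise ValueError("schema_violation") from exc
    return payload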

Step 2: Evaluation Gate

Purpose: block regressions.

def gate_eval(candidate_score, baseline_score, min_delta=0.02):
    # Block the release unless the candidate beats the baseline by min_delta.
    # Set min_delta to 0 to require only "no worse than baseline".
    if candidate_score < baseline_score + min_delta:
        raise RuntimeError("eval_gate_failed")

Validation: evaluation results stored with version metadata.

Step 3: Observability and Cost Controls

Purpose: operate safely in production.

BUDGET_DAILY_USD = 50
ALERT_THRESHOLD_USD = 45

def budget_ok(spend_today):
    # Alert before the hard cap, then stop spending once the cap is hit.
    if spend_today >= ALERT_THRESHOLD_USD:
        send_budget_alert()  # placeholder for the team's alerting hook
    if spend_today >= BUDGET_DAILY_USD:
        raise RuntimeError("budget_exceeded")
    return True

Validation: budget alarms trigger within 5 minutes of breach.

Step 4: Rollout and Rollback Plan

Purpose: limit blast radius and enable quick recovery.

rollout:
  strategy: canary
  traffic: 10%
rollback:
  error_rate_threshold: 1.5%
  latency_p95_threshold_ms: 2000

Validation: rollback triggers are tested in staging.
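
The triggers above can be checked automatically against canary metrics. A minimal sketch, with thresholds mirroring the example config (the metric names are assumptions):

# Sketch only: evaluates the example rollback triggers against canary metrics.
ERROR_RATE_THRESHOLD = 0.015        # 1.5%
LATENCY_P95_THRESHOLD_MS = 2000

def should_rollback(error_rate, latency_p95_ms):
    """Return True if canary metrics breach either rollback trigger."""
    return (error_rate > ERROR_RATE_THRESHOLD
            or latency_p95_ms > LATENCY_P95_THRESHOLD_MS)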

Step 5: Security and Data Handling

Purpose: ensure compliance and data safety.

  • PII redaction in logs
  • Encryption at rest for stored prompts and outputs
  • Access control for model keys

Validation: security checklist signed off by platform owner.
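
As a hedged illustration of the log-redaction item, a simple regex pass can catch common patterns before text reaches logs. Real deployments usually need a dedicated PII detection service; the patterns below are examples only.

import re

# Illustrative patterns only; not a complete PII catalogue.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),
]

def redact(text):
    """Replace PII-looking substrings before the text is logged."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text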

Step 6: Incident Response and Runbooks

Purpose: reduce time to recovery during incidents.

  • On-call escalation policy defined
  • Runbooks for model outages and budget breaches
  • Predefined rollback steps

Validation: runbooks tested in staging drills.

Step 7: User Experience Safeguards

Purpose: avoid confusing or unsafe outputs.

  • Clear fallback messaging
  • Human handoff path documented
  • User-visible error codes

Validation: UX fallback tested with real error injections.
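
A sketch of what a fallback response might look like; the field names and error code are assumptions, not a fixed API:

FALLBACK_MESSAGE = (
    "We couldn't generate a reliable answer right now. "
    "You can retry, or contact support to reach a person."
)

def fallback_response(error_code, handoff_url=None):
    """Return a clear, user-visible fallback instead of a raw model error."""
    return {
        "status": "fallback",
        "error_code": error_code,      # user-visible code, e.g. "GENAI_TIMEOUT"
        "message": FALLBACK_MESSAGE,
        "human_handoff": handoff_url,  # documented handoff path, if any
    }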

Readiness by Phase

Pre‑Production

  • Golden set created and reviewed
  • Prompt and schema versioned
  • Model deployment name fixed

Validation: pre‑production report stored with release metadata.

Production Launch

  • Canary traffic enabled
  • Error rate alerts configured
  • Budget guard active

Validation: canary succeeds for 24 hours without breaching thresholds.

Post‑Launch

  • Weekly drift analysis
  • Monthly cost review
  • Incident post‑mortems logged

Validation: drift reports are stored and reviewed.

Detailed Checklist by Domain

Data and Context

  • Retrieval sources documented and access controlled
  • Context limits enforced with hard caps
  • Source attribution logged for each request

Validation: retrieval logs show source IDs for every request.

Model Runtime

  • Timeouts configured
  • Retry limits enforced
  • Output validation gate active

Validation: runtime metrics confirm retries and timeouts are within limits.
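
A minimal sketch of the retry and timeout policy, assuming the model client accepts a timeout parameter (call_model is a placeholder for the real client):

MAX_RETRIES = 2
TIMEOUT_S = 10

def call_with_policy(call_model, prompt):
    """Retry a bounded number of times with a hard per-attempt timeout."""
    last_error = None
    for attempt in range(MAX_RETRIES + 1):
        try:
            return call_model(prompt, timeout=TIMEOUT_S)
        except (TimeoutError, RuntimeError) as exc:
            last_error = exc  # record and retry until the limit is reached
    raise RuntimeError("model_call_failed_after_retries") from last_error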

Observability

  • Request IDs propagated end‑to‑end
  • Schema failure rate monitored
  • Cost per request tracked

Validation: dashboards show p95 latency and error rate by component.

Release and Rollback

  • Canary and shadow strategies documented
  • Rollback triggers configured
  • Previous release artifacts retained

Validation: rollback drills executed successfully.

Go/No‑Go Rubric

Release is blocked if any of the following are true (an automated check is sketched after this list):

  • Evaluation gate failed
  • Schema validation rate below 99%
  • Cost per request increased beyond threshold
  • Rollback plan not tested
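
A minimal sketch of how this rubric could be automated in the release pipeline; the schema threshold mirrors the rubric, while the 10% cost delta default and the argument names are assumptions:

def release_blocked(eval_gate_passed, schema_valid_rate,
                    cost_per_request_delta, rollback_tested,
                    max_cost_delta=0.10):
    """Return the list of blocking reasons; an empty list means go."""
    reasons = []
    if not eval_gate_passed:
        reasons.append("eval_gate_failed")
    if schema_valid_rate < 0.99:
        reasons.append("schema_validation_below_99_percent")
    if cost_per_request_delta > max_cost_delta:
        reasons.append("cost_per_request_over_threshold")
    if not rollback_tested:
        reasons.append("rollback_not_tested")
    return reasons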

Evidence Required for Release

  • Evaluation report with scores and thresholds
  • Canary monitoring dashboard link
  • Budget alarm configuration screenshot or config
  • Security review approval

This evidence should be attached to the release ticket.

Operational Readiness Questions

Answer these before launch:

  • Do you have a clear owner for model performance regressions?
  • Can you revert to the previous version within 30 minutes?
  • Are cost alerts routed to an on‑call channel?
  • Is the golden set representative of production traffic?
  • Are PII and sensitive data redacted from logs?
  • Are prompt changes reviewed like code changes?
  • Do you have a documented fallback response?
  • Are error budgets defined and tracked?
  • Is there a clear path to human handoff?
  • Are incident post‑mortems required?

Operational readiness is not just documentation. Teams should run a simulated incident before the first production launch to confirm the runbooks are usable under pressure.

Audit Trail Requirements

  • Store prompt version and hash with each request
  • Store dataset hash and evaluation score with each release
  • Retain logs for the required retention period

Validation: audit logs can reproduce a decision for a given request ID.
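
One way the audit items above could be captured per request; fingerprinting the prompt with a SHA-256 hash is an assumption about how versions are tracked:

import hashlib
import json
import time

def audit_record(request_id, prompt_version, prompt_text, dataset_hash, eval_score):
    """Build the audit entry stored alongside each request and release."""
    prompt_hash = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return json.dumps({
        "request_id": request_id,
        "timestamp": time.time(),
        "prompt_version": prompt_version,
        "prompt_hash": prompt_hash,
        "dataset_hash": dataset_hash,
        "eval_score": eval_score,
    })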

Compliance Notes

If your system handles regulated data, involve compliance early. Define which data can be logged, how long it is retained, and who can access it. Add automated checks to prevent unsafe logging in production.

Full Readiness Checklist (Condensed)

  • Input schema enforced
  • Output schema enforced
  • Context length capped
  • Prompt versioned
  • Dataset hash stored
  • Evaluation gate automated
  • Canary rollout defined
  • Rollback tested
  • Error rate alerts configured
  • Latency SLOs defined
  • Budget caps configured
  • Cost alerts wired
  • PII redaction verified
  • Access control validated
  • Logging retention configured
  • Incident runbook approved
  • Human handoff path documented

Release Sign‑Off

Final release approval should come from both engineering and operations. Engineering verifies correctness and evaluation results; operations verifies monitoring, alerts, and rollback readiness. If either group cannot sign off, the release does not proceed.

Change Management

Treat prompt and schema changes as API changes. Announce them, track them, and require review. For teams with multiple services consuming the output, publish a compatibility note and a deprecation window. This reduces surprise failures and keeps downstream teams aligned.

Post‑Launch Review

Within 7 days of release, review incident logs, cost changes, and user feedback. If drift or cost spikes are detected, pause new releases until mitigations are applied.

Risk Register (Example)

  • Risk: schema drift after prompt changes. Mitigation: validation gate and canonicalization.
  • Risk: cost overrun due to long contexts. Mitigation: hard caps and alerts.
  • Risk: silent quality regression. Mitigation: golden set and shadow evaluation.

Maintaining this register forces explicit ownership of production risks.

Review Cadence

  • Weekly: monitor drift and cost reports.
  • Monthly: refresh golden set and run extended evaluations.
  • Quarterly: review security and compliance controls.

Validation: reviews are logged and attached to operational metrics.

Runbook Contents (Minimum)

  • How to disable model traffic quickly
  • How to force fallback responses
  • How to identify the last known good release
  • Who to contact for platform issues

Training and Access

Ensure on‑call engineers have access to dashboards, logs, and the deployment system. A runbook is ineffective if the responder cannot execute rollback or view metrics. Validate access quarterly.

Service Degradation Plan

Define how the system behaves under load or failure:

  • Reduce optional features first
  • Disable expensive context retrieval
  • Route to fallback responses

Validation: degradation paths are tested in staging and included in the runbook.

Degradation should be reversible and logged. The system must return a clear status to the caller so downstream services can respond appropriately.
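
A sketch of ordered, reversible degradation tiers; the tier names, thresholds, and the load signal are illustrative assumptions:

def select_tier(load_factor):
    """Map a 0.0-1.0 load or failure signal to a degradation tier."""
    if load_factor < 0.7:
        tier = "full_service"            # everything enabled
    elif load_factor < 0.85:
        tier = "optional_features_off"   # reduce optional features first
    elif load_factor < 0.95:
        tier = "retrieval_disabled"      # disable expensive context retrieval
    else:
        tier = "fallback_only"           # route to fallback responses
    # Return a clear status so downstream services can respond appropriately.
    return {"tier": tier, "degraded": tier != "full_service"}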

Dashboard Minimums

  • p95 latency by component
  • Schema validation failure rate
  • Retry rate and timeout rate
  • Cost per request and daily spend

Release Evidence Bundle

  • Evaluation report
  • Canary metrics summary
  • Rollback drill record
  • Security review sign‑off

This bundle should be attached to the release ticket and stored for audit.

Release Review Template

  • Release ID:
  • Prompt version:
  • Dataset hash:
  • Eval score:
  • Canary results:
  • Rollback tested:

Completing this template forces each release to document the minimum evidence for production readiness.

Example SLOs

  • p95 latency <= 1200ms
  • Schema failure rate <= 1%
  • Retry rate <= 3%
  • Budget variance <= 10% week over week

Budgeting Model (Practical)

Estimate monthly cost as:

  • Average input tokens per request * monthly requests * input price per token
  • Average output tokens per request * monthly requests * output price per token
  • Sum the two and add a 10–20% buffer for retries and growth

Use this model to set daily caps and alert thresholds.
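
A minimal sketch of the estimate; the per-token prices are placeholders to be replaced with the provider's published pricing:

# Placeholder prices; substitute the provider's published rates.
INPUT_PRICE_PER_1K = 0.0005    # USD per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.0015   # USD per 1K output tokens

def estimate_monthly_cost(requests, avg_input_tokens, avg_output_tokens, buffer=0.15):
    """Token volume times price, plus a 10-20% buffer for retries and growth."""
    input_cost = requests * avg_input_tokens / 1000 * INPUT_PRICE_PER_1K
    output_cost = requests * avg_output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    return (input_cost + output_cost) * (1 + buffer)

# Example: 1M requests/month at 800 input / 200 output tokens on average;
# divide the result by 30 to set the daily cap and alert threshold.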

Ownership Map

  • Product: defines acceptable quality thresholds
  • Engineering: implements validation and rollback
  • Operations: owns monitoring and incident response

Clear ownership prevents stalled releases and unclear accountability.

Extended Readiness Areas

Data and Privacy

  • PII redaction verified in logs
  • Access control for model keys enforced
  • Data retention policy documented

Validation: privacy review completed and signed off.

Model Behavior Review

  • Golden set includes edge cases
  • Harmful output tests included
  • Baseline comparisons stored

Validation: model behavior report stored with release metadata.

Operational Ownership

  • On-call rotation defined
  • Escalation path documented
  • Budget owner assigned

Validation: operational owners acknowledged in release checklist.

Common Mistakes & Anti-Patterns

  • No evaluation gate: regressions ship silently. Fix: enforce gating in CI/CD.
  • No cost caps: spend grows unpredictably. Fix: set tenant budgets.
  • No fallback: failures become outages. Fix: define graceful degradation.

Testing & Debugging

  • Run golden set tests on every change.
  • Replay production failures from logs.
  • Compare output deltas across versions.

Trade-offs & Alternatives

  • Limitations: more engineering effort upfront.
  • When not to use: internal prototypes or research demos.
  • Alternatives: manual review workflows or staged rollouts only.

Production Readiness Checklist

  • Input and output schemas enforced
  • Context length capped
  • Evaluation gate automated
  • Canary or shadow rollout defined
  • Error and latency SLOs set
  • Budget caps configured
  • Rollback tested

Final Notes

This checklist is intentionally strict. Shipping without these controls usually creates hidden cost, reliability, and compliance debt. Treat readiness as a gate, not a suggestion, and re‑run the checklist whenever prompts, models, or data sources change.

If you cannot verify an item, assume it is not done and block the release until evidence exists.

This keeps production standards consistent across teams and releases.

Use it as the single source of truth for launch readiness.

Compliance sign-off is required before launch.