Debugging & Production Incidents with AI


When production is on fire, AI can seem like a lifeline. But using AI carelessly during an incident often makes things worse. This post covers five mistakes developers make when using AI to debug or fix production issues, and how to keep your system safe while still leveraging AI’s power.



Mistake 1: Using AI to Fix Production Without a Rollback Plan

Description: Applying AI-suggested fixes directly to production without the ability to roll back.

Realistic Scenario: 5xx errors spike. AI suggests a code change. The developer applies it without preparing a rollback, and the change makes things worse.

Wrong Prompt:

Fix this production error: NullPointerException in payment processing

Developer applies the AI fix directly to production.

⚠️ Why it is wrong: There is no rollback plan; if the fix introduces a new bug, the outage gets longer.

Better Prompt:

Payment service has NullPointerException in production (error rate 15%). Need fix with rollback strategy.

Current state:

  • Last deployment: 2 hours ago

  • Canary: 10% traffic

  • Rollback: kubectl rollout undo (last known good version: v2.3.1)

Plan:

  1. AI suggests fix candidate

  2. Test in staging with production traffic replay

  3. Deploy to canary (10%) for 15 mins

  4. Monitor error rate, latency, CPU

  5. If successful, ramp to 50%, then 100%

  6. Rollback script ready (./scripts/rollback-payment.sh)

Please suggest fix with these constraints.

💡 What changed: Added deployment strategy, rollback plan, and validation steps.
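
The "monitor, then ramp or roll back" steps in the plan above can be written down as an explicit rule, so the decision is not improvised mid-incident. The following is a minimal Java sketch (records need Java 16+), not a real deployment tool: `CanaryGate`, `Sample`, `Decision`, and any thresholds passed to the constructor are hypothetical names and values chosen for illustration.

```java
import java.util.List;

/** Hypothetical canary gate; not part of Kubernetes or any rollout tool. */
class CanaryGate {

    enum Decision { PROMOTE, HOLD, ROLLBACK }

    /** One monitoring sample taken from the canary during the observation window. */
    record Sample(double errorRate, double p99LatencyMillis) {}

    private final double maxErrorRate;
    private final double maxP99LatencyMillis;
    private final int minSamples;

    CanaryGate(double maxErrorRate, double maxP99LatencyMillis, int minSamples) {
        this.maxErrorRate = maxErrorRate;
        this.maxP99LatencyMillis = maxP99LatencyMillis;
        this.minSamples = minSamples;
    }

    Decision evaluate(List<Sample> samples) {
        // Any sample over a threshold means the fix is not safe: roll back.
        for (Sample s : samples) {
            if (s.errorRate() > maxErrorRate || s.p99LatencyMillis() > maxP99LatencyMillis) {
                return Decision.ROLLBACK;
            }
        }
        // Too little data to judge: keep traffic at the current percentage.
        if (samples.size() < minSamples) {
            return Decision.HOLD;
        }
        return Decision.PROMOTE;
    }
}
```

A gate like this would sit between the monitoring query and the rollout command. On `ROLLBACK`, the operator runs the rollback that was prepared before the fix was deployed.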


Mistake 2: AI Suggests Schema Change Under Load

Description: AI recommends a schema migration that causes locks or downtime under production load.

Realistic Scenario: The database connection pool is exhausted during a migration because a long-running ALTER TABLE holds its lock while queries pile up behind it.

Wrong Prompt:

Add new column to users table in production

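⚠️ Why it is wrong: AI may suggest a plain ALTER TABLE users ADD COLUMN ... without considering locking. The statement needs an ACCESS EXCLUSIVE lock, so on a busy 50M-row table it can queue behind long-running transactions and block every query that arrives after it.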

Better Prompt:

Add new column (preferences JSONB) to users table (50M rows, PostgreSQL 14, 2000 QPS).

Requirements:

  • Zero-downtime migration

  • Avoid long-held table locks

  • Use pgroll for online migration (gh-ost targets MySQL, not PostgreSQL)

  • Backfill data in batches (1000 rows per batch)

  • Monitor replication lag during migration

Current approach: Use pgroll with:
ALTER TABLE users ADD COLUMN preferences JSONB DEFAULT '{}';
Followed by batch update script with throttling.

💡 What changed: Specified zero-downtime requirements and appropriate tools.
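
The "backfill in batches" requirement can be sketched as a planner that splits the key space into small ranges, each run as its own short transaction. This is an illustrative Java sketch under stated assumptions: `BackfillPlanner` and `Batch` are hypothetical names, and the `users` table is assumed to have a numeric primary key called `id`, which the post does not state. Note that on PostgreSQL 11 and later, adding a column with a constant default does not rewrite the table, so a backfill like this mainly matters when the column is added as nullable or the value must be computed per row.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that plans a throttled, range-based backfill. */
class BackfillPlanner {

    /** Half-open primary-key range [startInclusive, endExclusive). */
    record Batch(long startInclusive, long endExclusive) {}

    /** Splits [minId, maxId] into ranges that each cover at most batchSize ids. */
    static List<Batch> planBatches(long minId, long maxId, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Batch> batches = new ArrayList<>();
        for (long start = minId; start <= maxId; start += batchSize) {
            long end = Math.min(start + batchSize, maxId + 1);
            batches.add(new Batch(start, end));
        }
        return batches;
    }

    /**
     * SQL for one batch. The values are longs, so concatenation is safe here;
     * real code would normally use a PreparedStatement with bind parameters.
     * Assumes an "id" primary key column (an assumption, not from the post).
     */
    static String updateSql(Batch batch) {
        return "UPDATE users SET preferences = '{}'::jsonb"
                + " WHERE id >= " + batch.startInclusive()
                + " AND id < " + batch.endExclusive()
                + " AND preferences IS NULL";
    }
}
```

Each range touches at most 1000 rows when `batchSize` is 1000 (fewer if the ids have gaps). The throttling from the prompt would come from pausing between batches and checking replication lag before starting the next one.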


Mistake 3: No Observability Data in Prompt

Description: Asking for incident resolution without providing metrics, logs, or traces.

Realistic Scenario: A memory leak appears in production. The developer asks AI for a fix without providing a heap dump or GC logs.

Wrong Prompt:

Fix memory leak in my Java app

⚠️ Why it is wrong: The prompt contains no data for identifying the leak source (caches, thread pools, or connections), so the AI can only guess.

Better Prompt:

Java app (Spring Boot, OpenJDK 17) has memory leak in production.

Observability:

  • Heap usage grows from 2GB to 8GB over 12 hours, then OOM

  • GC logs show Old Gen not being collected

  • Memory leak suspects: Redis cache (no TTL) and WebSocket connections

  • Heap dump analysis: 3GB retained by Redis cache, 2GB by WebSocket sessions

  • Prometheus metrics attached: memory_usage_bytes, active_sessions

Current settings:

  • Xmx: 8GB

  • MaxWebSocketSessions: 10000

  • Redis cache max-size: 10k entries, no TTL

Need solution: add TTL to cache, limit session lifetime, and add metrics.

💡 What changed: Provided heap dump analysis, metrics, and config for targeted fixes.
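
To show why "add TTL to cache" addresses the retention seen in the heap dump, here is a minimal in-memory Java sketch of a cache that bounds both entry count and entry lifetime. It is an illustration of the idea only, not the post's Redis setup: with Redis the expiry would be configured on the Redis side (for example with the `EXPIRE` command). `TtlCache`, `Slot`, and the injectable clock are hypothetical names introduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical bounded cache with per-entry TTL; a sketch, not production code. */
class TtlCache<K, V> {

    /** A cached value plus the time after which it must no longer be served. */
    private record Slot<V>(V value, long expiresAtMillis) {}

    private final long ttlMillis;
    private final LongSupplier clock;
    private final LinkedHashMap<K, Slot<V>> entries;

    TtlCache(long ttlMillis, int maxEntries, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        // Access-ordered map: the eldest entry is the least recently used one,
        // and it is evicted as soon as the size limit is exceeded.
        this.entries = new LinkedHashMap<K, Slot<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Slot<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized void put(K key, V value) {
        entries.put(key, new Slot<>(value, clock.getAsLong() + ttlMillis));
    }

    /** Returns the cached value, or null if it is absent or has expired. */
    synchronized V get(K key) {
        Slot<V> slot = entries.get(key);
        if (slot == null) {
            return null;
        }
        if (clock.getAsLong() >= slot.expiresAtMillis()) {
            entries.remove(key); // expired: drop it so it cannot pin memory
            return null;
        }
        return slot.value();
    }

    synchronized int size() {
        return entries.size();
    }
}
```

In a service the clock would typically be `System::currentTimeMillis`; passing it in keeps expiry testable. This sketch only removes expired entries on read, so a periodic sweep would still be needed for keys that are never requested again.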


Mistake 4: Applying an AI Fix Without Validating It in Staging

Description: Using AI to generate a hotfix that hasn't been tested in staging with production-like data.

Realistic Scenario: AI suggests adding retry logic for database connections. The change is applied to production without testing in staging and causes cascading failures.

Wrong Prompt:

Add retry logic for database connection failures

Developer applies to production without staging test.

⚠️ Why it is wrong: Retry storms can amplify failures; staging test with traffic replay would reveal this.

Better Prompt:

Add retry logic for database connection failures.

Process:

  1. Generate fix with exponential backoff (1s, 2s, 4s), max 3 retries

  2. Deploy to staging with production traffic replay (GoReplay)

  3. Test failure scenarios: kill DB connection, network partition

  4. Verify circuit breaker prevents cascading failures

  5. After staging validation, deploy to production with gradual rollout

Current staging environment mirrors production with same load (2000 req/s).

💡 What changed: Added validation in staging before production deployment.
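
The retry policy named in the prompt (exponential backoff of 1s, 2s, 4s, at most 3 retries) can be sketched in a few lines of Java. This is a minimal illustration, not the fix from the incident: `RetryWithBackoff` is a hypothetical class, it retries on any `RuntimeException` for brevity (real code should retry only transient failures), and it does not include the circuit breaker from step 4. The sleep function is passed in so the policy can be exercised in a test without waiting.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** Hypothetical bounded retry helper with exponential backoff. */
class RetryWithBackoff {

    static final int MAX_RETRIES = 3;
    static final long BASE_DELAY_MILLIS = 1_000L;

    /** Delay before retry number n (1-based): 1000, 2000, 4000 ms. */
    static long delayMillis(int retryNumber) {
        return BASE_DELAY_MILLIS << (retryNumber - 1);
    }

    /**
     * Runs the operation once, then retries up to MAX_RETRIES times.
     * Rethrows the last failure if every attempt fails, so callers
     * (or a surrounding circuit breaker) still see the error.
     */
    static <T> T call(Supplier<T> operation, LongConsumer sleeper) {
        RuntimeException lastFailure = null;
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            if (attempt > 0) {
                sleeper.accept(delayMillis(attempt)); // back off before retrying
            }
            try {
                return operation.get();
            } catch (RuntimeException e) {
                lastFailure = e;
            }
        }
        throw lastFailure;
    }
}
```

The hard cap matters as much as the backoff: with 3 retries, each failing request produces at most four calls to the database, which bounds how much extra load the retries can add during an outage. Whether that bound is low enough is what the staging replay in step 2 is meant to reveal.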


Mistake 5: AI‑Assisted Hotfix Bypassing Code Review

Description: Deploying an AI-generated fix to production without peer review because of urgency.

Realistic Scenario: During a P0 incident, a senior developer uses AI to generate a fix and deploys it without review; the fix introduces another bug.

Wrong Prompt:

Emergency: fix payment processing error NOW

Developer applies and deploys without review.

⚠️ Why it is wrong: AI-generated code that is applied in a rush may have side effects or introduce new bugs that a second reviewer would have caught.

Better Prompt:

Emergency fix for payment processing error.

Process:

  1. Pair with another engineer for code review of AI-generated fix

  2. Document the fix and reasoning in incident ticket

  3. Test in staging with recent production traffic (last 5 min replay)

  4. Deploy with feature flag for instant rollback

  5. Post-incident: write regression test and run security review

Fix requirements: [error details]...

💡 What changed: Maintained review process even during incidents to prevent secondary failures.
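
"Deploy with feature flag for instant rollback" means the hotfix ships next to the old code path and a runtime switch chooses between them, so turning the fix off does not require a redeploy. Below is a minimal Java sketch of that pattern. `FlaggedFix` is a hypothetical class, and the in-process `AtomicBoolean` stands in for whatever flag service the team actually uses.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

/** Hypothetical kill switch that routes calls to the hotfix or the previous code path. */
class FlaggedFix<I, O> {

    private final AtomicBoolean enabled;
    private final Function<I, O> previousPath;
    private final Function<I, O> hotfixPath;

    FlaggedFix(boolean initiallyEnabled, Function<I, O> previousPath, Function<I, O> hotfixPath) {
        this.enabled = new AtomicBoolean(initiallyEnabled);
        this.previousPath = previousPath;
        this.hotfixPath = hotfixPath;
    }

    /** Uses the hotfix while the flag is on, otherwise the previous behavior. */
    O apply(I input) {
        return enabled.get() ? hotfixPath.apply(input) : previousPath.apply(input);
    }

    /** Instant rollback: the next call goes through the previous code path. */
    void disable() {
        enabled.set(false);
    }

    void enable() {
        enabled.set(true);
    }

    boolean isEnabled() {
        return enabled.get();
    }
}
```

Keeping the previous path callable is the point of the design: the reviewer in step 1 can compare both branches side by side, and the regression test from step 5 can run against each.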


Summary & Best Practices

  • Always have a rollback plan before applying any AI‑suggested production change.

  • Use zero‑downtime migration tools for schema changes.

  • Include observability data (logs, metrics, traces) in your incident prompts.

  • Test fixes in staging with production traffic replay before touching production.

  • Maintain code review discipline even during outages—two‑person review saves more time than it costs.

AI can accelerate incident resolution, but only if you integrate it into a safe, controlled process.


AI Coding Assistants: 9 Hidden Traps That Create Technical Debt

Part 4 of 9

AI coding assistants like GitHub Copilot and ChatGPT promise faster development, but they often hide subtle pitfalls that can snowball into serious technical debt. In this series, I’ll break down the 9 most common traps developers fall into when relying on AI-generated code, from misleading abstractions to silent performance issues, and show you how to avoid them.
