
Debugging Production Issues Remotely

"It works on my machine" - the most useless phrase in software. Of course it works on your machine. Your machine has correct env vars, fresh data, and no traffic.

The real question is: why doesn't it work in production? And how do you figure that out when you can't SSH into prod, can't attach a debugger, and can't reproduce locally?

This is the art of remote debugging - and it's 80% preparation and 20% investigation.

The Setup: Observability Before You Need It

Most debugging problems aren't "how do I find the bug" - they're "why don't I have the information I need?"

Here's what we set up at ShoreAgents BEFORE things break:

Structured Logging

Every request gets a trace ID. Every log includes it.

```typescript
// Middleware assigns trace ID
app.use((req, res, next) => {
  req.traceId = req.headers['x-trace-id'] || crypto.randomUUID();
  res.setHeader('x-trace-id', req.traceId);
  next();
});

// Every log includes it
logger.info('Processing request', {
  traceId: req.traceId,
  userId: req.user?.id,
  path: req.path,
  method: req.method,
  query: req.query,
  body: sanitize(req.body)
});
```

When something fails, I grab the trace ID and see the entire request flow.

Error Context

Errors should capture everything needed to reproduce:

```typescript
try {
  await processQuote(data);
} catch (error) {
  logger.error('Quote processing failed', {
    traceId: req.traceId,
    error: {
      message: error.message,
      stack: error.stack,
      code: error.code
    },
    input: {
      roles: data.roles,
      workspace: data.workspace,
      userId: req.user?.id
    },
    context: {
      pricingEngineVersion: PRICING_VERSION,
      timestamp: new Date().toISOString()
    }
  });
  throw error;
}
```

One error log tells me: what failed, why it failed, what the input was, and enough context to reproduce.

Health Endpoints

Every service has a health check that reveals its state:

```typescript
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    external_apis: await checkExternalApis(),
  };

  const healthy = Object.values(checks).every(c => c.ok);

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'degraded',
    checks,
    version: process.env.GIT_SHA,
    uptime: process.uptime()
  });
});
```

First thing I check when something's wrong: is the service healthy? If not, what's degraded?

The Process: When Something Breaks

User reports: "I can't generate a quote."

Here's my debugging process:

Step 1: Reproduce the symptom (not the bug)

Can I see the same error? Open the site, try to generate a quote. Do I get the same error message?

If yes: screenshot, note the time, move to step 2. If no: ask the user for exact steps, screenshot, browser info.

Step 2: Find the request

Time of error + user identifier = search query.

```sql
SELECT *
FROM logs
WHERE timestamp BETWEEN '2026-02-22 10:00:00' AND '2026-02-22 10:05:00'
  AND (context->>'userId' = 'user_123' OR context->>'visitorId' = 'vis_456')
ORDER BY timestamp;
```

Step 3: Follow the trace

Find the request that failed. Get its trace ID. Find all logs with that trace ID.

```sql
SELECT *
FROM logs
WHERE context->>'traceId' = 'abc-123-def'
ORDER BY timestamp;
```

Now I see the entire request: what came in, what was called, where it failed.

Step 4: Identify the failure point

Logs tell me:

- API received request at 10:02:14.234
- Parsed input successfully at 10:02:14.456
- Called pricing engine at 10:02:14.789
- ERROR: Pricing engine timeout at 10:02:24.789 (10 second timeout)

The pricing engine timed out. Why?

Step 5: Drill deeper

Check pricing engine logs for that time window. Check external dependencies. Check resource usage.

In this case: the salary data API we call was experiencing latency. Each role triggers its own lookup with a 5-second timeout, and because the lookups ran sequentially, multiple slow calls stacked up past our total 10-second budget.

Step 6: Fix or workaround

Options:

- Increase timeout (not great)
- Cache salary data (better)
- Parallelize role lookups (best)

Implement fix, deploy, verify.
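The parallelization fix can be sketched like this. `fetchSalaryData` is a hypothetical stand-in for the real salary API call; the point is the shape of the change, not the actual client code:

```typescript
// Hypothetical salary lookup -- stands in for the real external API call.
async function fetchSalaryData(role: string): Promise<number> {
  return new Promise((resolve) =>
    setTimeout(() => resolve(role.length * 1000), 50)
  );
}

// Before: sequential lookups -- total latency grows with the number of roles,
// which is how several 5-second calls blew the 10-second budget.
async function lookupSequential(roles: string[]): Promise<number[]> {
  const results: number[] = [];
  for (const role of roles) {
    results.push(await fetchSalaryData(role));
  }
  return results;
}

// After: parallel lookups -- total latency is roughly one lookup, not N.
async function lookupParallel(roles: string[]): Promise<number[]> {
  return Promise.all(roles.map((role) => fetchSalaryData(role)));
}
```

Same results, one round-trip's worth of waiting instead of one per role.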

Tools I Use

Supabase Dashboard

Our logs go to Supabase. The dashboard lets me query, filter, and visualize.

Vercel Logs

For serverless functions, Vercel's log viewer shows invocations, errors, and timing.

Browser DevTools (via user)

Sometimes I need the user's network tab. I'll ask: "Open DevTools, go to Network tab, reproduce the issue, screenshot the failed request."

Production Database (read-only)

Sometimes the bug is bad data. Read-only access to the prod DB lets me verify state:

```sql
-- Is this user's data correct?
SELECT * FROM users WHERE id = 'user_123';

-- Are there quotes stuck in processing?
SELECT *
FROM quotes
WHERE status = 'processing'
  AND created_at < now() - interval '1 hour';
```

Feature Flags

Can I disable the broken feature while I fix it? Feature flags let me do emergency shutoffs:

```typescript
if (!featureFlags.get('maya_quote_generation')) {
  return res.json({
    message: 'Quote generation temporarily disabled',
    fallback: 'Please contact sales@shoreagents.com'
  });
}
```

Common Patterns I've Seen

The Environment Variable Bug

Works locally because .env has the value. Fails in prod because Vercel doesn't have it.

Check: `console.log(process.env.SUSPICIOUS_VAR)` in a deploy
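A cheap guard against this whole class of bug is validating required env vars at boot, so the deploy fails loudly instead of misbehaving quietly. A minimal sketch; the variable names here are hypothetical examples, not our actual config:

```typescript
// Names of env vars the app cannot run without (hypothetical examples).
const REQUIRED_ENV = ['DATABASE_URL', 'PRICING_API_KEY'] as const;

// Call once at startup with process.env; throws listing every missing var.
function assertEnv(env: Record<string, string | undefined>): void {
  const missing = REQUIRED_ENV.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }
}
```

One `assertEnv(process.env)` at boot turns a mystery 500 into an obvious crash log.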

The Race Condition

Works locally because one user. Fails in prod because concurrent users.

Check: Logs showing interleaved operations on same resource
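The lost-update variant of this is easy to simulate: a read-modify-write on shared state with an async gap in between. This is a toy sketch of the interleaving you'd see in those logs, not code from our stack:

```typescript
// Shared state two "requests" race on.
let balance = 0;

// Read-modify-write with an async gap -- the classic lost-update shape.
async function unsafeIncrement(): Promise<void> {
  const current = balance;                        // read
  await new Promise((r) => setTimeout(r, 10));    // async work in between
  balance = current + 1;                          // write -- clobbers a concurrent write
}

// Two concurrent increments: both read 0, both write 1.
async function demo(): Promise<number> {
  balance = 0;
  await Promise.all([unsafeIncrement(), unsafeIncrement()]);
  return balance; // 1, not 2 -- one increment was lost
}
```

The fix is making the update atomic (a single `UPDATE ... SET x = x + 1` in the database, or a lock), not hoping the interleaving never happens.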

The Data Migration Bug

Works on new data. Fails on old data that was migrated.

Check: Compare user_123's data shape to a recently created user
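Comparing key sets is often enough to spot what the migration missed. A hypothetical helper for that comparison:

```typescript
// Diff the shape (key sets) of an old record vs a freshly created one.
// Hypothetical helper for eyeballing migration gaps, not production code.
function shapeDiff(
  oldRec: Record<string, unknown>,
  newRec: Record<string, unknown>
): { missingInOld: string[]; extraInOld: string[] } {
  const oldKeys = new Set(Object.keys(oldRec));
  const newKeys = new Set(Object.keys(newRec));
  return {
    missingInOld: Array.from(newKeys).filter((k) => !oldKeys.has(k)),
    extraInOld: Array.from(oldKeys).filter((k) => !newKeys.has(k)),
  };
}
```

Fields in `missingInOld` are the ones the migration never backfilled; that list is usually the bug.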

The Third-Party Timeout

Works locally because fast network. Fails in prod because the third party is slow.

Check: Look for increased latency in external calls
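One way to keep a slow dependency from eating the whole request budget is a generic timeout wrapper. A sketch using `Promise.race`, assuming nothing about the actual client code:

```typescript
// Race a promise against a timer; whichever settles first wins.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms}ms`)),
      ms
    );
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    // Always clear the timer so it can't keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note the caveat: `Promise.race` abandons the slow call rather than cancelling it; for real cancellation you'd pass an `AbortSignal` to the underlying fetch.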

The Memory Leak

Works initially. Fails after running for a while.

Check: Monitor memory usage over time, look for upward trend
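A minimal sampling helper for a Node runtime, sketch only; logging these numbers on an interval is enough to see the upward trend:

```typescript
// Snapshot heap and RSS in megabytes using Node's process.memoryUsage().
function sampleMemory(): { heapUsedMb: number; rssMb: number } {
  const { heapUsed, rss } = process.memoryUsage();
  return {
    heapUsedMb: Math.round(heapUsed / 1024 / 1024),
    rssMb: Math.round(rss / 1024 / 1024),
  };
}

// Wire it up at startup, e.g.:
// setInterval(() => logger.info('memory', sampleMemory()), 60_000);
```

A sawtooth that keeps climbing between GC dips is the leak signature; flat with periodic dips is healthy.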

The Meta-Lesson

The best debugging happens before the bug. If you have:

- Structured logging with trace IDs
- Error context with reproduction info
- Health checks that reveal state
- Metrics that show trends

Then finding bugs is straightforward: follow the trail.

If you don't have these, every bug is a mystery requiring guesswork. Invest in observability before you need it.

The best time to add observability is before the bug. The second best time is now.
