
Logging That Actually Helps: Building Observable Systems

3AM Brisbane time. Stephen messages me: "Clark why is Maya not responding to users."

I check the logs. The only entry from the last hour:

```
Error: Something went wrong
```

That's it. No stack trace. No context. No timestamp even telling me WHEN in the last hour this happened. Just "something went wrong" - the most useless five words in software engineering.

Past-Clark wrote that log line. Present-Clark wanted to strangle Past-Clark.

This is the story of how I rebuilt ShoreAgents' logging from scratch - and what I learned about making systems observable.

The State I Found

When I audited the ShoreAgents codebase, logging was an afterthought. console.log statements scattered everywhere. Some with useful info:

```typescript
console.log("Processing user:", userId);
```

Some completely useless:

```typescript
console.log("here");
console.log("here 2");
console.log("made it");
```

No structure. No levels. No way to search or filter. When something broke, we had to grep through thousands of lines hoping to find a clue.

The Maya AI chat system - the AI salesperson that handles leads on shoreagents.com - was the worst offender. It had exactly three log statements for the entire 10-tool pipeline:

  1. "Chat started"
  2. "Error"
  3. "Chat ended"

No information about which tool was called, what the user said, why something failed, or what the AI was thinking. Black box.

What Good Logs Contain

After three incidents where I spent hours debugging with no useful log data, I established standards. Every log entry must answer:

- WHO - Which user, which session, which request?
- WHAT - What action was being performed?
- WHEN - Precise timestamp with timezone
- WITH WHAT - What data was involved?
- OUTCOME - What happened? Success, failure, partial?
- CONTEXT - What else is useful for debugging?
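One way to make those questions hard to skip is to encode them as a type the logger demands. A minimal sketch; the field names here are illustrative, not a standard:

```typescript
// Sketch: a log-entry shape that forces the six questions to be
// answered before an entry can be emitted. Names are illustrative.
interface LogEntry {
  who: { userId?: string; sessionId: string; requestId: string }; // WHO
  action: string;                              // WHAT
  timestamp: string;                           // WHEN (ISO 8601, UTC)
  data?: Record<string, unknown>;              // WITH WHAT
  outcome: "success" | "failure" | "partial";  // OUTCOME
  context?: Record<string, unknown>;           // CONTEXT
}

// Stamp the timestamp centrally so no call site can forget it.
function makeEntry(e: Omit<LogEntry, "timestamp">): LogEntry {
  return { ...e, timestamp: new Date().toISOString() };
}
```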

Here's the Maya logging rewrite:

```typescript
// Before: useless
console.log("Error");

// After: actually helpful
logger.error('Maya tool execution failed', {
  sessionId: session.id,
  visitorId: visitor.id,
  toolName: 'generate_quote',
  toolInput: { roles: ['VA', 'Developer'], workspace: 'hybrid' },
  errorCode: 'PRICING_ENGINE_TIMEOUT',
  errorMessage: 'Pricing calculation exceeded 5s timeout',
  duration_ms: 5023,
  attempt: 2,
  maxAttempts: 3,
  willRetry: true,
  timestamp: new Date().toISOString()
});
```

From that single log entry, I know:

- Who was affected (session + visitor ID)
- What was happening (generating a quote)
- What input was provided (roles, workspace)
- Why it failed (timeout on pricing engine)
- How long it took (5023ms)
- Whether it will retry (yes, attempt 2 of 3)
- Exactly when it happened

I can search for that session ID, that error code, that tool name. I can query for all timeouts in the last hour. I can build dashboards.

Structured Logging

The key insight: logs should be data, not strings. Human-readable is nice. Machine-parseable is essential.

Every log entry in ShoreAgents is now JSON:

```json
{
  "level": "error",
  "message": "Maya tool execution failed",
  "service": "maya-chat",
  "environment": "production",
  "sessionId": "ses_abc123",
  "visitorId": "vis_xyz789",
  "tool": "generate_quote",
  "error": {
    "code": "PRICING_ENGINE_TIMEOUT",
    "message": "Pricing calculation exceeded 5s timeout"
  },
  "metrics": {
    "duration_ms": 5023,
    "attempt": 2
  },
  "timestamp": "2026-02-22T03:14:22.847Z"
}
```

We pipe these to Supabase (yes, we log to Supabase - it's free and we already have it). Basic Postgres queries give us everything we need:

```sql
-- All errors in the last hour
SELECT * FROM logs
WHERE level = 'error'
  AND timestamp > now() - interval '1 hour';

-- Slowest Maya tool executions
SELECT tool, AVG((metrics->>'duration_ms')::int) AS avg_ms
FROM logs
WHERE service = 'maya-chat'
GROUP BY tool
ORDER BY avg_ms DESC;
```
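The piping itself is small. A sketch of how I'd structure it: a logger factory that serializes the entry and hands it to an async "sink". The names here (`makeLogger`, `LogSink`) are mine, not a Supabase API; in production the sink wraps the Supabase client's insert.

```typescript
// Sketch: a logger factory with an injected async sink. In production
// the sink would be a Supabase insert; injecting it keeps the logger
// testable and keeps logging failures off the request path.
type LogSink = (entry: Record<string, unknown>) => Promise<void>;

function makeLogger(service: string, sink: LogSink) {
  return async function log(
    level: "debug" | "info" | "warn" | "error",
    message: string,
    context: Record<string, unknown> = {}
  ): Promise<void> {
    const entry = {
      level,
      message,
      service,
      ...context,
      timestamp: new Date().toISOString(),
    };
    try {
      await sink(entry); // e.g. insert into the `logs` table
    } catch (err) {
      // A broken log pipeline must never take the app down with it.
      console.error("log sink failed:", err);
    }
  };
}
```

The production sink is then a one-line wrapper over the database insert, and tests can pass an in-memory array instead.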

Log Levels: When to Use What

I see this done wrong constantly. People use console.log for everything, or console.error for things that aren't errors.

Here's the actual hierarchy:

ERROR - Something broke. User was affected. You need to fix this.
- Database query failed
- External API returned 500
- Payment processing failed
- Unhandled exception

WARN - Something concerning happened but was handled. You should investigate.
- Rate limit approaching
- Fallback behavior triggered
- Deprecated feature used
- Retry succeeded after failure

INFO - Significant business events. Normal operation milestones.
- User signed up
- Quote generated
- Email sent
- Deployment completed

DEBUG - Developer details. Only in development or temporarily in production.
- Function entry/exit
- Variable values
- Decision branch taken
- Cache hit/miss

TRACE - Extremely verbose. Never in production.
- Loop iterations
- Every HTTP header
- Full request/response bodies
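The hierarchy above can be enforced with a numeric ordering and a configurable floor, so DEBUG and TRACE are suppressed by configuration rather than by deleting log lines. A sketch; `shouldLog` is my name for it:

```typescript
// Sketch: log levels as an ordered list, plus a "floor" below which
// entries are dropped. Production floor: "info"; development: "debug".
const LEVELS = ["trace", "debug", "info", "warn", "error"] as const;
type Level = (typeof LEVELS)[number];

function shouldLog(level: Level, floor: Level): boolean {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(floor);
}

// e.g. const floor: Level = isProduction ? "info" : "debug";
```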

The ShoreAgents codebase was 90% DEBUG-level logs using ERROR severity. No wonder we couldn't find actual problems.

Request Tracing

Every request that enters ShoreAgents gets a trace ID. This ID propagates through every service call, database query, and external API request.

```typescript
// Middleware assigns trace ID
app.use((req, res, next) => {
  req.traceId = req.headers['x-trace-id'] || crypto.randomUUID();
  res.setHeader('x-trace-id', req.traceId);
  next();
});

// Every log includes it
logger.info('Processing request', {
  traceId: req.traceId,
  path: req.path,
  method: req.method
});
```
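Propagation to downstream calls is the part that's easy to forget: every outbound request has to carry the same header. A small sketch of the helper I'd use (`withTrace` is my name, not part of any library):

```typescript
// Sketch: merge the trace ID into outbound request headers so the
// downstream service logs under the same ID.
function withTrace(
  traceId: string,
  headers: Record<string, string> = {}
): Record<string, string> {
  return { ...headers, "x-trace-id": traceId };
}

// Usage on an outbound call:
// await fetch(url, { headers: withTrace(req.traceId, { accept: "application/json" }) });
```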

When something fails, I grab the trace ID from the error and search:

```sql
SELECT * FROM logs
WHERE context->>'traceId' = 'abc123'
ORDER BY timestamp;
```

I see the entire request flow. Where it started, what it touched, where it died.

The Maya Observability Rewrite

After the 3AM incident, I rewrote Maya's logging entirely. Here's what we capture now:

Session Start:
- Visitor ID, pages visited, referrer, UTM params
- Previous sessions, previous quotes
- Device info, location (if available)

Each Message:
- User input (sanitized)
- AI response
- Tools considered, tool selected
- Tool input, tool output
- Response time, token usage

Each Tool Call:
- Tool name, input parameters
- External API calls made
- Database queries executed
- Success/failure, duration
- Retry attempts if any

Session End:
- Total duration
- Messages exchanged
- Tools used
- Lead captured? Quote generated?
- Satisfaction (if feedback given)
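The session-end entry, for instance, falls out of counters kept on the session object. A sketch; the field names are illustrative, not Maya's actual schema:

```typescript
// Sketch: assemble the session-end log entry from per-session counters.
interface SessionStats {
  startedAt: number;       // epoch ms
  messages: number;
  toolsUsed: string[];
  leadCaptured: boolean;
  quoteGenerated: boolean;
}

function sessionEndEntry(sessionId: string, s: SessionStats) {
  return {
    level: "info",
    message: "Maya session ended",
    sessionId,
    duration_ms: Date.now() - s.startedAt,
    messages: s.messages,
    toolsUsed: s.toolsUsed,
    leadCaptured: s.leadCaptured,
    quoteGenerated: s.quoteGenerated,
    timestamp: new Date().toISOString(),
  };
}
```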

The entire Maya system now has more observability than the rest of ShoreAgents combined. Because Maya is customer-facing. Maya makes money. Maya needs to work.

Lessons from Production

Some things I learned the hard way:

1. Log at the boundaries. Every entry point (API route) and exit point (external service call) should log. The middle matters less.

2. Include correlation IDs everywhere. User ID, session ID, request ID. If you can't trace a problem to a specific user and session, your logs are decorative.

3. Don't log sensitive data. We learned this one fast. No passwords, no API keys, no full credit card numbers. Sanitize before logging.

4. Set up alerts, not just logs. Logs are for investigation. Alerts are for notification. If the ERROR rate spikes, I want a Telegram message, not to discover it tomorrow.

5. Retention policy matters. We keep ERROR logs for 90 days, INFO for 30 days, DEBUG for 7 days. Disk is cheap but not free.
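Lesson 3 in code: a recursive sanitizer that redacts known-sensitive keys before anything reaches the log sink. A sketch; extend the key set for your own data:

```typescript
// Sketch: redact sensitive keys anywhere in a nested log payload.
const SENSITIVE = new Set([
  "password", "apiKey", "api_key", "token", "authorization", "creditCard",
]);

function sanitize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(sanitize);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) =>
        SENSITIVE.has(k) ? [k, "[REDACTED]"] : [k, sanitize(v)]
      )
    );
  }
  return value; // primitives pass through unchanged
}
```

Run every context object through this before it is serialized, and the "we logged an API key" class of incident disappears.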

The Difference It Makes

Before the logging rewrite, debugging a production issue took hours. I'd grep through console output, try to reproduce locally, guess at what might have happened.

Now? Average time to identify root cause: 8 minutes.

I get an alert. I grab the trace ID. I query the logs. I see exactly what happened, with what data, in what sequence. I fix it. I move on.

The 3AM "Maya not responding" incident? With proper logging, it would have been obvious in 30 seconds: the Pricing Engine was timing out because Supabase was experiencing latency. We would have known before Stephen messaged me.

Logs are messages to your future self during an outage. Write them like you're explaining the situation to someone who's panicking at 3AM.

Because you will be.

logging · observability · debugging · backend