Logging in Production: A Practical Guide to Observability That Actually Works
Why most logging is useless, and how to build observability that helps you debug production issues in minutes instead of hours. From structured logs to distributed tracing, with real examples.
3 AM. My phone buzzes with a PagerDuty alert. The checkout flow is failing and customers cannot complete purchases. I open the logs. 47,000 log lines from the last hour. I scroll through pages of "Request received", "Processing...", "Done", "Request received", "Processing...", "Done".
Somewhere in that noise is the reason our revenue stream just broke. But finding it? That is like searching for a specific grain of sand on a beach.
Two hours later, after manually correlating timestamps across three services, I found it. A null reference in the payment service. The fix was one line. But the debugging? That cost me sleep, stress, and the trust of customers who were trying to give us money.
I swore that night: never again. Logging should help you debug, not hide the problem in a wall of noise.
This guide is everything I have learned since then about building observability that actually works.
The Detective Analogy
Imagine you are a detective investigating a crime. You have two options for evidence.
Option A: Scattered notes. Random scraps of paper all over the city. A witness statement here, a receipt there, a half-remembered conversation somewhere else. Each clue exists in isolation. You spend hours just collecting them, then more hours trying to figure out how they connect.
Option B: A complete timeline. Every event in chronological order. Receipts matched to the people who made them. Witness statements placed at the exact moment they happened. You can see the whole story unfold with a single glance.
Most production logging is Option A. Good observability is Option B.
The difference is not just convenience. It is the difference between solving a case in an hour versus never solving it at all.
Why Logging Actually Matters
Before we get into how to log, let us talk about why we bother. Because if you do not understand the value, you will not invest in doing it right.
Visibility Into the Black Box
Once your code deploys to production, it becomes a black box. You cannot step through it with a debugger. You cannot add print statements and redeploy when something breaks. Your logs are your only window into what is happening.
Without good logs, you are flying blind. With good logs, you can see exactly what happened, when it happened, and why.
Time Travel Debugging
Logs let you debug retroactively. When a bug report comes in at 2 PM about something that happened at 9 AM, you do not need to reproduce it. You can read the logs and see the exact sequence of events that led to the failure.
This is incredibly powerful. Instead of guessing, you can know.
Early Warning System
Good logs do not just help you debug failures. They help you spot problems before they become failures. A spike in error rates, unusual latency patterns, strange traffic patterns — all of these show up in logs before they show up in angry customer emails.
Compliance and Audit Trails
Sometimes you need logs for legal or compliance reasons. Audit trails of who accessed what data, when payments were processed, what changes were made to critical records. Good logging practices make compliance a side effect, not a chore.
Three Ways Logging Fails
I have seen logging fail in three predictable ways. Understanding them helps you avoid them.
The Silent Treatment
Your service is failing but your logs are empty. Or they contain only "Error occurred" with no context about what error, where, or why.
This usually happens when:
- Errors are swallowed by catch blocks that log nothing
- Logging is only enabled at INFO level in production
- Critical context is not captured because "it was working fine in dev"
The result: you know something is wrong but you have no information to fix it.
The Wall of Noise
The opposite problem. Millions of log lines that contain everything and nothing. DEBUG-level logs in production. Repeated status messages. Logs that tell you a request was received but not what happened to it.
The result: the signal is buried in noise. You know the information exists somewhere, but finding it takes hours.
The Cryptic Message
Logs that contain information, but in a format that is impossible to parse. Unstructured strings like "User action failed". Timestamps in different formats across services. Missing correlation IDs so you cannot trace a request across multiple services.
The result: the information is there but using it requires manual correlation and guesswork.
Structured Logging: The Foundation
Here is the single most important change you can make: stop logging strings, start logging structured data.
The Bad Way: String Logs
// Bad: Unstructured string logging
console.log(`User ${userId} failed to update profile`);
console.log("Payment failed");
console.log("Request took too long");
These logs are human-readable but machine-hostile. Try to:
- Find all failed payments in the last hour
- Calculate average response times
- Alert when error rate exceeds 5%
You cannot do any of these efficiently with string logs. You would need regex parsing, which is fragile and slow.
The Good Way: Structured JSON Logs
// Good: Structured JSON logging
logger.info({
event: "payment_processed",
userId: "usr_abc123",
orderId: "ord_xyz789",
amount: 99.99,
currency: "USD",
status: "success",
duration_ms: 245,
traceId: "trace_abc123xyz789"
});
logger.error({
event: "payment_failed",
userId: "usr_abc123",
orderId: "ord_xyz789",
amount: 99.99,
error: "card_declined",
errorCode: "INSUFFICIENT_FUNDS",
gatewayResponse: { code: "05", message: "Do not honor" },
duration_ms: 120,
traceId: "trace_abc123xyz789"
});
These logs contain the same information but in a queryable format. Now you can:
- Filter by event: "payment_failed"
- Group by errorCode to see the most common failures
- Calculate average duration_ms for successful vs failed payments
- Follow traceId across services
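To make the point concrete, here is a minimal sketch of those same queries run with plain code over hypothetical in-memory log objects. A real log aggregator does this at scale, but the principle is identical: structured fields are directly filterable and groupable.

```javascript
// Hypothetical sample of structured log objects
const logs = [
  { event: "payment_processed", errorCode: null, duration_ms: 245 },
  { event: "payment_failed", errorCode: "INSUFFICIENT_FUNDS", duration_ms: 120 },
  { event: "payment_failed", errorCode: "INSUFFICIENT_FUNDS", duration_ms: 98 },
  { event: "payment_failed", errorCode: "EXPIRED_CARD", duration_ms: 110 },
];

// Filter by event
const failures = logs.filter((l) => l.event === "payment_failed");

// Group by errorCode to find the most common failure
const byCode = {};
for (const l of failures) {
  byCode[l.errorCode] = (byCode[l.errorCode] || 0) + 1;
}

console.log(failures.length); // 3
console.log(byCode); // { INSUFFICIENT_FUNDS: 2, EXPIRED_CARD: 1 }
```

Try doing the same against `"User usr_abc123 failed to update profile"` strings: you are immediately writing regexes that break the first time someone rewords the message.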
What to Include in Every Log
Think of logs like a news story. Every log should answer:
- Who: userId, sessionId, traceId
- What: event name, action performed
- Where: service name, environment, host
- When: timestamp (always use UTC)
- Why (for errors): error message, stack trace, context
Here is a template I use:
const baseLog = {
timestamp: new Date().toISOString(),
service: "payment-service",
environment: process.env.NODE_ENV,
host: os.hostname(),
traceId: context.traceId,
userId: context.userId,
event: "payment_initiated"
};
logger.info({
...baseLog,
orderId: order.id,
amount: order.amount,
paymentMethod: order.paymentMethod
});
Distributed Tracing: Following the Trail
In a monolith, a single request stays in one codebase. In microservices, a single user action might touch five different services. How do you follow the story across all of them?
The Problem
User → API Gateway → Auth Service → Order Service → Payment Service → Inventory Service
Each service logs independently. Without a way to connect them, you are back to the detective with scattered notes.
The Solution: Trace IDs
A trace ID is a unique identifier that flows through the entire request chain. Every service that handles the request includes the same trace ID in its logs.
// API Gateway generates or receives trace ID
const traceId = req.headers['x-trace-id'] || generateTraceId();
// Pass it to all downstream services
const authResponse = await fetch('https://auth.internal/verify', {
headers: { 'x-trace-id': traceId }
});
const orderResponse = await fetch('https://orders.internal/create', {
headers: { 'x-trace-id': traceId }
});
// Log with trace ID
logger.info({
event: "order_created",
traceId: traceId,
userId: user.id,
orderId: order.id
});
Now, when you search logs for traceId: "abc123xyz789", you get the complete story from gateway to database.
Spans and Timing
Traces can also capture timing. Each operation within a trace is a "span" with a start and end time. This shows you where time is being spent.
// Using OpenTelemetry for tracing
const span = tracer.startSpan('process-payment');
try {
span.setAttribute('payment.amount', amount);
span.setAttribute('payment.method', method);
const result = await processPayment(amount, method);
span.setAttribute('payment.status', 'success');
span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
span.setAttribute('payment.status', 'failed');
span.setAttribute('error.message', error.message);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message
});
throw error;
} finally {
span.end();
}
This creates a timeline showing exactly which operation took how long. Suddenly "the request is slow" becomes "the payment gateway call takes 2.5 seconds".
Log Levels: Knowing What to Log When
Not all information is equally important. Log levels help you filter noise from signal.
The Standard Levels
| Level | When to Use | Production Setting |
|---|---|---|
| DEBUG | Detailed diagnostic info | Usually off |
| INFO | Normal operation events | On |
| WARN | Something unexpected but handled | On |
| ERROR | Failures that affect operation | On |
| FATAL | System cannot continue | On |
DEBUG: The Development Helper
logger.debug({
event: "cache_lookup",
key: cacheKey,
hit: false,
ttl_seconds: 300
});
DEBUG logs are for development and deep troubleshooting. They tell you exactly what is happening step by step. In production, these are usually disabled to avoid noise. When you need them, you can temporarily enable DEBUG for a specific service or user.
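One way to get that per-user DEBUG enablement is an allowlist checked at log time. This is a minimal sketch with a plain in-memory Set; in practice the allowlist would live in a config service or feature-flag system so you can toggle it without a deploy:

```javascript
// Temporarily enabled users (hypothetical; would come from config/flags)
const debugUsers = new Set(["usr_abc123"]);

function shouldLogDebug(userId) {
  if (process.env.LOG_LEVEL === "debug") return true; // global override
  return debugUsers.has(userId);
}

function logDebug(fields) {
  if (!shouldLogDebug(fields.userId)) return null; // dropped in production
  const line = { level: "debug", ...fields };
  console.log(JSON.stringify(line));
  return line;
}
```

Everyone else's DEBUG calls cost almost nothing, while the one user you are investigating gets full step-by-step output.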
INFO: The Story of Normal Operations
logger.info({
event: "user_login",
userId: "usr_abc123",
method: "password",
ip: "203.0.113.42",
userAgent: "Mozilla/5.0..."
});
INFO logs tell the story of what is happening in your system. User logins, orders placed, emails sent. These should be relatively sparse — one log per significant event, not per function call.
WARN: The Heads-Up
logger.warn({
event: "payment_retry",
orderId: "ord_xyz789",
attempt: 2,
maxAttempts: 3,
previousError: "gateway_timeout",
nextRetryIn_ms: 5000
});
WARN means something unexpected happened, but the system handled it. A retry was triggered, a deprecated API was used, a resource is running low. These are not failures yet, but they might become failures if trends continue.
ERROR: Something Broke
logger.error({
event: "payment_failed",
orderId: "ord_xyz789",
error: "database_connection_timeout",
errorCode: "DB_CONN_TIMEOUT",
stack: error.stack,
context: {
query: "UPDATE orders SET status = ?",
params: ["paid"]
}
});
ERROR means something failed that should not have failed. A database connection dropped, an external API returned 500, a required field was null. These need human attention.
Key rule: every ERROR log should be actionable. If you cannot do anything about it, it is probably a WARN or INFO.
FATAL: Stop Everything
logger.fatal({
event: "database_unavailable",
error: "Cannot connect to primary database",
attempts: 5,
action: "shutting_down"
});
FATAL means the system cannot continue. Database is down, required configuration is missing, critical dependencies are unavailable. These usually trigger immediate alerts and process restarts.
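A sketch of what that FATAL path can look like: log the structured context, then exit non-zero so the supervisor (systemd, Kubernetes, etc.) restarts the process. The exit function is injectable here purely so the behavior is testable; a real logger would also flush any async buffer before exiting:

```javascript
function fatal(fields, exitProcess = (code) => process.exit(code)) {
  const line = {
    level: "fatal",
    timestamp: new Date().toISOString(),
    ...fields,
  };
  console.log(JSON.stringify(line));
  exitProcess(1); // non-zero exit signals the supervisor to restart
  return line;
}
```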
What to Log: Practical Guidelines
Now that we know how to log, let us talk about what deserves to be logged.
Request and Response Logging
Log the entry and exit of significant operations:
// Entry log
logger.info({
event: "request_started",
method: req.method,
path: req.path,
userId: req.user?.id,
traceId: req.traceId,
query: req.query, // Be careful with sensitive data
bodySize: req.body ? JSON.stringify(req.body).length : 0
});
// ... process request ...
// Exit log
logger.info({
event: "request_completed",
method: req.method,
path: req.path,
statusCode: res.statusCode,
duration_ms: Date.now() - startTime,
traceId: req.traceId
});
But never log sensitive data like passwords, credit card numbers, or PII without hashing or masking:
// Bad
logger.info({ body: req.body }); // Might contain passwords!
// Good
logger.info({
body: {
...req.body,
password: '[REDACTED]',
creditCard: maskCard(req.body.creditCard) // Shows ****-****-****-1234
}
});
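The maskCard helper above is not defined anywhere; here is one possible implementation, a sketch that keeps only the last four digits regardless of how the input is formatted:

```javascript
// Hypothetical masking helper: keep only the last four digits
function maskCard(cardNumber) {
  if (!cardNumber) return cardNumber;
  const digits = String(cardNumber).replace(/\D/g, ""); // strip spaces/dashes
  const last4 = digits.slice(-4);
  return `****-****-****-${last4}`;
}

console.log(maskCard("4242 4242 4242 4242")); // ****-****-****-4242
```

The same shape works for emails, phone numbers, or API keys: normalize, keep the minimum needed for debugging, mask the rest.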
Business Events
Log things that matter to your business:
logger.info({
event: "order_placed",
orderId: order.id,
userId: user.id,
items: order.items.map(i => ({
sku: i.sku,
quantity: i.quantity,
price: i.price
})),
total: order.total,
currency: order.currency,
paymentMethod: order.paymentMethod
});
logger.info({
event: "user_signup_completed",
userId: user.id,
signupSource: attribution.source,
utm_campaign: attribution.campaign,
timeToComplete_ms: Date.now() - session.startTime
});
These logs become business intelligence. You can analyze conversion funnels, track feature usage, measure performance of different payment methods.
System Health
Log operational metrics:
logger.info({
event: "health_check",
status: "healthy",
checks: {
database: { status: "up", latency_ms: 12 },
redis: { status: "up", latency_ms: 3 },
external_api: { status: "up", latency_ms: 145 }
},
memoryUsage: process.memoryUsage(),
uptime_seconds: process.uptime()
});
Security Events
Log security-relevant actions:
logger.warn({
event: "login_failed",
userId: attemptedUserId,
reason: "invalid_password",
ip: req.ip,
userAgent: req.headers['user-agent'],
attemptNumber: await getFailedAttempts(attemptedUserId)
});
logger.info({
event: "permission_denied",
userId: user.id,
attemptedAction: "delete_order",
targetOrderId: orderId,
requiredRole: "admin",
actualRole: user.role
});
Debugging Workflow: Using Logs Effectively
Having good logs is step one. Knowing how to use them is step two.
The Standard Debugging Flow
When an alert fires or a bug is reported, here is my workflow:
1. Identify the timeframe
-- In your log aggregator (Datadog, Splunk, etc.)
timestamp > "2026-04-04T08:00:00Z" AND timestamp < "2026-04-04T09:00:00Z"
2. Find the error
level:ERROR AND service:payment-service
3. Get the trace ID
From the error log, extract the traceId.
4. Follow the trace
traceId:"abc123xyz789"
This shows the complete request flow across all services.
5. Look for patterns
errorCode:"DB_CONN_TIMEOUT" AND timestamp > "2026-04-04T08:00:00Z"
Is this affecting multiple users or just one? Is the error rate increasing?
6. Check context
Look at logs immediately before the error:
traceId:"abc123xyz789" AND timestamp < error_timestamp
What was happening right before the failure?
Useful Queries to Bookmark
Here are queries I keep saved in my log aggregator:
Error rate by service:
level:ERROR | stats count by service
Slowest endpoints:
event:request_completed | stats avg(duration_ms) by path
Failed payments by reason:
event:payment_failed | stats count by errorCode
Users with most errors:
level:ERROR | stats count by userId
Quick Reference Checklist
Here is a checklist you can use to audit your current logging or set up new services:
Structure
- Logs are structured (JSON) not strings
- Every log includes timestamp, service name, and environment
- Error logs include stack traces
- Logs have consistent field names across services
Context
- Every request has a trace ID
- Trace IDs flow through all service calls
- User IDs are included where relevant
- Correlation IDs link related operations
Content
- Entry and exit of significant operations are logged
- Business events are captured
- Errors include actionable context
- Sensitive data is masked or redacted
- Log levels are appropriate (no DEBUG spam in production)
Usability
- Logs can be searched by trace ID
- Common queries are saved/bookmarked
- Alerts are set up for ERROR and FATAL logs
- Dashboards show key metrics derived from logs
Performance
- Logging does not block request handling
- High-volume logs use async/batched writes
- Log retention policy is defined
- Log storage costs are monitored
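The async/batched-writes item deserves a sketch. Libraries like pino or winston handle this for you, but the core idea is small: log calls push onto an in-memory buffer, and a timer (or a size threshold) flushes the batch in one write instead of N. The sink is injectable here so the behavior is testable:

```javascript
class BatchLogger {
  constructor(sink, flushEveryMs = 1000, maxBatch = 100) {
    this.sink = sink; // e.g. a stdout write or an HTTP log shipper
    this.buffer = [];
    this.maxBatch = maxBatch;
    this.timer = setInterval(() => this.flush(), flushEveryMs);
    this.timer.unref?.(); // do not keep the process alive just for logging
  }

  log(fields) {
    this.buffer.push(JSON.stringify(fields));
    if (this.buffer.length >= this.maxBatch) this.flush();
  }

  flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.sink(batch.join("\n") + "\n"); // one write instead of N
  }
}
```

The request handler only ever pays for a push onto an array; the actual I/O happens off the hot path.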
Key Takeaways
- Logs are your production debugger. Once code ships, they are your only window into what is happening.
- Structure beats strings. JSON logs can be queried, aggregated, and analyzed. String logs cannot.
- Context is everything. A log without a trace ID, user ID, and timestamp is nearly useless.
- Levels filter noise. DEBUG for development, INFO for the story, WARN for heads-up, ERROR for action.
- Trace IDs connect the dots. In distributed systems, tracing is not optional. It is the difference between debugging in minutes versus never.
- Log events, not functions. "User placed order", not "validateUser() was called".
- Make errors actionable. If you cannot do anything about it, do not log it as an error.
- Protect sensitive data. One leaked password in logs is a security incident.
Good logging does not just help you debug faster. It changes debugging from an art of guesswork into a science of investigation. From 3 AM scrolling through noise to a five-minute trace analysis.
That 3 AM debugging session I mentioned at the start? With proper logging, it would have been a 5-minute fix. The trace ID would have led me directly to the payment service. The structured error log would have shown the null reference immediately. The context would have told me exactly which order and which user was affected.
That is the power of observability done right. It turns chaos into clarity.
Further Reading
- The Twelve-Factor App: Logs - Treat logs as event streams
- Google Cloud Logging Best Practices - Structured logging guidelines
- OpenTelemetry Documentation - Industry standard for tracing
- Datadog Logging Best Practices - Practical logging advice
- Splunk Logging Best Practices - Enterprise logging patterns