Logging in Production: A Practical Guide to Observability That Actually Works
Why most logging is useless, and how to build observability that helps you debug production issues in minutes instead of hours. From structured logs to distributed tracing, with real examples.
3 AM. My phone buzzes with a PagerDuty alert. The checkout flow is failing and customers cannot complete purchases. I open the logs. 47,000 log lines from the last hour. I scroll through pages of "Request received", "Processing...", "Done", "Request received", "Processing...", "Done".
Somewhere in that noise is the reason our revenue stream just broke. But finding it? That is like searching for a specific grain of sand on a beach.
Two hours later, after manually correlating timestamps across three services, I found it. A null reference in the payment service. The fix was one line. But the debugging? That cost me sleep, stress, and the trust of customers who were trying to give us money.
I swore that night: never again. Logging should help you debug, not hide the problem in a wall of noise.
This guide is everything I have learned since then about building observability that actually works.
The Detective Analogy
Imagine you are a detective investigating a crime. You have two options for evidence.
Option A: Scattered notes. Random scraps of paper all over the city. A witness statement here, a receipt there, a half-remembered conversation somewhere else. Each clue exists in isolation. You spend hours just collecting them, then more hours trying to figure out how they connect.
Option B: A complete timeline. Every event in chronological order. Receipts matched to the people who made them. Witness statements placed at the exact moment they happened. You can see the whole story unfold with a single glance.
Most production logging is Option A. Good observability is Option B.
The difference is not just convenience. It is the difference between solving a case in an hour versus never solving it at all.
Why Logging Actually Matters
Before we get into how to log, let us talk about why we bother. Because if you do not understand the value, you will not invest in doing it right.
Visibility Into the Black Box
Once your code deploys to production, it becomes a black box. You cannot step through it with a debugger. You cannot add print statements and redeploy when something breaks. Your logs are your only window into what is happening.
Without good logs, you are flying blind. With good logs, you can see exactly what happened, when it happened, and why.
Time Travel Debugging
Logs let you debug retroactively. When a bug report comes in at 2 PM about something that happened at 9 AM, you do not need to reproduce it. You can read the logs and see the exact sequence of events that led to the failure.
This is incredibly powerful. Instead of guessing, you can know.
Early Warning System
Good logs do not just help you debug failures. They help you spot problems before they become failures. A spike in error rates, unusual latency patterns, strange traffic patterns — all of these show up in logs before they show up in angry customer emails.
Compliance and Audit Trails
Sometimes you need logs for legal or compliance reasons. Audit trails of who accessed what data, when payments were processed, what changes were made to critical records. Good logging practices make compliance a side effect, not a chore.
Three Ways Logging Fails
I have seen logging fail in three predictable ways. Understanding them helps you avoid them.
The Silent Treatment
Your service is failing but your logs are empty. Or they contain only "Error occurred" with no context about what error, where, or why.
This usually happens when:
- Errors are swallowed by catch blocks that log nothing
- Logging is only enabled at INFO level in production
- Critical context is not captured because "it was working fine in dev"
The result: you know something is wrong but you have no information to fix it.
The Wall of Noise
The opposite problem. Millions of log lines that contain everything and nothing. DEBUG-level logs in production. Repeated status messages. Logs that tell you a request was received but not what happened to it.
The result: the signal is buried in noise. You know the information exists somewhere, but finding it takes hours.
The Cryptic Message
Logs that contain information, but in a format that is impossible to parse. Unstructured strings like "User action failed". Timestamps in different formats across services. Missing correlation IDs so you cannot trace a request across multiple services.
The result: the information is there but using it requires manual correlation and guesswork.
Structured Logging: The Foundation
Here is the single most important change you can make: stop logging strings, start logging structured data.
The Bad Way: String Logs
// Bad: Unstructured string logging
console.log(`User ${userId} failed to update profile`);
console.log("Payment failed");
console.log("Request took too long");
These logs are human-readable but machine-hostile. Try to:
- Find all failed payments in the last hour
- Calculate average response times
- Alert when error rate exceeds 5%
You cannot do any of these efficiently with string logs. You would need regex parsing, which is fragile and slow.
The Good Way: Structured JSON Logs
// Good: Structured JSON logging
logger.info({
event: "payment_processed",
userId: "usr_abc123",
orderId: "ord_xyz789",
amount: 99.99,
currency: "USD",
status: "success",
duration_ms: 245,
traceId: "trace_abc123xyz789"
});
logger.error({
event: "payment_failed",
userId: "usr_abc123",
orderId: "ord_xyz789",
amount: 99.99,
error: "card_declined",
errorCode: "INSUFFICIENT_FUNDS",
gatewayResponse: { code: "05", message: "Do not honor" },
duration_ms: 120,
traceId: "trace_abc123xyz789"
});
These logs contain the same information but in a queryable format. Now you can:
- Filter by event: "payment_failed"
- Group by errorCode to see the most common failures
- Calculate average duration_ms for successful vs failed payments
- Follow traceId across services
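To make the point concrete, here is a minimal sketch of those same queries run with plain code over hypothetical in-memory log objects. A real log aggregator does this at scale, but the principle is identical: structured fields are directly filterable and groupable.

```javascript
// Hypothetical sample of structured log objects
const logs = [
  { event: "payment_processed", errorCode: null, duration_ms: 245 },
  { event: "payment_failed", errorCode: "INSUFFICIENT_FUNDS", duration_ms: 120 },
  { event: "payment_failed", errorCode: "INSUFFICIENT_FUNDS", duration_ms: 98 },
  { event: "payment_failed", errorCode: "EXPIRED_CARD", duration_ms: 110 },
];

// Filter by event
const failures = logs.filter((l) => l.event === "payment_failed");

// Group by errorCode to find the most common failure
const byCode = {};
for (const l of failures) {
  byCode[l.errorCode] = (byCode[l.errorCode] || 0) + 1;
}

console.log(failures.length); // 3
console.log(byCode); // { INSUFFICIENT_FUNDS: 2, EXPIRED_CARD: 1 }
```

Try doing the same against `"User usr_abc123 failed to update profile"` strings: you are immediately writing regexes that break the first time someone rewords the message.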
What to Include in Every Log
Think of logs like a news story. Every log should answer:
- Who: userId, sessionId, traceId
- What: event name, action performed
- Where: service name, environment, host
- When: timestamp (always use UTC)
- Why (for errors): error message, stack trace, context
Here is a template I use:
const baseLog = {
timestamp: new Date().toISOString(),
service: "payment-service",
environment: process.env.NODE_ENV,
host: os.hostname(),
traceId: context.traceId,
userId: context.userId,
event: "payment_initiated"
};
logger.info({
...baseLog,
orderId: order.id,
amount: order.amount,
paymentMethod: order.paymentMethod
});
Distributed Tracing: Following the Trail
In a monolith, a single request stays in one codebase. In microservices, a single user action might touch five different services. How do you follow the story across all of them?
The Problem
User → API Gateway → Auth Service → Order Service → Payment Service → Inventory Service
Each service logs independently. Without a way to connect them, you are back to the detective with scattered notes.
The Solution: Trace IDs
A trace ID is a unique identifier that flows through the entire request chain. Every service that handles the request includes the same trace ID in its logs.
// API Gateway generates or receives trace ID
const traceId = req.headers['x-trace-id'] || generateTraceId();
// Pass it to all downstream services
const authResponse = await fetch('https://auth.internal/verify', {
headers: { 'x-trace-id': traceId }
});
const orderResponse = await fetch('https://orders.internal/create', {
headers: { 'x-trace-id': traceId }
});
// Log with trace ID
logger.info({
event: "order_created",
traceId: traceId,
userId: user.id,
orderId: order.id
});
Now, when you search logs for traceId: "abc123xyz789", you get the complete story from gateway to database.
Spans and Timing
Traces can also capture timing. Each operation within a trace is a "span" with a start and end time. This shows you where time is being spent.
// Using OpenTelemetry for tracing
const span = tracer.startSpan('process-payment');
try {
span.setAttribute('payment.amount', amount);
span.setAttribute('payment.method', method);
const result = await processPayment(amount, method);
span.setAttribute('payment.status', 'success');
span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
span.setAttribute('payment.status', 'failed');
span.setAttribute('error.message', error.message);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message
});
throw error;
} finally {
span.end();
}
This creates a timeline showing exactly which operation took how long. Suddenly "the request is slow" becomes "the payment gateway call takes 2.5 seconds".
Log Levels: Knowing What to Log When
Not all information is equally important. Log levels help you filter noise from signal.
The Standard Levels
| Level | When to Use | Production Setting |
|---|---|---|
| DEBUG | Detailed diagnostic info | Usually off |
| INFO | Normal operation events | On |
| WARN | Something unexpected but handled | On |
| ERROR | Failures that affect operation | On |
| FATAL | System cannot continue | On |
DEBUG: The Development Helper
logger.debug({
event: "cache_lookup",
key: cacheKey,
hit: false,
ttl_seconds: 300
});
DEBUG logs are for development and deep troubleshooting. They tell you exactly what is happening step by step. In production, these are usually disabled to avoid noise. When you need them, you can temporarily enable DEBUG for a specific service or user.
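One way to get that per-user DEBUG enablement is an allowlist checked at log time. This is a minimal sketch with a plain in-memory Set; in practice the allowlist would live in a config service or feature-flag system so you can toggle it without a deploy:

```javascript
// Temporarily enabled users (hypothetical; would come from config/flags)
const debugUsers = new Set(["usr_abc123"]);

function shouldLogDebug(userId) {
  if (process.env.LOG_LEVEL === "debug") return true; // global override
  return debugUsers.has(userId);
}

function logDebug(fields) {
  if (!shouldLogDebug(fields.userId)) return null; // dropped in production
  const line = { level: "debug", ...fields };
  console.log(JSON.stringify(line));
  return line;
}
```

Everyone else's DEBUG calls cost almost nothing, while the one user you are investigating gets full step-by-step output.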
INFO: The Story of Normal Operations
logger.info({
event: "user_login",
userId: "usr_abc123",
method: "password",
ip: "203.0.113.42",
userAgent: "Mozilla/5.0..."
});
INFO logs tell the story of what is happening in your system. User logins, orders placed, emails sent. These should be relatively sparse — one log per significant event, not per function call.
WARN: The Heads-Up
logger.warn({
event: "payment_retry",
orderId: "ord_xyz789",
attempt: 2,
maxAttempts: 3,
previousError: "gateway_timeout",
nextRetryIn_ms: 5000
});
WARN means something unexpected happened, but the system handled it. A retry was triggered, a deprecated API was used, a resource is running low. These are not failures yet, but they might become failures if trends continue.
ERROR: Something Broke
logger.error({
event: "payment_failed",
orderId: "ord_xyz789",
error: "database_connection_timeout",
errorCode: "DB_CONN_TIMEOUT",
stack: error.stack,
context: {
query: "UPDATE orders SET status = ?",
params: ["paid"]
}
});
ERROR means something failed that should not have failed. A database connection dropped, an external API returned 500, a required field was null. These need human attention.
Key rule: every ERROR log should be actionable. If you cannot do anything about it, it is probably a WARN or INFO.
FATAL: Stop Everything
logger.fatal({
event: "database_unavailable",
error: "Cannot connect to primary database",
attempts: 5,
action: "shutting_down"
});
FATAL means the system cannot continue. Database is down, required configuration is missing, critical dependencies are unavailable. These usually trigger immediate alerts and process restarts.
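A sketch of what that FATAL path can look like: log the structured context, then exit non-zero so the supervisor (systemd, Kubernetes, etc.) restarts the process. The exit function is injectable here purely so the behavior is testable; a real logger would also flush any async buffer before exiting:

```javascript
function fatal(fields, exitProcess = (code) => process.exit(code)) {
  const line = {
    level: "fatal",
    timestamp: new Date().toISOString(),
    ...fields,
  };
  console.log(JSON.stringify(line));
  exitProcess(1); // non-zero exit signals the supervisor to restart
  return line;
}
```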
What to Log: Practical Guidelines
Now that we know how to log, let us talk about what deserves to be logged.
Request and Response Logging
Log the entry and exit of significant operations:
// Entry log
logger.info({
event: "request_started",
method: req.method,
path: req.path,
userId: req.user?.id,
traceId: req.traceId,
query: req.query, // Be careful with sensitive data
bodySize: req.body ? JSON.stringify(req.body).length : 0
});
// ... process request ...
// Exit log
logger.info({
event: "request_completed",
method: req.method,
path: req.path,
statusCode: res.statusCode,
duration_ms: Date.now() - startTime,
traceId: req.traceId
});
But never log sensitive data like passwords, credit card numbers, or PII without hashing or masking:
// Bad
logger.info({ body: req.body }); // Might contain passwords!
// Good
logger.info({
body: {
...req.body,
password: '[REDACTED]',
creditCard: maskCard(req.body.creditCard) // Shows ****-****-****-1234
}
});
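The maskCard helper above is not defined anywhere; here is one possible implementation, a sketch that keeps only the last four digits regardless of how the input is formatted:

```javascript
// Hypothetical masking helper: keep only the last four digits
function maskCard(cardNumber) {
  if (!cardNumber) return cardNumber;
  const digits = String(cardNumber).replace(/\D/g, ""); // strip spaces/dashes
  const last4 = digits.slice(-4);
  return `****-****-****-${last4}`;
}

console.log(maskCard("4242 4242 4242 4242")); // ****-****-****-4242
```

The same shape works for emails, phone numbers, or API keys: normalize, keep the minimum needed for debugging, mask the rest.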
Business Events
Log things that matter to your business:
logger.info({
event: "order_placed",
orderId: order.id,
userId: user.id,
items: order.items.map(i => ({
sku: i.sku,
quantity: i.quantity,
price: i.price
})),
total: order.total,
currency: order.currency,
paymentMethod: order.paymentMethod
});
logger.info({
event: "user_signup_completed",
userId: user.id,
signupSource: attribution.source,
utm_campaign: attribution.campaign,
timeToComplete_ms: Date.now() - session.startTime
});
These logs become business intelligence. You can analyze conversion funnels, track feature usage, measure performance of different payment methods.
System Health
Log operational metrics:
logger.info({
event: "health_check",
status: "healthy",
checks: {
database: { status: "up", latency_ms: 12 },
redis: { status: "up", latency_ms: 3 },
external_api: { status: "up", latency_ms: 145 }
},
memoryUsage: process.memoryUsage(),
uptime_seconds: process.uptime()
});
Security Events
Log security-relevant actions:
logger.warn({
event: "login_failed",
userId: attemptedUserId,
reason: "invalid_password",
ip: req.ip,
userAgent: req.headers['user-agent'],
attemptNumber: await getFailedAttempts(attemptedUserId)
});
logger.info({
event: "permission_denied",
userId: user.id,
attemptedAction: "delete_order",
targetOrderId: orderId,
requiredRole: "admin",
actualRole: user.role
});
Debugging Workflow: Using Logs Effectively
Having good logs is step one. Knowing how to use them is step two.
The Standard Debugging Flow
When an alert fires or a bug is reported, here is my workflow:
1. Identify the timeframe
-- In your log aggregator (Datadog, Splunk, etc.)
timestamp > "2026-04-04T08:00:00Z" AND timestamp < "2026-04-04T09:00:00Z"
2. Find the error
level:ERROR AND service:payment-service
3. Get the trace ID
From the error log, extract the traceId.
4. Follow the trace
traceId:"abc123xyz789"
This shows the complete request flow across all services.
5. Look for patterns
errorCode:"DB_CONN_TIMEOUT" AND timestamp > "2026-04-04T08:00:00Z"
Is this affecting multiple users or just one? Is the error rate increasing?
6. Check context
Look at logs immediately before the error:
traceId:"abc123xyz789" AND timestamp < error_timestamp
What was happening right before the failure?
Useful Queries to Bookmark
Here are queries I keep saved in my log aggregator:
Error rate by service:
level:ERROR | stats count by service
Slowest endpoints:
event:request_completed | stats avg(duration_ms) by path
Failed payments by reason:
event:payment_failed | stats count by errorCode
Users with most errors:
level:ERROR | stats count by userId
Quick Reference Checklist
Here is a checklist you can use to audit your current logging or set up new services:
Structure
- Logs are structured (JSON) not strings
- Every log includes timestamp, service name, and environment
- Error logs include stack traces
- Logs have consistent field names across services
Context
- Every request has a trace ID
- Trace IDs flow through all service calls
- User IDs are included where relevant
- Correlation IDs link related operations
Content
- Entry and exit of significant operations are logged
- Business events are captured
- Errors include actionable context
- Sensitive data is masked or redacted
- Log levels are appropriate (no DEBUG spam in production)
Usability
- Logs can be searched by trace ID
- Common queries are saved/bookmarked
- Alerts are set up for ERROR and FATAL logs
- Dashboards show key metrics derived from logs
Performance
- Logging does not block request handling
- High-volume logs use async/batched writes
- Log retention policy is defined
- Log storage costs are monitored
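The async/batched-writes item deserves a sketch. Libraries like pino or winston handle this for you, but the core idea is small: log calls push onto an in-memory buffer, and a timer (or a size threshold) flushes the batch in one write instead of N. The sink is injectable here so the behavior is testable:

```javascript
class BatchLogger {
  constructor(sink, flushEveryMs = 1000, maxBatch = 100) {
    this.sink = sink; // e.g. a stdout write or an HTTP log shipper
    this.buffer = [];
    this.maxBatch = maxBatch;
    this.timer = setInterval(() => this.flush(), flushEveryMs);
    this.timer.unref?.(); // do not keep the process alive just for logging
  }

  log(fields) {
    this.buffer.push(JSON.stringify(fields));
    if (this.buffer.length >= this.maxBatch) this.flush();
  }

  flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.sink(batch.join("\n") + "\n"); // one write instead of N
  }
}
```

The request handler only ever pays for a push onto an array; the actual I/O happens off the hot path.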
Key Takeaways
- Logs are your production debugger. Once code ships, they are your only window into what is happening.
- Structure beats strings. JSON logs can be queried, aggregated, and analyzed. String logs cannot.
- Context is everything. A log without a trace ID, user ID, and timestamp is nearly useless.
- Levels filter noise. DEBUG for development, INFO for the story, WARN for heads-up, ERROR for action.
- Trace IDs connect the dots. In distributed systems, tracing is not optional. It is the difference between debugging in minutes versus never.
- Log events, not functions. "User placed order", not "validateUser() was called".
- Make errors actionable. If you cannot do anything about it, do not log it as an error.
- Protect sensitive data. One leaked password in logs is a security incident.
Good logging does not just help you debug faster. It changes debugging from an art of guesswork into a science of investigation. From 3 AM scrolling through noise to a five-minute trace analysis.
That 3 AM debugging session I mentioned at the start? With proper logging, it would have been a 5-minute fix. The trace ID would have led me directly to the payment service. The structured error log would have shown the null reference immediately. The context would have told me exactly which order and which user was affected.
That is the power of observability done right. It turns chaos into clarity.
Further Reading
- The Twelve-Factor App: Logs - Treat logs as event streams
- Google Cloud Logging Best Practices - Structured logging guidelines
- OpenTelemetry Documentation - Industry standard for tracing
- Datadog Logging Best Practices - Practical logging advice
- Splunk Logging Best Practices - Enterprise logging patterns