
Still Relying on Restart Magic for Troubleshooting? Master Cloud-Native Monitoring with LGTM Stack

3/11/2026

In today's world where distributed microservice architectures have become mainstream, system complexity is growing exponentially. When online issues occur, are you still relying on "restart magic" and "log digging" to locate problems? When your service count grows from 10 to 100, are you still using traditional monitoring methods? It's time to build a modern observability framework.


I. Monitoring Challenges and Solutions in the Cloud-Native Era

1.1 Limitations of Traditional Monitoring

Remember the monolithic application era? One application, one database, one log file. When something went wrong, you could locate the issue just by checking the logs. But what about now?

Real-world scenario:

  • User reports "device list loading is slow"
  • You check Gateway Service logs → Normal
  • You check Device Service logs → A bit slow but not obvious
  • You check database logs → Connection pool normal
  • You check Redis logs → No problem either
  • Finally discovered: Kafka consumer lag caused Data Service slow response, affecting the entire call chain

This is the black box dilemma of distributed systems: a single request may span 5-10 services, and traditional monitoring methods simply cannot track the complete call chain.

1.2 What is Observability?

Observability refers to the ability to infer a system's internal state through its external outputs. Unlike traditional monitoring, observability emphasizes proactive issue detection and rapid root cause localization.

Observability is built on three pillars:

| Pillar | Purpose | Problem Solved | Typical Tools |
|---|---|---|---|
| Metrics | Overall view of system status | "How is the system doing now?" | Prometheus + Grafana |
| Logs | Detailed event records | "What exactly happened?" | Loki / ELK |
| Traces | Complete call chain of requests | "Which services did the request go through? Where was time spent?" | Tempo / Jaeger / SkyWalking |

How do the three pillars work together?

  1. Metrics discover issues: Dashboard reveals "device list API P95 latency spiked to 5 seconds"
  2. Traces locate bottlenecks: Distributed tracing reveals "80% of time spent on database query"
  3. Logs show details: Logs reveal "database connection pool wait alert, connections exhausted"

II. LGTM Stack: The De Facto Standard for Cloud-Native Monitoring

2.1 What is LGTM Stack?

LGTM Stack is currently the most popular open-source observability solution, composed of four core components:

  • Loki - Log aggregation system
  • Grafana - Unified visualization platform
  • Tempo - Distributed tracing
  • Prometheus (M) - Metric collection and storage

Note: Although the acronym is LGTM, components are typically learned and used in the order: Prometheus → Loki → Tempo → Grafana.

2.2 Why Choose LGTM Stack?

| Aspect | LGTM Stack | Traditional Solutions (e.g., ELK + Zipkin) |
|---|---|---|
| Cost | Low (resource consumption only 10-20% of ELK) | High |
| Deployment Complexity | Low (cloud-native design) | High |
| Learning Curve | Gentle (unified query style) | Steep (multiple query languages) |
| Integration | High (Grafana unified visualization) | Low (multiple independent systems) |
| Use Cases | Cloud-native, Kubernetes, microservices | Traditional architectures |

III. Deep Dive into LGTM Components

3.1 Prometheus: Metric Collection and Storage

Core Function

Prometheus is an open-source system monitoring and alerting tool, primarily responsible for collecting, storing, and querying time-series metric data.

How It Works

┌─────────────────┐
│   Application   │ Exposes /metrics endpoint
│  (Spring Boot)  │
└────────┬────────┘
         │
         ▼ Pull mode (every 15 seconds)
┌─────────────────┐
│   Prometheus    │ Collects and stores metrics
└────────┬────────┘
         │
         ▼ PromQL query
┌─────────────────┐
│     Grafana     │ Visualization
└─────────────────┘

Key Features

  1. Pull-based collection: Prometheus actively pulls metrics from target services (default 15 seconds)
  2. Multi-dimensional data model: Flexible querying through labels
    # Example: Query request rate grouped by service and status code
    http_requests_total{method="GET", status="200"}
    
  3. PromQL query language: Powerful time-series data querying capability
  4. Alert management: Rule-based alerting and notification
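The /metrics endpoint in the diagram is just plain text in the Prometheus exposition format. As a rough illustration of what Prometheus sees when it scrapes, here is a hand-rolled sketch (a real application would use a client library such as prometheus_client or Micrometer rather than formatting this by hand):

```python
def render_metrics(requests_total, memory_bytes):
    """Render two metrics in the Prometheus text exposition format (sketch)."""
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
        f'http_requests_total{{method="GET",status="200"}} {requests_total}',
        "# HELP memory_usage_bytes Current memory usage.",
        "# TYPE memory_usage_bytes gauge",
        f"memory_usage_bytes {memory_bytes}",
    ]
    return "\n".join(lines) + "\n"

print(render_metrics(1027, 52_428_800))
```

Prometheus pulls this text every scrape interval, parses each sample, and stores it as a time series keyed by metric name plus labels.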

Four Metric Types

| Metric Type | Characteristics | Use Cases | Example |
|---|---|---|---|
| Counter | Monotonically increasing | Cumulative values (total requests, total errors) | http_requests_total |
| Gauge | Can go up or down | Instantaneous values (current memory, current connections) | memory_usage_bytes |
| Histogram | Distribution statistics | Latency distribution, request size distribution | http_request_duration_seconds |
| Summary | Quantile statistics | Pre-calculated quantiles | request_duration_seconds{quantile="0.95"} |
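To make the semantics of these types concrete, here is a minimal illustrative sketch in plain Python. The class names are hypothetical; real instrumentation would use an official client such as prometheus_client or Micrometer.

```python
import bisect

class SimpleCounter:
    """Monotonically increasing value, e.g. http_requests_total."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class SimpleGauge:
    """Value that can go up or down, e.g. memory_usage_bytes."""
    def __init__(self):
        self.value = 0.0
    def set(self, v):
        self.value = v

class SimpleHistogram:
    """Counts observations into cumulative 'le' buckets, e.g. request duration."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, 5.0)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot is the +Inf bucket
        self.total = 0.0
        self.observations = 0
    def observe(self, v):
        # bisect_left finds the first bucket with upper bound >= v
        self.counts[bisect.bisect_left(self.buckets, v)] += 1
        self.total += v
        self.observations += 1

requests = SimpleCounter()
requests.inc(); requests.inc()
mem = SimpleGauge()
mem.set(1024)
latency = SimpleHistogram()
latency.observe(0.3)  # lands in the le=0.5 bucket
latency.observe(2.0)  # lands in the le=5.0 bucket
```

A Summary differs from a Histogram in that quantiles (e.g. the 0.95 in the table) are pre-computed on the client side rather than derived from buckets at query time.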

Common PromQL Queries

# 1. Query request rate over last 5 minutes
rate(http_requests_total[5m])

# 2. Query P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 3. Query error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# 4. Group by service
sum by (service) (rate(http_requests_total[5m]))
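The rate() function used in these queries deserves a closer look: it computes the per-second increase of a counter over the window. The core idea can be sketched in a few lines (simplified: real Prometheus also handles counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplified sketch: assumes no counter resets and uses the raw
    endpoints, unlike real Prometheus rate().
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A counter scraped every 15s over one minute: it grew from 100 to 400
window = [(0, 100), (15, 180), (30, 250), (45, 330), (60, 400)]
print(simple_rate(window))  # 5.0 requests/second
```

This is why rate() is applied to Counters, never to Gauges: it only makes sense on values that grow monotonically.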

3.2 Loki: Lightweight Log Aggregation

Core Function

Loki is Grafana Labs' open-source log aggregation system, specialized in storing and querying logs. Its design is inspired by Prometheus, using labels to index logs.

Differences from ELK

| Aspect | Loki | ELK (Elasticsearch) |
|---|---|---|
| Storage | Indexes labels only, not log content | Full-text indexing |
| Resource Usage | Low (only 10-20% of ELK) | High |
| Query Language | LogQL (similar to PromQL) | Lucene |
| Use Cases | Cloud-native, Kubernetes | Complex log analysis |

Why choose Loki?

  • ✅ Deep integration with Prometheus and Grafana
  • ✅ Low resource consumption, suitable for small-to-medium deployments
  • ✅ Simple operation, gentle learning curve
  • ✅ Multi-tenant isolation support

How It Works

┌─────────────────┐
│   Application   │ Outputs JSON format logs to stdout
└────────┬────────┘
         │
         ▼ Collection (Docker log driver / Promtail)
┌─────────────────┐
│      Loki       │ Indexes labels, stores logs
└────────┬────────┘
         │
         ▼ LogQL query
┌─────────────────┐
│     Grafana     │ Visualization
└─────────────────┘

LogQL Query Examples

# 1. Query ERROR logs for specific service
{service="device-service", level="ERROR"}

# 2. Query logs containing specific keyword
{service="gateway-service"} |= "timeout"

# 3. Query all logs for specific tenant
{tenantId="tenant-001"}

# 4. Count ERROR logs in last 5 minutes
count_over_time({level="ERROR"}[5m])

# 5. Query logs by Trace ID
{service="data-service"} |= "trace-id-12345"
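Loki's "index labels only" model explains why these queries look the way they do: a query like `{service="gateway-service"} |= "timeout"` first narrows the log streams cheaply by label, then scans the matching lines for the substring. Conceptually (a toy model, not Loki's actual implementation):

```python
# Toy model of a LogQL query: cheap label match first, then a line grep.
logs = [
    ({"service": "gateway-service", "level": "ERROR"}, "upstream timeout after 5s"),
    ({"service": "gateway-service", "level": "INFO"},  "request ok in 12ms"),
    ({"service": "device-service",  "level": "ERROR"}, "db connection refused"),
]

def logql(streams, labels, contains=None):
    """Mimics `{labels...} |= "contains"` against (labels, line) pairs."""
    out = []
    for stream_labels, line in streams:
        if all(stream_labels.get(k) == v for k, v in labels.items()):
            if contains is None or contains in line:
                out.append(line)
    return out

hits = logql(logs, {"service": "gateway-service"}, contains="timeout")
print(hits)  # ['upstream timeout after 5s']
```

Because only the labels are indexed, label cardinality stays small and storage stays cheap; the trade-off is that content filters are brute-force scans over the selected streams.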

3.3 Tempo: Efficient Distributed Tracing

Core Function

Tempo is Grafana Labs' open-source distributed tracing backend, used to store and query distributed trace data.

Comparison with Jaeger/Zipkin

| Feature | Tempo | Jaeger | Zipkin |
|---|---|---|---|
| Storage Cost | Very low (only indexes Trace ID) | Medium | Medium |
| Scalability | High (relies on object storage) | Medium | Medium |
| Integration | Deeply integrated with Grafana LGTM | Standalone | Standalone |
| Protocol Support | OTLP, Jaeger, Zipkin | Jaeger | Zipkin |

Why choose Tempo?

  • ✅ Extremely low storage cost (only the Trace ID is indexed)
  • ✅ Deep integration with Prometheus and Loki
  • ✅ Supports multiple protocols (OTLP, Jaeger, Zipkin)
  • ✅ Suitable for cloud-native environments

How It Works

Request enters Gateway Service
         ↓
    Generate Trace ID (e.g., abc123)
         ↓
    Call Device Service (propagate Trace ID)
         ↓
    Device Service calls database (create Span)
         ↓
    All Spans sent to Tempo
         ↓
    Grafana queries and displays complete trace

Trace Structure Example

Trace (Trace ID: abc123)
├── Span 1: HTTP GET /api/devices (Gateway Service) - 150ms
│   ├── Span 2: Database Query (Device Service) - 50ms
│   └── Span 3: Redis Cache (Device Service) - 10ms
└── Span 4: Kafka Publish (Device Service) - 20ms

Key concepts:

  • Trace ID: Unique identifier for entire request chain
  • Span ID: Unique identifier for single operation
  • Parent Span ID: The parent Span's ID (for building the call tree)
  • Duration: Operation time spent

TraceQL Query Examples

# 1. Query traces for a specific service
{ resource.service.name = "gateway-service" }

# 2. Query traces with duration > 1 second
{ duration > 1s }

# 3. Query traces containing errors
{ status = error }

# 4. A specific trace (e.g., abc123def456) is looked up directly by its
#    Trace ID in Grafana's trace search rather than via a TraceQL matcher

3.4 Grafana: Unified Visualization Platform

Core Function

Grafana is an open-source data visualization and monitoring platform. It doesn't store data itself but reads from various data sources for display.

Main Features

  1. Multi-datasource support: Supports 30+ data sources including Prometheus, Loki, Tempo, InfluxDB, Elasticsearch
  2. Rich visualizations: Charts, dashboards, tables, heatmaps, and more
  3. Alert management: Metric-based alert rule configuration and notification
  4. Dashboard sharing: Import/export dashboard configurations, share with team

Role in LGTM

Grafana is the unified entry point for the entire stack. Through Grafana, users can:

  • View Prometheus metric data
  • Query Loki logs
  • Analyze Tempo traces
  • Configure alert rules

Dashboard Organization Recommendations

๐Ÿ“ Project Name
โ”œโ”€โ”€ ๐Ÿ“Š Service Overview (all service health, request volume, error rate)
โ”œโ”€โ”€ ๐Ÿ“Š JVM Details (memory, GC, threads)
โ”œโ”€โ”€ ๐Ÿ“Š Database Monitoring (connection pool, slow queries)
โ”œโ”€โ”€ ๐Ÿ“Š Kafka Monitoring (consumer lag, throughput)
โ””โ”€โ”€ ๐Ÿ“Š Business Metrics (device online count, message volume)

IV. Component Collaboration and Data Flow

4.1 Collaboration of Three Pillars

┌─────────────────────────────────────────────────────┐
│                 User Request Entry                  │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                   Gateway Service                   │
│  • Generate Trace ID: abc123                        │
│  • Record HTTP request metrics (Metrics)            │
│  • Output access logs (Logs)                        │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                   Device Service                    │
│  • Propagate Trace ID: abc123                       │
│  • Record database query metrics (Metrics)          │
│  • Output business logs (Logs)                      │
│  • Create Span (Database Query - 50ms)              │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│          Data Collection and Storage Layer          │
│  • Prometheus: Collects metric data                 │
│  • Loki: Collects log data                          │
│  • Tempo: Collects trace data                       │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│           Grafana Unified Visualization             │
│  • Display metric dashboards (Metrics)              │
│  • Query and display logs (Logs)                    │
│  • Analyze distributed traces (Traces)              │
│  • Configure alert rules                            │
└─────────────────────────────────────────────────────┘

4.2 Real-World Application Scenarios

Scenario 1: Rapid Service Failure Localization

Problem: User reports "device list loading is slow"

Troubleshooting steps:

  1. Check Grafana dashboard → Discover Gateway Service P95 latency spiked to 5 seconds
  2. Click latency chart → Jump to Tempo trace → View the slow request's trace
  3. Analyze trace → Discover 80% of time spent on Device Service database query
  4. Click Span → Jump to Loki logs → Find database connection pool wait alert
  5. Root cause: Database connection pool too small (max 10 connections), exhausted

Solution: Adjust connection pool config maximum-pool-size: 20


Scenario 2: Cross-Service Call Chain Analysis

Problem: Device data upload fails, but unknown which link failed

Troubleshooting steps:

  1. Query error logs in Loki:
    {service="data-service", level="ERROR"} |= "device-data"
    
  2. Extract Trace ID from logs: traceId: xyz789
  3. Query trace in Tempo → See complete call chain:
    Gateway (10ms) → Device Service (20ms) → Kafka (5ms) → Data Service (150ms)
      └── Database Insert (140ms) - Failed
    
  4. Click Database Span → View error message: Duplicate key violation
  5. Root cause: Device repeatedly uploaded same data, primary key conflict

Solution: Add idempotency check to business logic
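An idempotency check of the kind mentioned above can be sketched as follows. The in-memory seen-set and key shape are purely illustrative; a production service would typically rely on a database unique constraint or a Redis SETNX with a TTL instead:

```python
class IdempotentIngestor:
    """Skips device payloads that were already processed.

    Illustrative in-memory sketch; real code would back the seen-set
    with Redis or a database unique constraint.
    """
    def __init__(self):
        self.seen = set()
        self.stored = []

    def ingest(self, device_id, payload_id, payload):
        key = (device_id, payload_id)
        if key in self.seen:
            return False  # duplicate upload: ignore instead of failing on a key conflict
        self.seen.add(key)
        self.stored.append(payload)
        return True

ingestor = IdempotentIngestor()
assert ingestor.ingest("dev-1", "msg-1", {"temp": 21}) is True
assert ingestor.ingest("dev-1", "msg-1", {"temp": 21}) is False  # replay is ignored
```

The point is that a replayed upload becomes a harmless no-op rather than a primary-key violation surfacing as a 500 error deep in the call chain.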


V. OpenTelemetry: Unified Collection Standard

5.1 What is OpenTelemetry?

OpenTelemetry is a CNCF top-level project with the goal of unifying collection standards for the three pillars of observability (Metrics, Logs, Traces).

5.2 Core Components

  • OTLP (OpenTelemetry Protocol): Unified transport protocol
  • SDK: Multi-language support (Java, Go, Python, Node.js, etc.)
  • Collector: Data collection and forwarding

5.3 Why Emphasize OpenTelemetry?

  1. Avoid vendor lock-in: Standardized collection layer, replaceable backends (Prometheus, Jaeger, Tempo)
  2. Cross-language unification: Java, Go, Python use same protocol and standards
  3. LGTM compatible: Prometheus supports OTLP, Tempo natively supports OTLP
  4. Future trend: Already top 3 in CNCF activity ranking in 2026

5.4 Relationship Between OpenTelemetry and LGTM

┌─────────────────────────────────────────────────────┐
│                 Application Layer                   │
│  • OpenTelemetry SDK (Java/Go/Python)               │
│  • Unified collection of Metrics, Logs, Traces      │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼ OTLP protocol
┌─────────────────────────────────────────────────────┐
│             OpenTelemetry Collector                 │
│  • Data receive, process, forward                   │
└──────────────────────────┬──────────────────────────┘
                           │
                           ├─→ Prometheus (Metrics)
                           ├─→ Loki (Logs)
                           └─→ Tempo (Traces)

VI. Comparison of Other Mainstream Monitoring Solutions

| Solution | Advantages | Disadvantages | Use Cases |
|---|---|---|---|
| LGTM Stack | Open-source, cloud-native, low resource usage | Relatively simple features | Small-to-medium enterprises, cloud-native projects |
| SkyWalking | Integrated APM, powerful topology analysis | Higher resource consumption | Large enterprises, heavy APM needs |
| ELK Stack | Powerful log analysis, mature ecosystem | High cost, complex maintenance | Log-focused, complex query needs |
| Datadog | Comprehensive features, ready-to-use | Commercial, high cost | Fast deployment, sufficient budget |
| Zipkin/Jaeger | Mature tracing features | Need to combine with other components | Tracing-specific scenarios |

How to choose?

  • ✅ Cloud-native projects: Prioritize LGTM Stack
  • ✅ Heavy APM needs: Consider SkyWalking
  • ✅ Fast deployment: Commercial solution such as Datadog
  • ✅ Limited budget: Open-source LGTM Stack

VII. What Level Should Developers Master?

7.1 Basic Skills (Essential for All Developers)

Metrics:

  • ✅ Understand Prometheus basic concepts (metric types, labels, time series)
  • ✅ Able to read and understand Grafana dashboards
  • ✅ Able to write simple PromQL queries (e.g., query request rate, error rate)
  • ✅ Understand the role of metrics in troubleshooting

Logs:

  • ✅ Understand the importance of structured logging
  • ✅ Able to query Loki logs in Grafana
  • ✅ Understand the role of the Trace ID in logs
  • ✅ Able to quickly locate problems through logs

Traces:

  • ✅ Understand the basic concepts of Trace ID and Span ID
  • ✅ Able to view traces in Grafana
  • ✅ Understand the role of tracing in performance analysis
  • ✅ Able to locate slow calls through traces

7.2 Applied Skills (Needed for Daily Development)

Metrics:

  • 🔥 Able to add metric instrumentation (using Micrometer, Prometheus Client, etc.)
  • 🔥 Able to design reasonable business metrics (QPS, latency, error rate)
  • 🔥 Able to write complex PromQL queries (aggregation, grouping, filtering)
  • 🔥 Able to design simple Grafana dashboards

Logs:

  • 🔥 Able to configure structured log output (JSON format)
  • 🔥 Able to add Trace ID, tenant ID, and other labels to logs
  • 🔥 Able to write LogQL queries (filtering, regex matching, statistics)
  • 🔥 Able to perform log data masking

Traces:

  • 🔥 Able to integrate the OpenTelemetry SDK
  • 🔥 Able to customize Span attributes (e.g., tenant ID, device ID)
  • 🔥 Able to analyze traces and locate performance bottlenecks
  • 🔥 Understand the Trace ID propagation mechanism

Alert Configuration:

  • 🔥 Able to configure basic alert rules (service down, high error rate)
  • 🔥 Understand alert levels (Critical, Warning, Info)
  • 🔥 Able to avoid alert storms

7.3 Architecture Skills (Technical Experts/Architects)

Architecture Design:

  • 🚀 Able to design multi-cluster, multi-datacenter monitoring architecture
  • 🚀 Able to evaluate and select monitoring solutions (LGTM vs SkyWalking vs Datadog)
  • 🚀 Able to plan data retention policies, sampling strategies, and storage cost optimization

High Availability Design:

  • 🚀 Prometheus high availability (federation, Thanos)
  • 🚀 Loki horizontal scaling (read-write separation, object storage)
  • 🚀 Tempo distributed deployment

Platform Capabilities:

  • 🚀 Build a unified observability platform
  • 🚀 Implement "one-click onboarding": automatic service registration to the monitoring system
  • 🚀 Multi-tenant isolation, cost allocation

VIII. Learning Path Recommendations

8.1 Phase 1: Understand Concepts (1-2 weeks)

  1. Learn observability basics

    • Understand three pillars: Metrics, Logs, Traces
    • Learn about Prometheus, Loki, Tempo, Grafana roles
  2. Deploy LGTM Stack

    • Quick deployment using Docker Compose
    • Access each componentโ€™s Web UI
  3. Learn basic queries

    • PromQL basic queries (rate, sum, avg)
    • LogQL basic queries (label filtering, keyword search)
    • TraceQL basic queries (by service, by Trace ID)

8.2 Phase 2: Hands-On Practice (2-4 weeks)

  1. Add monitoring to applications

    • Integrate Spring Boot Actuator + Micrometer
    • Add custom business metrics
    • Configure structured logging
  2. Design Grafana dashboards

    • Create service overview dashboard
    • Create JVM monitoring dashboard
    • Create business metrics dashboard
  3. Configure alert rules

    • Service down alerts
    • Error rate alerts
    • Latency alerts

8.3 Phase 3: Deep Understanding (4+ weeks)

  1. Distributed tracing integration

    • Integrate OpenTelemetry SDK
    • Customize Span attributes
    • Analyze traces to locate performance bottlenecks
  2. Performance optimization

    • Quickly locate root causes through metrics, logs, and traces
    • Optimize slow queries and slow calls
  3. Architecture design

    • Learn monitoring system high-availability architecture
    • Evaluate pros and cons of different monitoring solutions

IX. Frequently Asked Questions (FAQ)

Q1: Whatโ€™s the difference between LGTM Stack and ELK Stack?

| Aspect | LGTM Stack | ELK Stack |
|---|---|---|
| Core Capability | Metrics + Logs + Traces | Primarily Logs |
| Resource Usage | Low (only 10-20% of ELK) | High |
| Deployment Complexity | Low | High |
| Query Languages | PromQL + LogQL + TraceQL (unified style) | Lucene (logs only) |
| Use Cases | Cloud-native, microservices | Log-focused analysis |

Selection recommendations:

  • ✅ If you need complete observability (Metrics + Logs + Traces), choose LGTM
  • ✅ If you only need powerful log analysis, choose ELK

Q2: How does Trace ID propagate between services?

Propagated through HTTP headers:

GET /api/devices HTTP/1.1
Host: gateway-service:8080
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

  • Spring Boot automatically propagates it through Micrometer Tracing
  • Kafka messages carry the Trace ID through headers
  • All components (database, Redis) automatically associate with the same Trace
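The traceparent header shown above follows the W3C Trace Context format: version-traceid-spanid-flags. A small sketch of pulling it apart:

```python
def parse_traceparent(header):
    """Split a W3C Trace Context traceparent header into its four fields.

    Format: <2-hex version>-<32-hex trace-id>-<16-hex parent-span-id>-<2-hex flags>
    """
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": flags == "01",  # simplified: flags is really a bit field
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

Each downstream service reads this header, reuses the trace-id, and sends its own span-id as the parent for the next hop; that is the whole propagation mechanism.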

Q3: What should production Trace sampling rate be?

Recommendations:

  • Production: 10% sampling (avoid excessive storage costs)
  • Testing: 100% sampling (facilitates debugging)
  • Error requests: 100% sampling (must capture all errors)
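A hybrid policy like this ("always keep errors, otherwise keep 10%") can be sketched with a deterministic hash of the Trace ID, so every service in the chain makes the same keep/drop decision for a given trace. This is only a conceptual sketch; a real deployment would configure a sampler in the OpenTelemetry SDK or Collector instead:

```python
def should_sample(trace_id, is_error, percent=10):
    """Keep all error traces; otherwise keep roughly `percent`% of traces.

    Deriving the decision from the trace ID (rather than a random draw)
    means every service in the call chain agrees on a given trace.
    """
    if is_error:
        return True
    # Treat the last 8 hex digits of the trace ID as a uniform-ish number.
    bucket = int(trace_id[-8:], 16) % 100
    return bucket < percent

# Errors always survive sampling
assert should_sample("4bf92f3577b34da6a3ce929d0e0e4736", is_error=True)
# Non-error traces are kept at ~10%
kept = sum(should_sample(f"{i:032x}", is_error=False) for i in range(1000))
```

Tail-based sampling (deciding after the whole trace is collected) is more powerful but requires buffering in the Collector; the head-based scheme above is the simpler default.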

Q4: How to reduce observability system costs?

  1. Adjust sampling rate: Reduce production trace sampling to 10%
  2. Shorten retention: Keep logs 7 days, traces 3 days, metrics 15 days
  3. Use object storage: Store Loki and Tempo data in S3/MinIO
  4. Aggregate metrics: Only keep key metrics, reduce fine-grained metric storage through pre-aggregation

X. Summary and Outlook

10.1 Key Points Recap

1. Observability is essential in cloud-native era

  • Traditional monitoring cannot handle distributed system complexity
  • Three pillars (Metrics, Logs, Traces) are indispensable

2. LGTM Stack is the current optimal solution

  • Open-source, cloud-native, low resource consumption
  • Deep integration with Spring Boot, Kubernetes

3. Developer capability levels

  • Basic: Understand concepts, able to query and analyze
  • Applied: Able to instrument, configure, design dashboards
  • Architecture: Able to design monitoring architecture, evaluate solutions

4. Clear learning path

  • From understanding concepts → hands-on practice → deep understanding
  • Progressive, practice-driven

10.2 Future Trends

Trend 1: OpenTelemetry becomes the unified standard

  • Already top 3 in CNCF activity ranking in 2026
  • Major companies migrating from proprietary solutions to OpenTelemetry

Trend 2: AI-assisted operations (AIOps)

  • Automatically identify anomaly patterns through machine learning
  • Intelligent alert aggregation, reduce alert noise
  • Automated root cause analysis

Trend 3: Edge computing monitoring

  • Growing demand for IoT device and edge node monitoring
  • Lightweight agents, edge data pre-processing

Trend 4: Observability platformization

  • Unified observability platform (Metrics + Logs + Traces + Events)
  • Multi-tenant isolation, cost allocation, self-service onboarding

10.3 Next Steps

For beginners:

  1. Deploy LGTM Stack immediately (Docker Compose one-click startup)
  2. Add metric instrumentation and structured logging to existing projects
  3. Create first dashboard in Grafana

For experienced developers:

  1. Deep dive into distributed tracing, integrate OpenTelemetry
  2. Optimize alert strategies, reduce alert noise
  3. Explore monitoring system high-availability architecture

For architects:

  1. Evaluate enterprise observability solutions
  2. Design multi-cluster, multi-datacenter monitoring architecture
  3. Explore frontiers like AIOps, edge computing monitoring


In conclusion:

In the cloud-native era, observability is no longer a "nice-to-have" but a "must-have". An excellent developer not only writes high-quality code but also makes systems "visible", "clear", and "understandable".

Remember this: A system without monitoring is running naked.

I hope this article helps you build a complete observability knowledge framework and stand out in the cloud-native era!


Welcome to follow FishTech Notes for more insights