Still Relying on Restart Magic for Troubleshooting? Master Cloud-Native Monitoring with LGTM Stack
3/11/2026
In a world where distributed microservice architectures have become mainstream, system complexity is growing exponentially. When production issues occur, are you still relying on "restart magic" and "log digging" to locate problems? When your service count grows from 10 to 100, are you still using traditional monitoring methods? It's time to build a modern observability framework.
I. Monitoring Challenges and Solutions in the Cloud-Native Era
1.1 Limitations of Traditional Monitoring
Remember the monolithic application era? One application, one database, one log file; when something went wrong, you could locate the issue just by checking the logs. But what about now?
Real-world scenario:
- User reports "device list loading is slow"
- You check Gateway Service logs → Normal
- You check Device Service logs → A bit slow, but nothing obvious
- You check database logs → Connection pool normal
- You check Redis logs → No problem either
- Finally you discover: Kafka consumer lag caused slow responses in Data Service, affecting the entire call chain
This is the black box dilemma of distributed systems: a single request may span 5-10 services, and traditional monitoring methods simply cannot track the complete call chain.
1.2 What is Observability?
Observability refers to the ability to infer a systemโs internal state through its external outputs. Unlike traditional monitoring, observability emphasizes proactive issue detection and rapid root cause localization.
Observability is built on three pillars:

| Pillar | Purpose | Problem Solved | Typical Tools |
|---|---|---|---|
| Metrics | Overall view of system status | "How is the system doing now?" | Prometheus + Grafana |
| Logs | Detailed event records | "What exactly happened?" | Loki / ELK |
| Traces | Complete call chain of requests | "Which services did the request pass through? Where was the time spent?" | Tempo / Jaeger / SkyWalking |
How do the three pillars work together?
- Metrics discover issues: the dashboard reveals "device list API P95 latency spiked to 5 seconds"
- Traces locate bottlenecks: distributed tracing reveals "80% of the time was spent on a database query"
- Logs show details: logs reveal "database connection pool wait alert, connections exhausted"
II. LGTM Stack: The De Facto Standard for Cloud-Native Monitoring
2.1 What is LGTM Stack?
The LGTM Stack is one of the most popular open-source observability solutions, composed of four core components:
- Loki - Log aggregation system
- Grafana - Unified visualization platform
- Tempo - Distributed tracing
- Prometheus ("M") - Metric collection and storage
Note: In Grafana Labs' official naming, the "M" in LGTM stands for Mimir, a horizontally scalable, Prometheus-compatible metrics backend; many deployments (including the one described here) run plain Prometheus instead. The components are typically learned and used in the order: Prometheus → Loki → Tempo → Grafana.

2.2 Why Choose LGTM Stack?
| Aspect | LGTM Stack | Traditional Solutions (e.g., ELK + Zipkin) |
|---|---|---|
| Cost | Low (resource consumption only 10-20% of ELK) | High |
| Deployment Complexity | Low (cloud-native design) | High |
| Learning Curve | Gentle (unified query style) | Steep (multiple query languages) |
| Integration | High (Grafana unified visualization) | Low (multiple independent systems) |
| Use Cases | Cloud-native, Kubernetes, microservices | Traditional architectures |
III. Deep Dive into LGTM Components
3.1 Prometheus: Metric Collection and Storage
Core Function
Prometheus is an open-source system monitoring and alerting tool, primarily responsible for collecting, storing, and querying time-series metric data.
How It Works
┌─────────────────┐
│   Application   │  Exposes /metrics endpoint
│  (Spring Boot)  │
└────────┬────────┘
         │
         ▼  Pull mode (every 15 seconds)
┌─────────────────┐
│   Prometheus    │  Collects and stores metrics
└────────┬────────┘
         │
         ▼  PromQL query
┌─────────────────┐
│     Grafana     │  Visualization
└─────────────────┘
Key Features
- Pull-based collection: Prometheus actively pulls metrics from target services (default 15 seconds)
- Multi-dimensional data model: Flexible querying through labels
  # Example: select the request-counter series by label
  http_requests_total{method="GET", status="200"}
- PromQL query language: Powerful time-series data querying capability
- Alert management: Rule-based alerting and notification
Four Metric Types
| Metric Type | Characteristics | Use Cases | Example |
|---|---|---|---|
| Counter | Monotonically increasing | Cumulative values (total requests, total errors) | http_requests_total |
| Gauge | Can go up or down | Instantaneous values (current memory, current connections) | memory_usage_bytes |
| Histogram | Distribution statistics | Latency distribution, request size distribution | http_request_duration_seconds |
| Summary | Quantile statistics | Pre-calculated quantiles | request_duration_seconds{quantile="0.95"} |
Common PromQL Queries
# 1. Query request rate over last 5 minutes
rate(http_requests_total[5m])
# 2. Query P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# 3. Query error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# 4. Group by service
sum by (service) (rate(http_requests_total[5m]))
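Queries like the error-rate expression above typically feed alerting rules. A sketch of what that might look like as a Prometheus rule file (the threshold, group name, and label values are illustrative, not prescriptive):

```yaml
# Prometheus alerting rule sketch: fire when the 5xx ratio exceeds 5%
# for 5 consecutive minutes. Values are examples, tune for your SLOs.
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m              # must hold for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% for 5 minutes"
```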
3.2 Loki: Lightweight Log Aggregation
Core Function
Loki is Grafana Labs' open-source log aggregation system, specialized in storing and querying logs. Its design is inspired by Prometheus, using labels to index logs.
Differences from ELK
| Aspect | Loki | ELK (Elasticsearch) |
|---|---|---|
| Storage | Indexes labels only, not log content | Full-text indexing |
| Resource Usage | Low (only 10-20% of ELK) | High |
| Query Language | LogQL (similar to PromQL) | Lucene |
| Use Cases | Cloud-native, Kubernetes | Complex log analysis |
Why choose Loki?
- ✅ Deep integration with Prometheus and Grafana
- ✅ Low resource consumption, suitable for small-to-medium deployments
- ✅ Simple operation, gentle learning curve
- ✅ Multi-tenant isolation support
How It Works
┌─────────────────┐
│   Application   │  Outputs JSON-format logs to stdout
└────────┬────────┘
         │
         ▼  Collection (Docker log driver / Promtail)
┌─────────────────┐
│      Loki       │  Indexes labels, stores logs
└────────┬────────┘
         │
         ▼  LogQL query
┌─────────────────┐
│     Grafana     │  Visualization
└─────────────────┘
LogQL Query Examples
# 1. Query ERROR logs for specific service
{service="device-service", level="ERROR"}
# 2. Query logs containing specific keyword
{service="gateway-service"} |= "timeout"
# 3. Query all logs for specific tenant
{tenantId="tenant-001"}
# 4. Count ERROR logs in last 5 minutes
count_over_time({level="ERROR"}[5m])
# 5. Query logs by Trace ID
{service="data-service"} |= "trace-id-12345"
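For label filters like these to work well, applications should emit structured JSON logs. One common way to do that in the Spring Boot stack this article assumes is Logback with the third-party logstash-logback-encoder; a minimal sketch (the dependency and the MDC key names `traceId`/`tenantId` are assumptions about your setup):

```xml
<!-- logback-spring.xml sketch: JSON logs to stdout so the Docker log
     driver or Promtail can ship them to Loki. Assumes the
     logstash-logback-encoder dependency is on the classpath. -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <!-- copy selected MDC fields into every JSON log line -->
      <includeMdcKeyName>traceId</includeMdcKeyName>
      <includeMdcKeyName>tenantId</includeMdcKeyName>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON"/>
  </root>
</configuration>
```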
3.3 Tempo: Efficient Distributed Tracing
Core Function
Tempo is Grafana Labs' open-source distributed tracing backend, used for storing and querying distributed trace data.
Comparison with Jaeger/Zipkin
| Feature | Tempo | Jaeger | Zipkin |
|---|---|---|---|
| Storage Cost | Very low (only indexes Trace ID) | Medium | Medium |
| Scalability | High (relies on object storage) | Medium | Medium |
| Integration | Deeply integrated with Grafana LGTM | Standalone | Standalone |
| Protocol Support | OTLP, Jaeger, Zipkin | Jaeger | Zipkin |
Why choose Tempo?
- ✅ Extremely low storage cost (only indexes Trace ID)
- ✅ Deep integration with Prometheus and Loki
- ✅ Supports multiple protocols (OTLP, Jaeger, Zipkin)
- ✅ Suitable for cloud-native environments
How It Works
Request enters Gateway Service
        ↓
Generate Trace ID (e.g., abc123)
        ↓
Call Device Service (propagate Trace ID)
        ↓
Device Service calls database (create Span)
        ↓
All Spans sent to Tempo
        ↓
Grafana queries and displays complete trace
Trace Structure Example
Trace (Trace ID: abc123)
├── Span 1: HTTP GET /api/devices (Gateway Service) - 150ms
│   ├── Span 2: Database Query (Device Service) - 50ms
│   └── Span 3: Redis Cache (Device Service) - 10ms
└── Span 4: Kafka Publish (Device Service) - 20ms
Key concepts:
- Trace ID: Unique identifier for entire request chain
- Span ID: Unique identifier for single operation
- Parent Span ID: the parent Span's ID (used to build the call tree)
- Duration: Operation time spent
TraceQL Query Examples
# 1. Query traces for a specific service
{ resource.service.name = "gateway-service" }
# 2. Query traces with spans slower than 1 second
{ duration > 1s }
# 3. Query traces containing errors
{ status = error }
# 4. A specific trace (e.g., abc123def456) is fetched directly by its
#    Trace ID via Grafana's trace lookup, not via a TraceQL span filter
3.4 Grafana: Unified Visualization Platform
Core Function
Grafana is an open-source data visualization and monitoring platform. It doesn't store data itself but reads from various data sources for display.
Main Features
- Multi-datasource support: Supports 30+ data sources including Prometheus, Loki, Tempo, InfluxDB, Elasticsearch
- Rich visualizations: Charts, dashboards, tables, heatmaps, and more
- Alert management: Metric-based alert rule configuration and notification
- Dashboard sharing: Import/export dashboard configurations, share with team
Role in LGTM
Grafana is the unified entry point for the entire stack. Through Grafana, users can:
- View Prometheus metric data
- Query Loki logs
- Analyze Tempo traces
- Configure alert rules

Dashboard Organization Recommendations
Project Name
├── Service Overview (all service health, request volume, error rate)
├── JVM Details (memory, GC, threads)
├── Database Monitoring (connection pool, slow queries)
├── Kafka Monitoring (consumer lag, throughput)
└── Business Metrics (device online count, message volume)
IV. Component Collaboration and Data Flow
4.1 Collaboration of Three Pillars
┌─────────────────────────────────────────────────────┐
│                 User Request Entry                  │
└──────────────┬──────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│                  Gateway Service                    │
│  • Generate Trace ID: abc123                        │
│  • Record HTTP request metrics (Metrics)            │
│  • Output access logs (Logs)                        │
└──────────────┬──────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│                  Device Service                     │
│  • Propagate Trace ID: abc123                       │
│  • Record database query metrics (Metrics)          │
│  • Output business logs (Logs)                      │
│  • Create Span (Database Query - 50ms)              │
└──────────────┬──────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│          Data Collection and Storage Layer          │
│  • Prometheus: collects metric data                 │
│  • Loki: collects log data                          │
│  • Tempo: collects trace data                       │
└──────────────┬──────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│            Grafana Unified Visualization            │
│  • Display metric dashboards (Metrics)              │
│  • Query and display logs (Logs)                    │
│  • Analyze distributed traces (Traces)              │
│  • Configure alert rules                            │
└─────────────────────────────────────────────────────┘
4.2 Real-World Application Scenarios
Scenario 1: Rapid Service Failure Localization
Problem: User reports "device list loading is slow"
Troubleshooting steps:
- Check the Grafana dashboard → discover Gateway Service P95 latency spiked to 5 seconds
- Click the latency chart → jump to the Tempo trace → view the slow request's trace
- Analyze the trace → discover 80% of the time spent on a Device Service database query
- Click the Span → jump to Loki logs → find a database connection pool wait alert
- Root cause: database connection pool too small (max 10 connections), exhausted
Solution: Adjust connection pool config maximum-pool-size: 20
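As a sketch, that fix in Spring Boot's application.yml might look like the following (the HikariCP property names are standard; the values are illustrative for this scenario):

```yaml
# application.yml fragment: enlarge the pool and fail fast instead of
# letting requests queue silently (values are examples, not a recipe)
spring:
  datasource:
    hikari:
      maximum-pool-size: 20            # was 10; exhausted under load
      connection-timeout: 3000         # ms to wait for a free connection
      leak-detection-threshold: 60000  # warn if a connection is held > 60s
```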
Scenario 2: Cross-Service Call Chain Analysis
Problem: Device data upload fails, but unknown which link failed
Troubleshooting steps:
- Query error logs in Loki:
  {service="data-service", level="ERROR"} |= "device-data"
- Extract the Trace ID from the log entry: traceId: xyz789
- Query the trace in Tempo → see the complete call chain:
  Gateway (10ms) → Device Service (20ms) → Kafka (5ms) → Data Service (150ms)
    └── Database Insert (140ms) - Failed
- Click the database Span → view the error message: Duplicate key violation
- Root cause: the device repeatedly uploaded the same data, causing a primary-key conflict
Solution: Add an idempotency check to the business logic
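A minimal in-memory sketch of such an idempotency check (the class and key names are hypothetical; production code would back this with a unique database constraint or Redis SETNX, since a local map is lost on restart and not shared across pods):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: skip writes whose business key has already been processed,
// so duplicate device uploads no longer hit a primary-key conflict.
public class IdempotentWriter {
    private final Map<String, Long> processed = new ConcurrentHashMap<>();

    /** Returns true if the record is new and was written; false for duplicates. */
    public boolean write(String deviceId, long reportedAt) {
        String businessKey = deviceId + ":" + reportedAt;
        // putIfAbsent is atomic: only the first caller with this key wins,
        // so two concurrent duplicate uploads cannot both pass the check.
        return processed.putIfAbsent(businessKey, System.currentTimeMillis()) == null;
    }
}
```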
V. OpenTelemetry: Unified Collection Standard
5.1 What is OpenTelemetry?
OpenTelemetry is a CNCF top-level project with the goal of unifying collection standards for the three pillars of observability (Metrics, Logs, Traces).
5.2 Core Components
- OTLP (OpenTelemetry Protocol): Unified transport protocol
- SDK: Multi-language support (Java, Go, Python, Node.js, etc.)
- Collector: Data collection and forwarding
5.3 Why Emphasize OpenTelemetry?
- Avoid vendor lock-in: Standardized collection layer, replaceable backends (Prometheus, Jaeger, Tempo)
- Cross-language unification: Java, Go, Python use same protocol and standards
- LGTM compatible: Prometheus supports OTLP, Tempo natively supports OTLP
- Future trend: Already top 3 in CNCF activity ranking in 2026
5.4 Relationship Between OpenTelemetry and LGTM
┌─────────────────────────────────────────────────────┐
│                 Application Layer                   │
│  • OpenTelemetry SDK (Java/Go/Python)               │
│  • Unified collection of Metrics, Logs, Traces      │
└──────────────┬──────────────────────────────────────┘
               │
               ▼  OTLP protocol
┌─────────────────────────────────────────────────────┐
│             OpenTelemetry Collector                 │
│  • Receives, processes, and forwards data           │
└──────────────┬──────────────────────────────────────┘
               │
               ├── Prometheus (Metrics)
               ├── Loki (Logs)
               └── Tempo (Traces)
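A minimal Collector pipeline wiring OTLP input to the three LGTM backends might look like the following sketch (this assumes the Collector "contrib" distribution; exporter names and endpoints vary by version and deployment, so treat every value here as illustrative):

```yaml
# OpenTelemetry Collector config sketch: one OTLP receiver fanned out
# to Prometheus (metrics), Loki (logs), and Tempo (traces).
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true   # plaintext inside the cluster; illustrative only
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      exporters: [loki]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
```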
VI. Comparison of Other Mainstream Monitoring Solutions
| Solution | Advantages | Disadvantages | Use Cases |
|---|---|---|---|
| LGTM Stack | Open-source, cloud-native, low resource usage | Relatively simple features | Small-to-medium enterprises, cloud-native projects |
| SkyWalking | Integrated APM, powerful topology analysis | Higher resource consumption | Large enterprises, heavy APM needs |
| ELK Stack | Powerful log analysis, mature ecosystem | High cost, complex maintenance | Log-focused, complex query needs |
| Datadog | Comprehensive features, ready-to-use | Commercial, high cost | Fast deployment, sufficient budget |
| Zipkin/Jaeger | Mature tracing features | Need to combine with other components | Tracing-specific scenarios |
How to choose?
- ✅ Cloud-native projects: Prioritize LGTM Stack
- ✅ Heavy APM needs: Consider SkyWalking
- ✅ Fast deployment: Commercial solution Datadog
- ✅ Limited budget: Open-source LGTM Stack
VII. What Level Should Developers Master?
7.1 Basic Skills (Essential for All Developers)
Metrics:
- ✅ Understand Prometheus basic concepts (metric types, labels, time series)
- ✅ Able to read and understand Grafana dashboards
- ✅ Able to write simple PromQL queries (e.g., request rate, error rate)
- ✅ Understand the role of metrics in troubleshooting
Logs:
- ✅ Understand the importance of structured logging
- ✅ Able to query Loki logs in Grafana
- ✅ Understand the role of the Trace ID in logs
- ✅ Able to quickly locate problems through logs
Traces:
- ✅ Understand the basic concepts of Trace ID and Span ID
- ✅ Able to view traces in Grafana
- ✅ Understand the role of tracing in performance analysis
- ✅ Able to locate slow calls through traces
7.2 Applied Skills (Needed for Daily Development)
Metrics:
- Able to add metric instrumentation (using Micrometer, Prometheus Client, etc.)
- Able to design reasonable business metrics (QPS, latency, error rate)
- Able to write complex PromQL queries (aggregation, grouping, filtering)
- Able to design simple Grafana dashboards
Logs:
- Able to configure structured log output (JSON format)
- Able to add labels such as Trace ID and tenant ID to logs
- Able to write LogQL queries (filtering, regex matching, statistics)
- Able to perform log data masking
Traces:
- Able to integrate the OpenTelemetry SDK
- Able to customize Span attributes (e.g., tenant ID, device ID)
- Able to analyze traces and locate performance bottlenecks
- Able to understand the Trace ID propagation mechanism
Alert Configuration:
- Able to configure basic alert rules (service down, high error rate)
- Understand alert levels (Critical, Warning, Info)
- Able to avoid alert storms
7.3 Architecture Skills (Technical Experts/Architects)
Architecture Design:
- Able to design multi-cluster, multi-datacenter monitoring architectures
- Able to evaluate and select monitoring solutions (LGTM vs. SkyWalking vs. Datadog)
- Able to plan data retention policies, sampling strategies, and storage cost optimization
High Availability Design:
- Prometheus high availability (federation, Thanos)
- Loki horizontal scaling (read/write separation, object storage)
- Tempo distributed deployment
Platform Capabilities:
- Build a unified observability platform
- Implement "one-click onboarding": automatic service registration with the monitoring system
- Multi-tenant isolation and cost allocation
VIII. Learning Path Recommendations
8.1 Phase 1: Understand Concepts (1-2 weeks)
1. Learn observability basics
   - Understand the three pillars: Metrics, Logs, Traces
   - Learn the roles of Prometheus, Loki, Tempo, and Grafana
2. Deploy the LGTM Stack
   - Quick deployment using Docker Compose
   - Access each component's Web UI
3. Learn basic queries
   - PromQL basics (rate, sum, avg)
   - LogQL basics (label filtering, keyword search)
   - TraceQL basics (by service, by Trace ID)
8.2 Phase 2: Hands-On Practice (2-4 weeks)
1. Add monitoring to applications
   - Integrate Spring Boot Actuator + Micrometer
   - Add custom business metrics
   - Configure structured logging
2. Design Grafana dashboards
   - Create a service overview dashboard
   - Create a JVM monitoring dashboard
   - Create a business metrics dashboard
3. Configure alert rules
   - Service-down alerts
   - Error-rate alerts
   - Latency alerts
8.3 Phase 3: Deep Understanding (4+ weeks)
1. Distributed tracing integration
   - Integrate the OpenTelemetry SDK
   - Customize Span attributes
   - Analyze traces to locate performance bottlenecks
2. Performance optimization
   - Quickly locate root causes through metrics, logs, and traces
   - Optimize slow queries and slow calls
3. Architecture design
   - Learn high-availability architectures for monitoring systems
   - Evaluate the pros and cons of different monitoring solutions
IX. Frequently Asked Questions (FAQ)
Q1: Whatโs the difference between LGTM Stack and ELK Stack?
| Aspect | LGTM Stack | ELK Stack |
|---|---|---|
| Core Capability | Metrics + Logs + Traces | Primarily Logs |
| Resource Usage | Low (only 10-20% of ELK) | High |
| Deployment Complexity | Low | High |
| Query Languages | PromQL + LogQL + TraceQL (unified style) | Lucene (logs only) |
| Use Cases | Cloud-native, microservices | Log-focused analysis |
Selection recommendations:
- ✅ If you need complete observability (Metrics + Logs + Traces), choose LGTM
- ✅ If you only need powerful log analysis, choose ELK
Q2: How does Trace ID propagate between services?
Propagated through HTTP headers:
GET /api/devices HTTP/1.1
Host: gateway-service:8080
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
- Spring Boot propagates it automatically via Micrometer Tracing
- Kafka messages carry the Trace ID in message headers
- Downstream components (database, Redis) are associated with the same trace automatically
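To make the header format concrete, here is a small, self-contained parser for the traceparent value shown above. The four dash-separated fields follow the W3C Trace Context layout; the class and field names are illustrative, and real applications should rely on Micrometer Tracing or the OpenTelemetry SDK rather than hand-parsing headers:

```java
// Parses a W3C `traceparent` header: version-traceid-spanid-flags,
// e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
public class TraceParent {
    public final String version; // "00" today
    public final String traceId; // 32 hex chars, shared by the whole call chain
    public final String spanId;  // 16 hex chars, the caller's span
    public final String flags;   // "01" = sampled

    private TraceParent(String version, String traceId, String spanId, String flags) {
        this.version = version;
        this.traceId = traceId;
        this.spanId = spanId;
        this.flags = flags;
    }

    public static TraceParent parse(String header) {
        String[] parts = header.split("-");
        // Basic shape check: 4 fields, 128-bit trace id, 64-bit span id.
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        return new TraceParent(parts[0], parts[1], parts[2], parts[3]);
    }
}
```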
Q3: What should production Trace sampling rate be?
Recommendations:
- Production: 10% sampling (avoid excessive storage costs)
- Testing: 100% sampling (facilitates debugging)
- Error requests: 100% sampling (must capture all errors)
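The key property of such a policy is that the sampling decision is derived from the Trace ID, so every service in the chain keeps or drops the same trace and you never store half a trace. A simplified sketch of that idea (real samplers such as OpenTelemetry's TraceIdRatioBased operate on the trace-id bits directly; hashCode() here is a stand-in, and the class name is hypothetical):

```java
public class HeadSampler {
    // Returns true if this trace should be recorded. Deterministic per
    // trace id, always keeps errors, keeps ~keepPercent% of the rest.
    public static boolean shouldSample(String traceId, boolean hasError, int keepPercent) {
        if (hasError) {
            return true; // errors are always captured, per the 100% error-sampling rule
        }
        int bucket = Math.floorMod(traceId.hashCode(), 100); // stable bucket 0..99
        return bucket < keepPercent;                         // e.g. keepPercent = 10
    }
}
```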
Q4: How to reduce observability system costs?
- Adjust sampling rate: Reduce production trace sampling to 10%
- Shorten retention: Keep logs 7 days, traces 3 days, metrics 15 days
- Use object storage: Store Loki and Tempo data in S3/MinIO
- Aggregate metrics: Only keep key metrics, reduce fine-grained metric storage through pre-aggregation
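As a sketch, the shorter log retention suggested above might be configured in Loki like this (the field names follow recent Loki versions, but exact keys and required companion settings vary by release, so check the docs for yours):

```yaml
# Loki config fragment: keep logs ~7 days (values illustrative)
limits_config:
  retention_period: 168h     # 7 days
compactor:
  retention_enabled: true    # the compactor enforces the retention policy
  working_directory: /loki/compactor
```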
X. Summary and Outlook
10.1 Key Points Recap
1. Observability is essential in cloud-native era
- Traditional monitoring cannot handle distributed system complexity
- Three pillars (Metrics, Logs, Traces) are indispensable
2. LGTM Stack is the current optimal solution
- Open-source, cloud-native, low resource consumption
- Deep integration with Spring Boot, Kubernetes
3. Developer capability levels
- Basic: Understand concepts, able to query and analyze
- Applied: Able to instrument, configure, design dashboards
- Architecture: Able to design monitoring architecture, evaluate solutions
4. Clear learning path
- From understanding concepts โ hands-on practice โ deep understanding
- Progressive, practice-driven
10.2 Future Trends
Trend 1: OpenTelemetry becomes unified standard
- Already top 3 in CNCF activity ranking in 2026
- Major companies migrating from proprietary solutions to OpenTelemetry
Trend 2: AI-assisted operations (AIOps)
- Automatically identify anomaly patterns through machine learning
- Intelligent alert aggregation, reduce alert noise
- Automated root cause analysis
Trend 3: Edge computing monitoring
- Growing demand for IoT device and edge node monitoring
- Lightweight agents, edge data pre-processing
Trend 4: Observability platformization
- Unified observability platform (Metrics + Logs + Traces + Events)
- Multi-tenant isolation, cost allocation, self-service onboarding
10.3 Next Steps
For beginners:
- Deploy LGTM Stack immediately (Docker Compose one-click startup)
- Add metric instrumentation and structured logging to existing projects
- Create first dashboard in Grafana
For experienced developers:
- Deep dive into distributed tracing, integrate OpenTelemetry
- Optimize alert strategies, reduce alert noise
- Explore monitoring system high-availability architecture
For architects:
- Evaluate enterprise observability solutions
- Design multi-cluster, multi-datacenter monitoring architecture
- Explore frontiers like AIOps, edge computing monitoring
Further Reading
Official Documentation:
- Prometheus Official Documentation
- Grafana Official Documentation
- Loki Official Documentation
- Tempo Official Documentation
- OpenTelemetry Official Documentation
In conclusion:
In the cloud-native era, observability is no longer a "nice-to-have" but a "must-have". An excellent developer not only writes high-quality code but also makes systems "visible", "clear", and "understandable".
Remember this: A system without monitoring is running naked.
I hope this article helps you build a complete observability knowledge framework and stand out in the cloud-native era!
Welcome to follow FishTech Notes for more insights