Backend · Infrastructure · Observability

Enterprise Observability Platform

Production monitoring system processing 50M+ events/day

2024 - Present
50M+ Events/Day
60% MTTR Reduction
99% Monitoring Coverage

Problem & Context

The Challenge

Michelin's production systems lacked centralized observability, leading to slow incident detection and resolution (MTTR > 4 hours) and reactive troubleshooting. Teams relied on manual log analysis across siloed systems.

Context

As an Application Support Engineer at Michelin, I was responsible for ensuring 99.95% uptime of critical business applications serving global operations, which required proactive monitoring and automated alerting.

System Overview

Built a comprehensive observability platform on Splunk Enterprise that ingests logs from 200+ servers and drives real-time dashboards, automated alerts, and correlation rules. Implemented ETL data pipelines and custom parsers for legacy log formats, and integrated the platform with incident management systems.
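To make the custom-parser idea concrete, here is a minimal Python sketch that normalizes a hypothetical pipe-delimited legacy log line into key="value" pairs before ingestion. The field layout, timestamp format, and sample record are assumptions for illustration, not the actual legacy formats handled in production.

```python
from datetime import datetime

# Hypothetical legacy layout: "20240117 093215|APP42|ERROR|order-service|timeout contacting DB"
LEGACY_FIELDS = ["timestamp", "host", "level", "app", "message"]

def parse_legacy_line(line: str) -> dict:
    """Split a pipe-delimited legacy record into named fields."""
    values = line.rstrip("\n").split("|")
    event = dict(zip(LEGACY_FIELDS, values))
    # Normalize the timestamp to ISO 8601 so the indexer can parse it consistently.
    event["timestamp"] = datetime.strptime(
        event["timestamp"], "%Y%m%d %H%M%S"
    ).isoformat()
    return event

def to_kv(event: dict) -> str:
    """Render as key="value" pairs, a format Splunk extracts automatically at search time."""
    return " ".join(f'{k}="{v}"' for k, v in event.items())

if __name__ == "__main__":
    raw = "20240117 093215|APP42|ERROR|order-service|timeout contacting DB"
    print(to_kv(parse_legacy_line(raw)))
```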

Architecture

Event-driven architecture with distributed log collectors, centralized indexers, search heads for analytics, and automated response workflows.

Universal Forwarders: collect and ship logs from 200+ servers with minimal overhead
Heavy Forwarders: parse, filter, and enrich data before indexing
Indexers: store and index 50M+ events/day with replication
Search Heads: run queries, dashboards, alerts, and correlation searches
Automation Layer: trigger remediation scripts via API integration (see the sketch after this list)
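A minimal sketch of the automation layer, assuming Splunk's webhook alert action POSTs its JSON payload (including the search_name field) to a small HTTP listener that maps alert names to remediation scripts. The port, alert names, and script paths are hypothetical.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from alert name to a remediation script.
REMEDIATIONS = {
    "disk_usage_critical": ["/opt/remediation/clean_tmp.sh"],
    "service_down": ["/opt/remediation/restart_service.sh"],
}

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        alert = payload.get("search_name", "")   # name of the triggering saved search
        command = REMEDIATIONS.get(alert)
        if command:
            # Fire the remediation script asynchronously; the caller only needs a 200.
            subprocess.Popen(command)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), AlertHandler).serve_forever()
```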

Visual Evidence

Enterprise Observability Platform Architecture: data pipeline from 200+ servers through forwarders, indexers, and search heads to the automation layer

Tech Stack

Splunk · Python · PowerShell · MongoDB · API Integration

Key Engineering Decisions

Data Pipeline Architecture

Challenge:

Balancing ingestion throughput (50M events/day) with parsing complexity and storage costs

Solution:

Implemented a tiered architecture: Heavy Forwarders handle complex parsing and enrichment, while Universal Forwarders handle lightweight collection. Chose between index-time and search-time field extraction selectively per source type (see the sketch below).

Tradeoffs:

Increased infrastructure complexity but reduced search latency by 70% and storage costs by 40%
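In Splunk, the heavy-forwarder tier handles this with parsing and routing configuration rather than application code, but the Python sketch below illustrates the filter-and-enrich step conceptually: low-value events are dropped before they ever reach an indexer and the rest are tagged with routing metadata, which is where the storage and search-latency savings come from. The severity levels and site tag are assumptions.

```python
import json

DROP_LEVELS = {"DEBUG", "TRACE"}        # assumed low-value severities
SITE_TAG = "emea-plant-01"              # hypothetical enrichment metadata

def filter_and_enrich(raw_events):
    """Yield only events worth indexing, enriched with routing metadata."""
    for line in raw_events:
        event = json.loads(line)
        if event.get("level", "").upper() in DROP_LEVELS:
            continue                    # dropped before it hits an indexer
        event["site"] = SITE_TAG        # enrich at forward time, not at search time
        yield json.dumps(event)

if __name__ == "__main__":
    sample = [
        '{"level": "DEBUG", "message": "cache hit"}',
        '{"level": "ERROR", "message": "payment gateway timeout"}',
    ]
    for kept in filter_and_enrich(sample):
        print(kept)
```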

Alert Strategy

Challenge:

Avoiding alert fatigue while ensuring zero critical misses

Solution:

Built an adaptive threshold system using statistical baselines, correlation rules to deduplicate related alerts, and severity-based escalation policies (see the sketch below).

Tradeoffs:

Required 2 weeks of baseline tuning but reduced false positives by 80%
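A minimal sketch of the adaptive-threshold idea, assuming a per-metric history of recent samples: an alert fires only when the latest reading exceeds the rolling baseline (mean plus k standard deviations), and repeat firings inside a suppression window are deduplicated. The window sizes, k value, and sample data are illustrative.

```python
import statistics
import time

K_SIGMA = 3.0                 # how far above baseline counts as anomalous (assumed)
BASELINE_WINDOW = 288         # e.g. 24h of 5-minute samples (assumed)
SUPPRESS_SECONDS = 900        # dedupe repeat alerts for 15 minutes (assumed)

_last_fired: dict[str, float] = {}

def should_alert(metric: str, history: list[float], latest: float) -> bool:
    """Fire when 'latest' breaches mean + K_SIGMA * stdev of the recent baseline."""
    window = history[-BASELINE_WINDOW:]
    if len(window) < 2:
        return False                      # not enough data to form a baseline
    threshold = statistics.mean(window) + K_SIGMA * statistics.stdev(window)
    if latest <= threshold:
        return False
    now = time.time()
    if now - _last_fired.get(metric, 0.0) < SUPPRESS_SECONDS:
        return False                      # deduplicate: this alert fired recently
    _last_fired[metric] = now
    return True

if __name__ == "__main__":
    history = [120, 130, 125, 128, 122, 127, 131, 126]   # error counts per interval
    print(should_alert("checkout_errors", history, 400))  # True: far above baseline
    print(should_alert("checkout_errors", history, 405))  # False: suppressed as duplicate
```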

Results & Impact

Mean Time to Detect (MTTD): 93% improvement (before: 45 min → after: 3 min)
Mean Time to Resolve (MTTR): 60% improvement (before: 4.5 hours → after: 1.8 hours)
Incident Prevention: 40% fewer outages (before: fully reactive → after: 40% of issues handled proactively)
Engineer Productivity: 83% time saved (before: 6 hr/week on log analysis → after: 1 hr/week)

Failures & Learnings

1. Over-engineering dashboards is worse than no dashboards; focus on actionable metrics.
2. Parsing at index-time vs. search-time has massive performance implications; profile before deciding.
3. Alert fatigue is real; statistical baselines beat static thresholds.
4. Integration with incident management (PagerDuty, ServiceNow) is critical for adoption.
5. Runbook documentation must live inline with the alert itself, not in an externally linked document (see the sketch below).
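To illustrate the last point, a small hypothetical sketch of composing an alert notification with the runbook steps embedded directly in the payload, so responders see the remediation steps in the same place the alert lands. The alert name, fields, and steps are illustrative.

```python
import json

# Hypothetical runbook registry keyed by alert name.
RUNBOOKS = {
    "disk_usage_critical": [
        "1. Check /var/log growth with du -sh /var/log/*",
        "2. Run the log-rotation job manually",
        "3. If usage stays above 90%, escalate to the storage team",
    ],
}

def build_alert_payload(alert_name: str, details: dict) -> str:
    """Embed the runbook in the alert so responders never have to leave the page."""
    return json.dumps({
        "alert": alert_name,
        "details": details,
        "runbook": RUNBOOKS.get(
            alert_name,
            ["No runbook defined; write one before shipping this alert"],
        ),
    }, indent=2)

if __name__ == "__main__":
    print(build_alert_payload("disk_usage_critical", {"host": "app-web-07", "usage_pct": 94}))
```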