Backend · Infrastructure · Observability

Enterprise Observability Platform

Production monitoring system processing 50M+ events/day

2024 - Present
50M+ Events/Day
60% MTTR Reduction
99% Monitoring Coverage

Problem & Context

The Challenge

Michelin's production systems lacked centralized observability, leading to slow incident detection and resolution (MTTR > 4 hours) and reactive troubleshooting. Teams relied on manual log analysis across siloed systems.

Context

As an Application Support Engineer at Michelin, I was responsible for ensuring 99.95% uptime of critical business applications serving global operations, which required proactive monitoring and automated alerting.

System Overview

Built a comprehensive observability platform on Splunk Enterprise that ingests logs from 200+ servers and drives real-time dashboards, automated alerts, and correlation rules. Implemented ETL data pipelines and custom parsers for legacy log formats, and integrated the platform with incident management systems.
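To make the custom-parser idea concrete, here is a minimal Python sketch that normalizes a hypothetical pipe-delimited legacy log line into key="value" pairs before ingestion. The field layout, timestamp format, and sample record are assumptions for illustration, not the actual legacy formats handled in production.

```python
from datetime import datetime

# Hypothetical legacy layout: "20240117 093215|APP42|ERROR|order-service|timeout contacting DB"
LEGACY_FIELDS = ["timestamp", "host", "level", "app", "message"]

def parse_legacy_line(line: str) -> dict:
    """Split a pipe-delimited legacy record into named fields."""
    values = line.rstrip("\n").split("|")
    event = dict(zip(LEGACY_FIELDS, values))
    # Normalize the timestamp to ISO 8601 so the indexer can parse it consistently.
    event["timestamp"] = datetime.strptime(
        event["timestamp"], "%Y%m%d %H%M%S"
    ).isoformat()
    return event

def to_kv(event: dict) -> str:
    """Render as key="value" pairs, a format Splunk extracts automatically at search time."""
    return " ".join(f'{k}="{v}"' for k, v in event.items())

if __name__ == "__main__":
    raw = "20240117 093215|APP42|ERROR|order-service|timeout contacting DB"
    print(to_kv(parse_legacy_line(raw)))
```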

Architecture

Event-driven architecture with distributed log collectors, centralized indexers, search heads for analytics, and automated response workflows.

Universal Forwarders: collect and ship logs from 200+ servers with minimal overhead
Heavy Forwarders: parse, filter, and enrich data before indexing
Indexers: store and index 50M+ events/day with replication
Search Heads: run queries, dashboards, alerts, and correlation searches
Automation Layer: trigger remediation scripts via API integration (see the sketch after this list)
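A minimal sketch of the automation layer, assuming Splunk's webhook alert action POSTs its JSON payload (including the search_name field) to a small HTTP listener that maps alert names to remediation scripts. The port, alert names, and script paths are hypothetical.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from alert name to a remediation script.
REMEDIATIONS = {
    "disk_usage_critical": ["/opt/remediation/clean_tmp.sh"],
    "service_down": ["/opt/remediation/restart_service.sh"],
}

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        alert = payload.get("search_name", "")   # name of the triggering saved search
        command = REMEDIATIONS.get(alert)
        if command:
            # Fire the remediation script asynchronously; the caller only needs a 200.
            subprocess.Popen(command)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), AlertHandler).serve_forever()
```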

Visual Evidence

Enterprise Observability Platform Architecture: data pipeline from 200+ servers through forwarders, indexers, and search heads to the automation layer

Tech Stack

Splunk · Python · PowerShell · MongoDB · API Integration

Key Engineering Decisions

Data Pipeline Architecture

Challenge:

Balancing ingestion throughput (50M events/day) with parsing complexity and storage costs

Solution:

Implemented a tiered architecture: Heavy Forwarders handle complex parsing and enrichment, while Universal Forwarders handle lightweight collection. Chose between index-time and search-time field extraction selectively per source type (see the sketch below).

Tradeoffs:

Increased infrastructure complexity but reduced search latency by 70% and storage costs by 40%
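In Splunk, the heavy-forwarder tier handles this with parsing and routing configuration rather than application code, but the Python sketch below illustrates the filter-and-enrich step conceptually: low-value events are dropped before they ever reach an indexer and the rest are tagged with routing metadata, which is where the storage and search-latency savings come from. The severity levels and site tag are assumptions.

```python
import json

DROP_LEVELS = {"DEBUG", "TRACE"}        # assumed low-value severities
SITE_TAG = "emea-plant-01"              # hypothetical enrichment metadata

def filter_and_enrich(raw_events):
    """Yield only events worth indexing, enriched with routing metadata."""
    for line in raw_events:
        event = json.loads(line)
        if event.get("level", "").upper() in DROP_LEVELS:
            continue                    # dropped before it hits an indexer
        event["site"] = SITE_TAG        # enrich at forward time, not at search time
        yield json.dumps(event)

if __name__ == "__main__":
    sample = [
        '{"level": "DEBUG", "message": "cache hit"}',
        '{"level": "ERROR", "message": "payment gateway timeout"}',
    ]
    for kept in filter_and_enrich(sample):
        print(kept)
```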

Alert Strategy

Challenge:

Avoiding alert fatigue while ensuring zero critical misses

Solution:

Built an adaptive threshold system using statistical baselines, correlation rules to deduplicate related alerts, and severity-based escalation policies (see the sketch below).

Tradeoffs:

Required 2 weeks of baseline tuning but reduced false positives by 80%
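A minimal sketch of the adaptive-threshold idea, assuming a per-metric history of recent samples: an alert fires only when the latest reading exceeds the rolling baseline (mean plus k standard deviations), and repeat firings inside a suppression window are deduplicated. The window sizes, k value, and sample data are illustrative.

```python
import statistics
import time

K_SIGMA = 3.0                 # how far above baseline counts as anomalous (assumed)
BASELINE_WINDOW = 288         # e.g. 24h of 5-minute samples (assumed)
SUPPRESS_SECONDS = 900        # dedupe repeat alerts for 15 minutes (assumed)

_last_fired: dict[str, float] = {}

def should_alert(metric: str, history: list[float], latest: float) -> bool:
    """Fire when 'latest' breaches mean + K_SIGMA * stdev of the recent baseline."""
    window = history[-BASELINE_WINDOW:]
    if len(window) < 2:
        return False                      # not enough data to form a baseline
    threshold = statistics.mean(window) + K_SIGMA * statistics.stdev(window)
    if latest <= threshold:
        return False
    now = time.time()
    if now - _last_fired.get(metric, 0.0) < SUPPRESS_SECONDS:
        return False                      # deduplicate: this alert fired recently
    _last_fired[metric] = now
    return True

if __name__ == "__main__":
    history = [120, 130, 125, 128, 122, 127, 131, 126]   # error counts per interval
    print(should_alert("checkout_errors", history, 400))  # True: far above baseline
    print(should_alert("checkout_errors", history, 405))  # False: suppressed as duplicate
```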

Results & Impact

Mean Time to Detect (MTTD): 93% improvement (before: 45 min → after: 3 min)
Mean Time to Resolve (MTTR): 60% improvement (before: 4.5 hours → after: 1.8 hours)
Incident Prevention: 40% fewer outages (before: fully reactive → after: 40% of issues handled proactively)
Engineer Productivity: 83% time saved (before: 6 hr/week on log analysis → after: 1 hr/week)

Failures & Learnings

1. Over-engineering dashboards is worse than no dashboards; focus on actionable metrics.
2. Parsing at index-time vs. search-time has massive performance implications; profile before deciding.
3. Alert fatigue is real; statistical baselines beat static thresholds.
4. Integration with incident management (PagerDuty, ServiceNow) is critical for adoption.
5. Runbook documentation must live inline with the alert itself, not in an externally linked document (see the sketch below).
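To illustrate the last point, a small hypothetical sketch of composing an alert notification with the runbook steps embedded directly in the payload, so responders see the remediation steps in the same place the alert lands. The alert name, fields, and steps are illustrative.

```python
import json

# Hypothetical runbook registry keyed by alert name.
RUNBOOKS = {
    "disk_usage_critical": [
        "1. Check /var/log growth with du -sh /var/log/*",
        "2. Run the log-rotation job manually",
        "3. If usage stays above 90%, escalate to the storage team",
    ],
}

def build_alert_payload(alert_name: str, details: dict) -> str:
    """Embed the runbook in the alert so responders never have to leave the page."""
    return json.dumps({
        "alert": alert_name,
        "details": details,
        "runbook": RUNBOOKS.get(
            alert_name,
            ["No runbook defined; write one before shipping this alert"],
        ),
    }, indent=2)

if __name__ == "__main__":
    print(build_alert_payload("disk_usage_critical", {"host": "app-web-07", "usage_pct": 94}))
```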