Production monitoring system processing 50M+ events/day
Michelin's production systems lacked centralized observability, leading to slow incident detection and resolution (MTTR > 4 hours) and reactive troubleshooting. Teams relied on manual log analysis across siloed systems.
As an Application Support Engineer at Michelin, I was responsible for ensuring 99.95% uptime of critical business applications serving global operations. Meeting that target required proactive monitoring and automated alerting.
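To put the uptime target in perspective, the downtime it allows can be worked out directly. This is a minimal sketch that assumes availability is measured over a 365-day year; the source does not state the measurement window.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, annual measurement window


def downtime_budget_hours(availability_pct: float,
                          window_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime allowed in the window at the given availability."""
    return window_hours * (1 - availability_pct / 100)


budget = downtime_budget_hours(99.95)
print(f"Annual downtime budget at 99.95%: {budget:.2f} hours")
```

Under that assumption the budget is about 4.38 hours per year, so a single incident with an MTTR above 4 hours would consume nearly all of it.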
Built a comprehensive observability platform on Splunk Enterprise that ingests logs from 200+ servers and provides real-time dashboards, automated alerts, and correlation rules. Implemented ETL data pipelines, custom parsers for legacy formats, and integrations with incident management systems.
Event-driven architecture with distributed log collectors, centralized indexers, search heads for analytics, and automated response workflows.

Enterprise Observability Platform Architecture: Data pipeline from 200+ servers through forwarders, indexers, search heads, to automation layer
Balancing ingestion throughput (50M events/day) with parsing complexity and storage costs
Implemented a tiered architecture: Heavy Forwarders for sources that need complex parsing, Universal Forwarders for simple collection. Chose between index-time and search-time field extraction case by case rather than applying one approach everywhere.
Increased infrastructure complexity but reduced search latency by 70% and storage costs by 40%
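The ingestion figures and the tiering rule above can be sketched in a few lines. This is only an illustration: the `LogSource` type, the `needs_complex_parsing` flag, and the source names are hypothetical, and the throughput is a daily average (real peaks would be higher).

```python
from dataclasses import dataclass

EVENTS_PER_DAY = 50_000_000  # from the text
SERVERS = 200                # from the text (a lower bound: "200+")
SECONDS_PER_DAY = 86_400

# Average ingestion rate; bursts during incidents will exceed this.
avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY
per_server_eps = avg_eps / SERVERS


@dataclass
class LogSource:
    name: str                    # hypothetical source name
    needs_complex_parsing: bool  # e.g. a legacy format that needs a custom parser


def forwarder_tier(source: LogSource) -> str:
    """Route complex formats to a heavy forwarder, everything else to a universal one."""
    return "heavy_forwarder" if source.needs_complex_parsing else "universal_forwarder"


print(f"Average: {avg_eps:.0f} events/s overall, {per_server_eps:.2f} events/s per server")
for src in (LogSource("legacy_batch_app", True), LogSource("web_access_log", False)):
    print(f"{src.name} -> {forwarder_tier(src)}")
```

The point of the split is that only the sources that need expensive parsing pay for the heavier tier, while the bulk of simple collection stays lightweight.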
Avoiding alert fatigue while ensuring zero critical misses
Built an adaptive threshold system based on statistical baselines, added correlation rules to deduplicate related alerts, and defined escalation policies by severity level.
Required 2 weeks of baseline tuning but reduced false positives by 80%
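One common way to build a statistical baseline is a rolling mean and standard deviation, combined with a time-window rule for suppressing repeats. The sketch below shows that general idea in plain Python, not the Splunk searches actually used; the class name, the 3-sigma factor, the window sizes, and the alert fields (`host`, `check`, `ts`) are all assumptions.

```python
from collections import deque
from statistics import mean, stdev


class AdaptiveThreshold:
    """Flag values that exceed a rolling baseline by k standard deviations."""

    def __init__(self, window: int = 60, k: float = 3.0, min_samples: int = 10):
        self.history = deque(maxlen=window)  # most recent normal observations
        self.k = k
        self.min_samples = min_samples       # stay quiet until a baseline exists

    def is_anomalous(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            anomalous = value > mu + self.k * sigma
        # Only normal values update the baseline, so a spike does not inflate it.
        if not anomalous:
            self.history.append(value)
        return anomalous


def deduplicate(alerts: list[dict], window_s: int = 300) -> list[dict]:
    """Keep the first alert per (host, check) and suppress repeats within window_s."""
    last_kept: dict[tuple, float] = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["host"], alert["check"])
        if key not in last_kept or alert["ts"] - last_kept[key] >= window_s:
            kept.append(alert)
            last_kept[key] = alert["ts"]
    return kept


detector = AdaptiveThreshold(window=30, k=3.0, min_samples=10)
warmup = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101]
warmup_flags = [detector.is_anomalous(v) for v in warmup]
normal_flag = detector.is_anomalous(100)
spike_flag = detector.is_anomalous(150)
print(normal_flag, spike_flag)
```

The `min_samples` warm-up mirrors the tuning period mentioned above: the detector raises nothing until it has enough history to form a baseline.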
Over-engineered dashboards are worse than no dashboards - focus on actionable metrics
Parsing at index-time vs search-time has massive performance implications - profile before deciding
Alert fatigue is real - statistical baselines > static thresholds
Integration with incident management (PagerDuty, ServiceNow) is critical for adoption
Documentation for runbooks must be inline with alerts, not linked externally
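The last point, keeping runbook steps inside the alert itself, can be enforced where the alert is built. The sketch below uses a generic payload with hypothetical field names and example steps. It is not the PagerDuty or ServiceNow schema.

```python
import json


def build_alert(title: str, severity: str, runbook_steps: list[str]) -> dict:
    """Build an alert payload that carries its runbook inline (hypothetical schema)."""
    if not runbook_steps:
        # Refuse to emit an alert the responder cannot act on immediately.
        raise ValueError("alert must carry inline runbook steps")
    return {
        "title": title,
        "severity": severity,
        "runbook": [f"{i}. {step}" for i, step in enumerate(runbook_steps, start=1)],
    }


payload = build_alert(
    title="Disk usage above baseline on app server",  # illustrative alert
    severity="high",
    runbook_steps=[
        "Check which directory is growing",
        "Rotate or archive old application logs",
        "Escalate if usage keeps rising after cleanup",
    ],
)
print(json.dumps(payload, indent=2))
```

Rejecting alerts that have no steps makes the runbook a requirement of the alert definition rather than an optional link.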