We are seeking a Senior AI Ops Engineer to ensure machine learning systems operate reliably and predictably in production at enterprise scale. This role focuses on model health, data integrity, monitoring, and incident leadership. You will work closely with Data Science and platform teams to ensure AI systems perform correctly under real-world conditions.
What You’ll Do
– Monitor ML models in production and define meaningful SLIs/SLOs
– Detect and investigate model drift, data anomalies, and performance degradation
– Perform deep-dive SQL investigations (Snowflake or equivalent)
– Lead P1/P2 incident triage and drive structured postmortems
– Improve monitoring dashboards and alert quality (Datadog, Grafana, Prometheus)
– Validate deployment readiness, rollback strategies, and release standards
– Participate in the on-call rotation and own critical incidents