Senior AI Ops Engineer
Must Haves
- 5+ years in Data Science, ML Engineering, or AI Operations
- Hands-on experience monitoring ML models in production
- Strong understanding of SLIs/SLOs, model drift, data anomalies, and performance degradation
- Ability to perform deep-dive SQL investigations (Snowflake or equivalent)
- Experience supporting high-severity incidents (P1/P2) and driving structured postmortems
- Familiarity with Azure or other cloud environments
- Experience with CI/CD pipelines, Docker, and Kubernetes
- Ability to stay structured, analytical, and calm during incident response
Pluses
- Experience with Datadog, Grafana, Prometheus, or similar observability tools
- Background supporting enterprise-scale ML systems
- Exposure to model deployment readiness, rollback strategies, and release standards
- Prior participation in on-call rotations
- Experience improving alert quality, dashboards, and telemetry
Day-to-Day
You’ll ensure ML systems run reliably in production by monitoring model health, defining SLIs/SLOs, and investigating drift, anomalies, and performance issues. You’ll lead P1/P2 incident triage, drive postmortems, and partner with Data Science and Platform teams to validate deployment readiness and maintain operational excellence. Your work includes deep SQL investigations, improving observability dashboards, refining alerting, and owning critical incidents as part of the on-call rotation.