Senior AI Operations Engineer

Must Haves

5+ years in Data Science, ML Engineering, or AI Operations
Hands-on experience monitoring ML models in production
Strong understanding of SLIs/SLOs, model drift, data anomalies, and performance degradation
Ability to perform deep-dive SQL investigations (Snowflake or equivalent)
Experience supporting high-severity incidents (P1/P2) and driving structured postmortems
Familiarity with Azure or other cloud environments
Experience with CI/CD pipelines, Docker, and Kubernetes
Ability to stay structured, analytical, and calm during incident response

Pluses

Experience with Datadog, Grafana, Prometheus, or similar observability tools
Background supporting enterprise-scale ML systems
Exposure to model deployment readiness, rollback strategies, and release standards
Prior participation in on-call rotations
Experience improving alert quality, dashboards, and telemetry

Day-to-Day

You’ll ensure ML systems run reliably in production by monitoring model health, defining SLIs/SLOs, and investigating drift, anomalies, and performance issues. You’ll lead P1/P2 incident triage, drive postmortems, and partner with Data Science and Platform teams to validate deployment readiness and maintain operational excellence. Your work includes deep SQL investigations, improving observability dashboards, refining alerting, and owning critical incidents as part of the on-call rotation.
