Job Listing

Senior AI Operations Engineer

Must Haves

  • 5+ years in Data Science, ML Engineering, or AI Operations
  • Hands-on experience monitoring ML models in production
  • Strong understanding of SLIs/SLOs, model drift, data anomalies, and performance degradation
  • Ability to perform deep-dive SQL investigations (Snowflake or equivalent)
  • Experience supporting high-severity incidents (P1/P2) and driving structured postmortems
  • Familiarity with Azure or other cloud environments
  • Experience with CI/CD pipelines, Docker, and Kubernetes
  • Ability to stay structured, analytical, and calm during incident response

Pluses

  • Experience with Datadog, Grafana, Prometheus, or similar observability tools
  • Background supporting enterprise-scale ML systems
  • Exposure to model deployment readiness, rollback strategies, and release standards
  • Prior participation in on-call rotations
  • Experience improving alert quality, dashboards, and telemetry

Day-to-Day

You’ll ensure ML systems run reliably in production by monitoring model health, defining SLIs/SLOs, and investigating drift, anomalies, and performance issues. You’ll lead P1/P2 incident triage, drive postmortems, and partner with Data Science and Platform teams to validate deployment readiness and maintain operational excellence. Your work includes deep SQL investigations, improving observability dashboards, refining alerting, and owning critical incidents as part of the on-call rotation.
