Job Responsibilities:
We are seeking a Director of Software Engineering, AI Deployment with a strong SRE orientation to lead the software and systems engineering required to bring cutting-edge AI models to production. This role is responsible for engineering AI model services, building supporting infrastructure, and ensuring that the systems are scalable, resilient, observable, and production-ready.
You’ll oversee teams building backend APIs, automation tooling, and deployment pipelines, while also defining performance, availability, and reliability standards. This role operates at the intersection of software engineering, DevOps, and AI.
Essential Duties and Responsibilities
- Lead the design and engineering of scalable, robust, and testable software systems that wrap and serve AI/ML models.
- Drive development of reusable APIs, frameworks, and libraries to accelerate integration of AI into customer-facing products.
- Oversee engineering of high-performance model inference systems, with a focus on both cloud-native and on-premise environments.
- Architect backend services that are API-first, containerized, and designed for high availability.
- Ensure all services are testable and observable, and that they meet the handoff criteria for release-candidate testing by the QA team, supporting continuous integration, automated validation, and smooth production rollout.
- Define and implement SLOs, SLIs, and error budgets for model-backed services.
- Drive implementation of robust monitoring, alerting, logging, and auto-recovery mechanisms.
- Build resilience and observability into AI systems by design, and implement incident response protocols, runbooks, and reliability audits.
- Lead efforts to optimize AI model serving performance: memory, compute, GPU usage, latency, and cost-efficiency.
- Architect systems that can scale elastically based on demand, while maintaining deterministic behavior and uptime guarantees.
- Oversee buildout of deployment automation tools, CI/CD for models and software components, and rollback systems.
- Manage and grow a team of software and systems engineers responsible for end-to-end AI system readiness.
- Set strategy for software delivery, technical quality, operational metrics, and performance benchmarks.
Pre-Requisites:
Qualifications
- 10+ years in software engineering, with 4+ years in engineering leadership or director roles.
- Demonstrated experience building and running production-grade AI/ML systems.
- Deep expertise in backend development, API design, and cloud infrastructure (AWS, GCP, or Azure).
- Solid grounding in SRE principles, including incident response, observability, error budgeting, and reliability metrics.
- Strong knowledge of site reliability tooling (e.g., Prometheus, Grafana, OpenTelemetry, Sentry).
- Familiarity with model serving frameworks (e.g., Triton, TorchServe, Ray Serve) and GPU compute orchestration.
- Familiarity with LLM pipelines, streaming inference, or hybrid deployment environments (cloud + edge).
- Prior ownership of large-scale AI delivery platforms or model hosting infrastructure.
- Experience with CI/CD, the software development lifecycle (SDLC), AI model lifecycle tooling, and infrastructure-as-code.
- Excellent problem-solving, analytical, and decision-making abilities.
- Strong communication and stakeholder management skills.
- Ability and willingness to learn new technologies and apply them at work to stay ahead in a fast-paced, high-pressure, agile environment.
- Excellent written and verbal communication skills for coordinating across teams.
Education & Experience
Bachelor's or Master's degree in Computer Science, Software Engineering, or an equivalent field.
Travel Requirements
- Role is based in the Singapore office and may require up to one trip per year.