Director of Software Engineering, AI Deployment

Razer

10+ Years | Singapore, Singapore (On Site) | Full Time | 3 months ago

Job Summary

As Director of Software Engineering, AI Deployment, you will lead teams in engineering scalable, resilient, and observable AI model services and supporting infrastructure. This role focuses on bringing cutting-edge AI models to production, overseeing backend APIs, automation tooling, and deployment pipelines, and defining performance and reliability standards. You will operate at the intersection of software engineering, DevOps, and AI, ensuring systems are production-ready and optimized for performance and cost-efficiency.

Must Have

Lead design and engineering of scalable, robust, and testable software systems that wrap and serve AI/ML models.
Drive development of reusable APIs, frameworks, and libraries to accelerate integration of AI into customer-facing products.
Oversee engineering of high-performance model inference systems, with a focus on both cloud-native and on-premise environments.
Architect backend services that are API-first, containerized, and designed for high availability.
Ensure all services are testable and observable, and that they meet the handoff criteria for release-candidate testing by the QA team, in support of continuous integration, automated validation, and smooth production rollout.
Define and implement SLOs, SLIs, and error budgets for model-backed services.
Drive implementation of robust monitoring, alerting, logging, and auto-recovery mechanisms.
Build resilience and observability into AI systems by design, and implement incident response protocols, runbooks, and reliability audits.
Lead efforts to optimize AI model serving performance: memory, compute, GPU usage, latency, and cost-efficiency.
Architect systems that can scale elastically based on demand, while maintaining deterministic behavior and uptime guarantees.
Oversee buildout of deployment automation tools, CI/CD for models and software components, and rollback systems.
Manage and grow a team of software and systems engineers responsible for end-to-end AI system readiness.
Set strategy for software delivery, technical quality, operational metrics, and performance benchmarks.
10+ years in software engineering, with 4+ years in engineering leadership or director roles.
Demonstrated experience building and running production-grade AI/ML systems.
Deep expertise in backend development, API design, and cloud infrastructure (AWS, GCP, or Azure).
Solid grounding in SRE principles — including incident response, observability, error budgeting, and reliability metrics.
Strong knowledge of site reliability tooling (e.g., Prometheus, Grafana, OpenTelemetry, Sentry).
Familiarity with model serving frameworks (e.g., Triton, TorchServe, Ray Serve) and GPU compute orchestration.
Experience with CI/CD, the software development lifecycle (SDLC), AI model lifecycle tooling, and infrastructure-as-code.
Bachelor's or Master's in Computer Science, Software Engineering, or equivalent.

Good to Have

Familiarity with LLM pipelines, streaming inference, or hybrid deployment environments (cloud + edge).
Prior ownership of large-scale AI delivery platforms or model hosting infrastructure.
Ability and willingness to learn new technologies and apply them at work to stay ahead in a fast-paced, high-pressure, agile environment.
Excellent problem-solving, analytical, and decision-making abilities.
Strong communication and stakeholder management skills.

Perks & Benefits

Global mission to revolutionize the way the world games
Opportunity to make an impact globally
Work with a global team spread across 5 continents
Unique, gamer-centric #LifeAtRazer experience
Accelerated growth, both personally and professionally
Certified as a Great Place to Work® in both the United States and Singapore

Job Description

Job Responsibilities:

We are seeking a Director of Software Engineering, AI Deployment with a strong SRE orientation to lead the software and systems engineering required to bring cutting-edge AI models to production. This role is responsible for engineering AI model services, building supporting infrastructure, and ensuring that the systems are scalable, resilient, observable, and production-ready.

You’ll oversee teams building backend APIs, automation tooling, and deployment pipelines, while also defining performance, availability, and reliability standards. This role operates at the intersection of software engineering, DevOps, and AI.

Essential Duties and Responsibilities

Lead the design and engineering of scalable, robust, and testable software systems that wrap and serve AI/ML models.
Drive development of reusable APIs, frameworks, and libraries to accelerate integration of AI into customer-facing products.
Oversee engineering of high-performance model inference systems, with a focus on both cloud-native and on-premise environments.
Architect backend services that are API-first, containerized, and designed for high availability.
Ensure all services are testable and observable, and that they meet the handoff criteria for release-candidate testing by the QA team, in support of continuous integration, automated validation, and smooth production rollout.
Define and implement SLOs, SLIs, and error budgets for model-backed services.
Drive implementation of robust monitoring, alerting, logging, and auto-recovery mechanisms.
Build resilience and observability into AI systems by design, and implement incident response protocols, runbooks, and reliability audits.
Lead efforts to optimize AI model serving performance: memory, compute, GPU usage, latency, and cost-efficiency.
Architect systems that can scale elastically based on demand, while maintaining deterministic behavior and uptime guarantees.
Oversee buildout of deployment automation tools, CI/CD for models and software components, and rollback systems.
Manage and grow a team of software and systems engineers responsible for end-to-end AI system readiness.
Set strategy for software delivery, technical quality, operational metrics, and performance benchmarks.

Pre-Requisites:

Qualifications

10+ years in software engineering, with 4+ years in engineering leadership or director roles.
Demonstrated experience building and running production-grade AI/ML systems.
Deep expertise in backend development, API design, and cloud infrastructure (AWS, GCP, or Azure).
Solid grounding in SRE principles — including incident response, observability, error budgeting, and reliability metrics.
Strong knowledge of site reliability tooling (e.g., Prometheus, Grafana, OpenTelemetry, Sentry).
Familiarity with model serving frameworks (e.g., Triton, TorchServe, Ray Serve) and GPU compute orchestration.
Familiarity with LLM pipelines, streaming inference, or hybrid deployment environments (cloud + edge).
Prior ownership of large-scale AI delivery platforms or model hosting infrastructure.
Experience with CI/CD, the software development lifecycle (SDLC), AI model lifecycle tooling, and infrastructure-as-code.
Excellent problem-solving, analytical, and decision-making abilities.
Strong communication and stakeholder management skills.
Ability and willingness to learn new technologies and apply them at work to stay ahead in a fast-paced, high-pressure, agile environment.
Excellent written and verbal communication skills for coordinating across teams.

Education & Experience

Bachelor's or Master's in Computer Science, Software Engineering, or equivalent.

Travel Requirements

The role is based in the Singapore office and may require up to one trip per year.

13 Skills Required For This Role

Team Management, Communication, Forecasting Budgeting, Game Texts, Quality Control, Agile Development, Incident Response, AWS, Azure, Model Serving, Prometheus, Grafana, CI/CD
