Software Engineer - AI/ML - Infrastructure Engineer

2 Hours ago • 2-4 Years
Research Development

Job Description

About The Role:

As an AI/ML Engineer in the SRE and Observability team at GoTo, you will build intelligent systems that detect anomalies, reduce incidents, and accelerate root cause analysis. Your work will directly improve reliability across platforms, helping teams resolve issues faster and keep services running smoothly. You will apply ML techniques to correlate metrics, logs, and traces, and design automation that prevents recurring issues. By embedding AI into daily operations, you will enable self-learning systems and empower engineers to focus on higher-impact problems. If you are excited about applying AI to real-world reliability challenges, this role is for you.

What Will You Do

  • AI-Driven Incident Detection – Build and deploy ML models to identify anomalies across metrics, logs, and traces before they cause incidents.
  • Root Cause Analysis Automation – Correlate observability signals and analyze historical incidents to accelerate RCA and provide resolution recommendations.
  • Agentic AI Solutions – Develop AI agents that autonomously troubleshoot, recommend actions, and improve incident response.
  • Integration with Operations Tools – Connect AI insights with ticketing, alerting, and incident management platforms for seamless workflows.
  • Collaboration and Enablement – Partner with SRE, Monitoring, and Security teams to embed AI-driven practices into daily operations.

What Will You Need

  • 2-4 years of experience in machine learning, deep learning, and data analysis, with a focus on anomaly detection and NLP.
  • At least 2 years of experience in Site Reliability Engineering, DevOps, or cloud infrastructure roles.
  • Familiarity with agent and workflow tools such as n8n, LangGraph, CrewAI, and LangChain.
  • Proficiency in scripting or programming (e.g., Python, Go, or Bash) for automation and tooling.
  • Understanding of observability tools like Prometheus, Grafana, ELK, or similar logging/monitoring stacks.
  • Strong problem-solving skills with a focus on performance tuning, reliability, and incident response.
  • Excellent communication and collaboration skills, with the ability to work effectively across cross-functional teams.

About The Team:

You'll be part of the SRE and Observability team, which ensures the reliability, performance, and scalability of GoTo’s critical platforms. We build and operate systems that collect, analyze, and act on observability signals across metrics, logs, traces, and profiles. By bringing AI and automation into this space, we are reducing noise, detecting incidents faster, and accelerating root cause analysis. Joining this team means working at the intersection of AI, reliability, and large-scale systems, where your contributions directly impact millions of users and the engineers who keep our platforms running.

About GoTo Group

GoTo Group is the largest digital ecosystem in Indonesia, with a mission to “Empower Progress” by offering technological infrastructure and solutions for everyone to access and thrive in the digital economy. The GoTo ecosystem consists of on-demand transportation services, food and grocery delivery, logistics and fulfillment, as well as financial and payment services through the Gojek and GoTo Financial platforms. It is the first platform in Southeast Asia that hosts these crucial use cases in a single ecosystem, capturing the majority of Indonesia’s vast consumer households.

About Gojek

Gojek is Southeast Asia’s leading on-demand platform and a pioneer of the multi-service ecosystem, with over 2.5 million driver partners across the region offering a wide range of services such as transportation, food delivery, logistics, and more. With its mission to create impact at scale, Gojek is committed to resolving consumer problems and raising standards of living by connecting consumers to the best providers of goods and services in the market.

About GoTo Financial

GoTo Financial accelerates financial inclusion through its leading financial services and merchant solutions. Its consumer services include GoPay and GoPayLater, and it serves businesses of all sizes through Midtrans, Moka, GoBiz Plus, GoBiz, and Selly. With its trusted and inclusive ecosystem of products, GoTo Financial is open to new growth opportunities and aims to empower everyone to Make It Happen, Make It Together, Make It Last.

GoTo and its business units, including Gojek and GoTo Financial ("GoTo"), only post job opportunities on our official channels: our respective company websites and LinkedIn. GoTo is not liable for any job postings or job offers that did not originate from us. You should conduct your own due diligence to avoid falling victim to fake job scams that did not originate from GoTo's official recruitment channels.
