Reliability Engineer, AI & Data Platforms

3+ Years Experience • $147,400 PA – $220,900 PA
Research Development

Job Description

Join the AI and Data Platforms team, where we build and manage cloud-based data platforms handling petabytes of data at scale. We are looking for a passionate, independent Reliability Engineer with a strong understanding of data and ML systems. You will develop and operate large-scale big data platforms, optimize performance and cost, automate operations, and identify and resolve production issues to ensure a reliable data platform experience for critical applications such as analytics, reporting, and AI/ML apps. If you thrive in a fast-paced environment, love crafting solutions that don't yet exist, and communicate well across diverse teams, we invite you to contribute to high standards in an exciting and dynamic setting.
Responsibilities:
  • Develop and operate large-scale big data platforms using open source and other solutions.
  • Support critical applications including analytics, reporting, and AI/ML apps.
  • Optimize platform performance and cost efficiency.
  • Automate operational tasks for big data systems.
  • Identify and resolve production errors and issues to ensure platform reliability and user experience.
Must Have:
  • 3+ years of professional software engineering experience with large-scale big data platforms, including strong programming skills in Java, Scala, Python, or Go.
  • Proven expertise in designing, building, and operating large-scale distributed data processing systems, with a strong focus on Apache Spark.
  • Hands-on experience with table formats and data lake technologies such as Apache Iceberg, ensuring scalability, reliability, and optimized query performance.
  • Skilled at coding for distributed systems and developing resilient data pipelines.
  • Strong background in incident management, including troubleshooting, root cause analysis, and performance optimization in complex production environments.
  • Proficient with Unix/Linux systems and command-line tools for debugging and operational support.
Good To Have:
  • Expertise in designing, building, and operating critical, large-scale distributed systems with a focus on low latency, fault tolerance, and high availability.
  • Contributions to open source projects.
  • Experience with multiple public cloud infrastructures, managing multi-tenant Kubernetes clusters at scale, and debugging Kubernetes/Spark issues.
  • Experience with workflow and data pipeline orchestration tools (e.g., Airflow, dbt).
  • Understanding of data modeling and data warehousing concepts.
  • Familiarity with the AI/ML stack, including GPUs, MLflow, and Large Language Models (LLMs).
  • A learning attitude focused on continuous improvement of yourself, your team, and the organization.
  • A solid grasp of software engineering best practices, including the full development lifecycle, secure coding, and building reusable frameworks or libraries.

Base pay is one part of our total compensation package and is determined within a range. This provides the opportunity to progress as you grow and develop within a role. The base pay range for this role is between $147,400 and $220,900, and your base pay will depend on your skills, qualifications, experience, and location.

Employees also have the opportunity to become shareholders through participation in discretionary employee stock programs. Employees are eligible for discretionary restricted stock unit awards and can purchase stock at a discount through voluntary participation in the Employee Stock Purchase Plan. You'll also receive benefits including comprehensive medical and dental coverage, retirement benefits, a range of discounted products and free services, and reimbursement for certain educational expenses, including tuition for formal education related to advancing your career. Additionally, this role might be eligible for discretionary bonuses or commission payments, as well as relocation assistance. Learn more about Benefits.

Note: Benefit, compensation and employee stock programs are subject to eligibility requirements and other terms of the applicable plan or program.

This is an equal opportunity employer that is committed to inclusion and diversity. We seek to promote equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics. Learn more about your EEO rights as an applicant.

Set alerts for more jobs like Reliability Engineer, AI & Data Platforms
Set alerts for new jobs by Apple
Set alerts for new Research Development jobs in United States
Set alerts for new jobs in United States
Set alerts for Research Development (Remote) jobs
