Site Reliability Engineer II

1 Week ago • 6 Years +

About the job

Summary By Outscal

Seeking a Senior Site Reliability Engineer with 6+ years of experience in large-scale cloud environments. Expertise in cloud architecture, automation, programming (Python, Golang), and Kubernetes is crucial. You'll lead cloud projects, mentor junior engineers, and drive initiatives across SRE and production engineering.
It takes powerful technology to connect our brands and partners with an audience of hundreds of millions of people. Whether you’re looking to write mobile app code, engineer the servers behind our massive ad tech stacks, or develop algorithms to help us process trillions of data points a day, what you do here will have a huge impact on our business—and the world.

A Little About Us

Yahoo is a global media and tech company that connects people to their passions. We reach nearly 900 million people around the world, bringing them closer to what they love, from finance and sports to shopping, gaming, and news, with the trusted products, content, and tech that fuel their day. For partners, we provide a full-stack platform for businesses to amplify growth and drive more meaningful connections across advertising, search, and media.

Our Site Reliability Engineers (SRE) sit under the office of the CTO as a critical part of the Yahoo Platforms Engineering (YPE) group. YPE enables all of the Yahoo verticals to develop and deliver the best possible products to our customers globally with the highest standards and with modern, secure, and efficient software platforms to power all of Yahoo’s brands at scale. For SRE, ensuring the availability and reliability of our platforms is at the heart of what we do.

We are looking to expand our SRE team with experienced engineers seeking an opportunity to develop their careers in a company undergoing one of the largest technology transformations ever. We are looking for individuals who value collaboration, have solid communication skills, and have a passion for creating quality solutions to support our products.

As a member of our team, you will be responsible for ensuring the stability of Yahoo’s products and infrastructure alongside other DevOps teams. We operate services at a scale covering our entire software, system, and network footprint across our data centers and multiple cloud service providers. We encourage new ideas, and continuously experiment with and evaluate new technologies. Our team structure encourages trust, learning from one another, having fun, and supporting people who are passionate about what they do.

Job Summary:

We are looking for a seasoned Senior Site Reliability Engineer (SRE) with advanced expertise in programming, cloud architecture, and automation. In this role, you will lead large-scale cloud and automation projects, collaborate closely with production engineering teams, and design robust, scalable cloud solutions. You will be instrumental in architecting cloud platforms, driving automation strategies, and ensuring the reliability, security, and scalability of infrastructure. This is a leadership role where you will also mentor junior engineers and drive initiatives across SRE and production engineering.

Responsibilities

Cloud Architecture & Project Leadership:

- Lead initiatives to enhance and optimize existing cloud infrastructure, driving improvements in scalability and resilience through close collaboration with production engineering teams.

- Refine and deploy cloud infrastructure solutions that align with the product and technical roadmap, ensuring seamless integration and maximizing the efficiency of cloud services while minimizing operational disruptions.

- Oversee and manage large-scale projects related to cloud platforms, automation, and performance optimization, ensuring these solutions meet both immediate and long-term business objectives.

- Continuously assess and refine cloud architecture, focusing on security, scalability, and performance across cloud infrastructure, while collaborating with cross-functional teams to ensure operational excellence.

Programming & Development:

- Utilize advanced programming skills to develop and optimize tools for infrastructure management and automation on cloud platforms.

- Write, review, and maintain code in scripting languages (Python, JavaScript) and system programming languages (GoLang) for high-performance infrastructure.

- Collaborate with engineering teams to integrate SRE principles into the product lifecycle, improving both site reliability and product functionality across cloud platforms.

Automation & System Management:

- Develop and implement automation strategies to enhance system deployment, monitoring, and operational efficiency across cloud and on-prem environments.

- Design and manage CI/CD pipelines to improve the speed, reliability, and consistency of software delivery.

- Utilize infrastructure-as-code tools like Terraform, Ansible, and CloudFormation to automate cloud resource provisioning and management on cloud platforms.

- Maintain and support production systems and associated infrastructure, ensuring their availability, performance, and scalability through continuous monitoring and automation.

Collaboration & Communication:

- Work closely with cross-functional teams to understand product and technical roadmaps, identifying potential impacts on system operability and proposing proactive solutions for cloud environments.

- Provide timely and effective solutions to complex technical issues related to both system reliability and product improvements, leveraging cloud capabilities.

Mentorship & Leadership:

- Mentor junior SREs, sharing best practices in cloud architecture, automation, and incident management.

- Foster cross-functional collaboration between development, infrastructure, and operations teams to improve the overall performance and reliability of services in the cloud.

- Lead the effort to continuously improve the availability, scalability, and efficiency of systems across the cloud, driving innovation through monitoring, automation, and performance optimization initiatives.

Qualifications:

- B.E./B.Tech in Computer Science or a related field, or equivalent experience.

- A minimum of 6 years of industry experience in site reliability engineering, systems engineering, or a related role, ideally in large-scale environments, with a focus on supporting 24x7 highly available systems.

- Advanced scripting skills in languages such as Python, Golang, or Java, with the ability to write fully functional scripts/programs for automation and tool development.

- Hands-on experience with cloud platforms (AWS or GCP), with a strong understanding of cloud architecture best practices, deployment strategies, and scaling within these ecosystems.

- Deep knowledge of containerization and orchestration, particularly with Kubernetes, and practical experience managing large-scale containerized environments.

- Expertise in running infrastructure automation tools at scale, such as Git, Airflow, Jenkins, and Screwdriver, for managing code deployments and continuous integration workflows.

- Proficiency in Infrastructure as Code (IaC) tools, such as Ansible, Chef, and Terraform, with experience automating infrastructure at scale.

- Understanding of DevOps methodologies and practices, promoting collaboration between development and operations teams for improved service delivery.

- Experience with monitoring and logging solutions, enabling proactive identification and resolution of issues.

- Familiarity with incident and change management frameworks, such as ITIL or other industry standards.

- Proven experience in mentoring and developing junior SREs.

- Demonstrated ability to provide technical leadership in incident, change, and problem management activities, especially in high-pressure, production environments.

Bonus Points:

- Cloud certifications (e.g., AWS Certified Solutions Architect, Google Professional Cloud Engineer).

- Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD).

- Programming language certification (e.g., PCAP – Certified Associate in Python Programming; Oracle Certified Associate, Java SE 8/11 Programmer (OCAJP); Go Developer Certification (Golang)).

Yahoo is proud to be an equal opportunity workplace. All qualified applicants will receive consideration for employment without regard to, and will not be discriminated against based on age, race, gender, color, religion, national origin, sexual orientation, gender identity, veteran status, disability or any other protected category. Yahoo will consider for employment qualified applicants with criminal histories in a manner consistent with applicable law. Yahoo is dedicated to providing an accessible environment for all candidates during the application process and for employees during their employment. If you need accessibility assistance and/or a reasonable accommodation due to a disability, please submit a request via the Accommodation Request Form (www.yahooinc.com/careers/contact-us.html) or call +1.866.772.3182. Requests and calls received for non-disability related issues, such as following up on an application, will not receive a response.

Yahoo has a high degree of flexibility around employee location and hybrid working. In fact, our flexible-hybrid approach to work is one of the things our employees rave about. Most roles don’t require specific regular patterns of in-person office attendance. If you join Yahoo, you may be asked to attend (or travel to attend) on-site work sessions, team-building, or other in-person events. When these occur, you’ll be given notice to make arrangements. 

If you’re curious about how this factors into this role, please discuss with the recruiter.

Currently work for Yahoo? Please apply on our internal career site.
