Job Summary:
We are looking for a seasoned Senior Site Reliability Engineer (SRE) with advanced expertise in programming, cloud architecture, and automation. In this role, you will lead large-scale cloud and automation projects, collaborate closely with production engineering teams, and design robust, scalable cloud solutions. You will be instrumental in architecting cloud platforms, driving automation strategies, and ensuring the reliability, security, and scalability of infrastructure. This is a leadership role where you will also mentor junior engineers and drive initiatives across SRE and production engineering.
Responsibilities:
Cloud Architecture & Project Leadership:
- Lead initiatives to enhance and optimize existing cloud infrastructure, driving improvements in scalability and resilience through close collaboration with production engineering teams.
- Refine and deploy cloud infrastructure solutions that align with the product and technical roadmap, ensuring seamless integration and maximizing the efficiency of cloud services while minimizing operational disruptions.
- Oversee and manage large-scale projects related to cloud platforms, automation, and performance optimization, ensuring these solutions meet both immediate and long-term business objectives.
- Continuously assess and refine cloud architecture, focusing on security, scalability, and performance across cloud infrastructure, while collaborating with cross-functional teams to ensure operational excellence.
Programming & Development:
- Utilize advanced programming skills to develop and optimize tools for infrastructure management and automation on cloud platforms.
- Write, review, and maintain code in scripting languages (Python, JavaScript) and systems programming languages (Go) for high-performance infrastructure.
- Collaborate with engineering teams to integrate SRE principles into the product lifecycle, improving both site reliability and product functionality across cloud platforms.
Automation & System Management:
- Develop and implement automation strategies to enhance system deployment, monitoring, and operational efficiency across cloud and on-prem environments.
- Design and manage CI/CD pipelines to improve the speed, reliability, and consistency of software delivery.
- Utilize infrastructure-as-code tools like Terraform, Ansible, and CloudFormation to automate cloud resource provisioning and management on cloud platforms.
- Maintain and support production systems and associated infrastructure, ensuring their availability, performance, and scalability through continuous monitoring and automation.
Collaboration & Communication:
- Work closely with cross-functional teams to understand product and technical roadmaps, identifying potential impacts on system operability and proposing proactive solutions for cloud environments.
- Provide timely and effective solutions to complex technical issues related to both system reliability and product improvements, leveraging cloud capabilities.
Mentorship & Leadership:
- Mentor junior SREs, sharing best practices in cloud architecture, automation, and incident management.
- Foster cross-functional collaboration between development, infrastructure, and operations teams to improve the overall performance and reliability of services in the cloud.
- Lead the effort to continuously improve the availability, scalability, and efficiency of systems across the cloud, driving innovation through monitoring, automation, and performance optimization initiatives.
Qualifications:
- B.E./B.Tech in Computer Science or a related field, or equivalent experience.
- A minimum of 6 years of industry experience in site reliability engineering, systems engineering, or a related role, ideally in large-scale environments, with a focus on supporting 24x7 highly available systems.
- Advanced scripting and programming skills in languages such as Python, Go, or Java, with the ability to write fully functional scripts and programs for automation and tool development.
- Hands-on experience with cloud platforms (AWS or GCP), with a strong understanding of cloud architecture best practices, deployment strategies, and scaling within these ecosystems.
- Deep knowledge of containerization and orchestration, particularly with Kubernetes, and practical experience managing large-scale containerized environments.
- Expertise in running infrastructure automation tools at scale, such as Git, Airflow, Jenkins, and Screwdriver, for managing code deployments and continuous integration workflows.
- Proficiency in Infrastructure as Code (IaC) tools, such as Ansible, Chef, and Terraform, with experience automating infrastructure at scale.
- Understanding of DevOps methodologies and practices, promoting collaboration between development and operations teams for improved service delivery.
- Experience with monitoring and logging solutions, enabling proactive identification and resolution of issues.
- Familiarity with incident and change management frameworks, such as ITIL or other industry standards.
- Proven experience in mentoring and developing junior SREs.
- Demonstrated ability to provide technical leadership in incident, change, and problem management activities, especially in high-pressure production environments.
Bonus Points:
- Cloud certifications (e.g., AWS Certified Solutions Architect, Google Professional Cloud Engineer).
- Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD).
- Programming language certification (e.g., PCAP – Certified Associate in Python Programming; Oracle Certified Associate, Java SE 8/11 Programmer (OCAJP); Go Developer Certification (Golang)).
Yahoo is proud to be an equal opportunity workplace. All qualified applicants will receive consideration for employment without regard to, and will not be discriminated against based on, age, race, gender, color, religion, national origin, sexual orientation, gender identity, veteran status, disability or any other protected category. Yahoo will consider for employment qualified applicants with criminal histories in a manner consistent with applicable law. Yahoo is dedicated to providing an accessible environment for all candidates during the application process and for employees during their employment. If you need accessibility assistance and/or a reasonable accommodation due to a disability, please submit a request via the Accommodation Request Form (www.yahooinc.com/careers/contact-us.html) or call +1.866.772.3182. Requests and calls received for non-disability related issues, such as following up on an application, will not receive a response.
Yahoo has a high degree of flexibility around employee location and hybrid working. In fact, our flexible-hybrid approach to work is one of the things our employees rave about. Most roles don’t require specific regular patterns of in-person office attendance. If you join Yahoo, you may be asked to attend (or travel to attend) on-site work sessions, team-building, or other in-person events. When these occur, you’ll be given notice to make arrangements.
If you’re curious about how this factors into this role, please discuss with the recruiter.
Currently work for Yahoo? Please apply on our internal career site.