Site Reliability Engineer - Big Data (7 to 11 years)

About PhonePe Limited:

PhonePe Limited is headquartered in India. Its flagship product, the PhonePe digital payments app, was launched in August 2016. As of April 2025, PhonePe has over 60 crore (600 million) registered users and a digital payments acceptance network of over 4 crore (40 million) merchants. PhonePe also processes over 33 crore (330 million) transactions daily, with an Annualized Total Payment Value (TPV) of over INR 150 lakh crore.

PhonePe’s portfolio of businesses includes the distribution of financial products (Insurance, Lending, and Wealth) as well as new consumer tech businesses in India (Pincode for hyperlocal e-commerce and Indus AppStore, a localized app store for the Android ecosystem). These are aligned with the company’s vision to offer every Indian an equal opportunity to accelerate their progress by unlocking the flow of money and access to services.

Culture:

At PhonePe, we go the extra mile to make sure you can bring your best self to work every day! And that starts with creating the right environment for you. We empower people and trust them to do the right thing. Here, you own your work from start to finish, right from day one. PhonePe-rs solve complex problems and execute quickly, often building frameworks from scratch. If you’re excited by the idea of building platforms that touch millions, ideating with some of the best minds in the country, and executing on your dreams with purpose and speed, join us!

About the Role:

This role is responsible for managing and maintaining complex, distributed big data ecosystems. It ensures the reliability, scalability, and security of large-scale production infrastructure. Key responsibilities include automating processes, optimizing workflows, troubleshooting production issues, and driving system improvements across multiple business verticals.

Roles and Responsibilities:

  • Manage, maintain, and support incremental changes to Linux/Unix environments.
  • Lead on-call rotations and incident responses, conducting root cause analysis and driving postmortem processes.
  • Design and implement automation systems for managing big data infrastructure, including provisioning, scaling, upgrades, and patching clusters.
  • Troubleshoot and resolve complex production issues, identifying root causes and implementing mitigation strategies.
  • Design and review scalable and reliable system architectures.
  • Collaborate with teams to optimize overall system performance.
  • Enforce security standards across systems and infrastructure.
  • Set technical direction, drive standardization, and operate independently.
  • Ensure availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
  • Respond to, analyze, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring.
  • Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
  • Monitor and optimize system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.
  • Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle.
  • Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities.
  • Develop and enforce SRE best practices and principles.
  • Align across functional teams on priorities and deliverables.
  • Drive automation to enhance operational efficiency.

Skills Required:

  • Over 7 years of experience managing and maintaining distributed big data ecosystems.
  • Strong expertise in Linux, including IP networking, iptables, and IPsec.
  • Proficiency in scripting/programming with languages like Perl, Golang, or Python.
  • Hands-on experience with the Hadoop stack (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot).
  • Familiarity with open-source configuration management and deployment tools such as Puppet, Salt, Chef, or Ansible.
  • Solid understanding of networking, open-source technologies, and related tools.
  • Excellent communication and collaboration skills.
  • Experience with DevOps tools: SaltStack, Ansible, Docker, Git.
  • Experience with SRE logging and monitoring tools: ELK stack, Grafana, Prometheus, OpenTSDB, OpenTelemetry.

Good to Have:

  • Experience managing infrastructure on public cloud platforms (AWS, Azure, GCP).
  • Experience in designing and reviewing system architectures for scalability and reliability.
  • Experience with observability tools to visualize and alert on system performance.

PhonePe Full Time Employee Benefits (Not applicable for Intern or Contract Roles)

  • Insurance Benefits - Medical Insurance, Critical Illness Insurance, Accidental Insurance, Life Insurance
  • Wellness Program - Employee Assistance Program, Onsite Medical Center, Emergency Support System
  • Parental Support - Maternity Benefit, Paternity Benefit Program, Adoption Assistance Program, Day-care Support Program
  • Mobility Benefits - Relocation benefits, Transfer Support Policy, Travel Policy
  • Retirement Benefits - Employee PF Contribution, Flexible PF Contribution, Gratuity, NPS, Leave Encashment
  • Other Benefits - Higher Education Assistance, Car Lease, Salary Advance Policy

Our inclusive culture promotes individual expression, creativity, innovation, and achievement, and in turn helps us better understand and serve our customers. We see ourselves as a place for intellectual curiosity, ideas, and debates, where diverse perspectives lead to deeper understanding and better-quality results. PhonePe is an equal opportunity employer and is committed to treating all its employees and job applicants equally, regardless of gender, sexual preference, religion, race, color, or disability. If you have a disability or special need that requires assistance or reasonable accommodation during the application and hiring process, including support for the interview or onboarding process, please fill out this form.

Read more about PhonePe on our blog.