Site Reliability Engineer (Activate)

PubMatic

Maharashtra, India (Hybrid)

11 Months ago • 3-10 Years • Operations • Devops

Job Summary

Job Description

Site Reliability Engineer needed for large-scale distributed software applications. Must have 3+ years of experience in software application/product support, programming in Go, and scripting in languages such as Shell or Python. Experience with monitoring tools such as Grafana, Nagios, Influx, and ELK is required.

Must have:

Software Support
Go Programming
Shell/Python
Monitoring Tools

Good to have:

Technical Engineering
MySQL Database
Docker Orchestration
Zenduty Incident Management

11 skills required

shell

elk

nagios

grafana

python

docker

linux

incident-response

mysql

communication

networking

Job Details

About the job

As a Site Reliability Engineer (SRE), you will be responsible for the Activate and Production Infrastructure. Your essential duties are to ensure the seamless operation and optimal performance of large-scale distributed software applications. You will maintain a robust, high-performing environment, contribute to the reliability of our services, and develop solutions that keep them available 24/7. Drawing on your technical expertise, you will help provide a seamless experience for our users while upholding high standards of operational excellence. Your specific responsibilities include:

Role and Responsibilities:

1. Monitoring and Alerting

a. Review existing monitoring tools and systems, and set up new ones as needed, to track system performance and key metrics.

2. Incident Management

a. Monitor alerts and logs to promptly identify incidents or anomalies.

b. Prioritize incidents based on severity and potential impact on stability and reliability.

c. Engage in effective incident resolution, applying necessary fixes and mitigations to restore normal operations.

3. On-Call Responsibilities

a. Organize on-call schedules to ensure 24/7 coverage for incident response.

b. Respond to alerts, troubleshoot issues, and coordinate with NOC and Engineering teams for incident resolution.

c. Conduct post-incident reviews to identify root causes, learn from incidents, and implement preventive measures.

4. Automation and Tooling

a. Review pre-existing and build new automation scripts and tools as needed to streamline repetitive tasks, enhance efficiency, and reduce manual errors.

b. Regularly update and maintain tools used for monitoring, deployment, and incident management to align with evolving needs.

5. Performance Optimization

a. Analyze application performance using profiling and monitoring tools to identify bottlenecks and areas for improvement.

b. Work on optimizations, infrastructure upgrades, and architectural improvements to enhance system performance and efficiency.

6. Capacity Planning and Scaling

a. Monitor resource utilization and trends to predict capacity needs and plan for scaling.

b. Scale resources, such as servers and databases, based on usage patterns and anticipated growth to maintain performance and reliability, and automate the entire sizing process.

7. Disaster Recovery and Redundancy

a. Develop and maintain disaster recovery plans and procedures to ensure business continuity in case of failures or disasters.

b. Implement redundancy and failover strategies to minimize downtime and maintain service availability during failures.

8. Knowledge Sharing and Documentation

a. Create and maintain comprehensive documentation for configurations, procedures, incidents, and best practices.

b. Foster a culture of knowledge sharing within the team by conducting regular knowledge-sharing sessions and training programs.

9. Feedback Loop and Continuous Improvement

a. Collect feedback from incidents, post-mortems, and NOC/Dev team interactions to identify areas for improvement.

b. Iterate on processes, tools, and systems based on feedback and lessons learned.

10. Collaboration and Communication

a. Collaborate closely with Engineering and DC/NOC teams to align goals and priorities.

b. Ensure open and transparent communication within the team and with stakeholders, providing regular updates on incidents, progress, and initiatives.

Required Skills and Qualifications

Bachelor's degree in computer science or a related discipline

3+ years of total experience in software application/product support

Ability to program in languages such as Go, and in scripting languages such as Shell or Python

Prior experience in technical engineering is a plus

A proactive approach to identifying problems, performance bottlenecks, and areas for improvement

Must know networking, database (MySQL), and Linux system concepts, and be able to debug and analyze core dumps

Hands-on experience with monitoring and observability tools like Grafana, Nagios, Influx, ELK, etc.

Familiarity with containerization and orchestration tools such as Docker, and with incident management systems such as Zenduty

Excellent communication and collaboration skills, with the ability to work effectively across teams.

A self-motivated, positive mindset when investigating incidents

Similar Jobs

Live-Ops Specialist- Data Analysis & In-game Activities

Tencent

Shenzhen, Guangdong Province, China (On-Site)

• 1 Year ago

Customization and Verification Manager

NVIDIA

Yokne'am Illit, North District, Israel (On-Site)

• 7 Months ago

Product Owner (PAM)

Saviynt

Bengaluru, Karnataka, India (Hybrid)

• 10 Months ago

Senior Computer Systems Linux Engineer w/ Python

Luxoft

Bucharest, Bucharest, Romania (On-Site)

• 9 Months ago

Senior C Developer for Imunify360 (worldwide remote, work anywhere)

CloudLinux

Sofia City Province, Bulgaria (Remote)

• 8 Months ago

Junior Legal Specialist

Futurum Technology

Kraków, Lesser Poland Voivodeship, Poland (On-Site)

• 1 Year ago

Developer Support Engineer

Unity

Vilnius, Vilnius County, Lithuania (On-Site)

• 9 Months ago

Service Manager

Tesla

Vienna, Vienna, Austria (On-Site)

• 6 Months ago

Analyst Security Intelligence

Sphere Entertainment Co

Las Vegas, Nevada, United States (On-Site)

• 8 Months ago

Manager- Vendor Audits (Lending)

PhonePe

Bengaluru, Karnataka, India (On-Site)

• 10 Months ago

Similar Skill Jobs

Integration Specialist

CAE

Tampa, Florida, United States (On-Site)

• 11 Months ago

Controls Engineer (m/f/d) - German speaker

Fluence

Berlin, Berlin, Germany (Hybrid)

• 11 Months ago

AWS Devops Engineer II - R-19974

Rackspace Technology

India (Remote)

• 9 Months ago

Senior Backend Engineer for Global Realistic 3A Action Game

Tencent

Shenzhen, Guangdong Province, China (On-Site)

• 8 Months ago

Technical Marketing Manager - Data Analytics Specialist

NVIDIA

Santa Clara, California, United States (On-Site)

• 7 Months ago

Research Scientist, Multimodality

ByteDance

San Jose, California, United States (On-Site)

• 10 Months ago

DevOps (Data DevOps) - Lead DevOps Engineer

Paytm

Noida, Uttar Pradesh, India (On-Site)

• 10 Months ago

PostgreSQL Developer with Oracle

Luxoft

Chennai, Tamil Nadu, India (On-Site)

• 9 Months ago

Design Verification Infrastructure Engineer

NVIDIA

Bengaluru, Karnataka, India (On-Site)

• 7 Months ago

Software Engineer, Machine Learning

Jobs in Pune, Maharashtra, India

Confirmations Analyst

Deutsche Bank

Bengaluru, Karnataka, India (Hybrid)

• 11 Months ago

Associate Manager, Technical Support

Rackspace Technology

Gurugram, Haryana, India (Remote)

• 9 Months ago

Hydrology and Hydraulic Engineer

Assystems

Bengaluru, Karnataka, India (On-Site)

• 10 Months ago

Social Media Manager

DRIFE

Bengaluru, Karnataka, India (On-Site)

• 11 Months ago

Senior Software Engineer – Tableau/Looker Admin - Data Platform

Warner Bros Games

Hyderabad, Telangana, India (Hybrid)

• 7 Months ago

Salesforce Technical Lead

Highspot

Hyderabad, Telangana, India (Hybrid)

• 11 Months ago

Talent Acquisition Lead (Volume Hiring) - South Zone - Manager

Paytm

Hyderabad, Telangana, India (On-Site)

• 10 Months ago

Senior Security Engineer

DataVisor

India (Remote)

• 11 Months ago

Analyst - LCM - Mumbai - 764

ION

Mumbai, Maharashtra, India (On-Site)

• 11 Months ago

Sales Team Lead - Karnal- Oil & Gas

Paytm

Haryana, India (On-Site)

• 9 Months ago

Operations Jobs

Lulapay Operations Graduate

Lulalend

Cape Town, Western Cape, South Africa (On-Site)

• 6 Months ago

Park Operations Host - Full Time

The Walt Disney Company

Hong Kong (On-Site)

• 7 Months ago

Executive Assistant

Niantic

Tokyo, Japan (Hybrid)

• 11 Months ago

Change Coordinator

Rush Street Interactive

Tartu, Tartu County, Estonia (Remote)

• 7 Months ago

Customer Experience (CX) Global Special Ops Manager - APAC

Warner Bros Discovery

Petaling Jaya, Selangor, Malaysia (On-Site)

• 9 Months ago

Technical Compliance Specialist (Change Management)

Evolution

San José, San José Province, Costa Rica (On-Site)

• 10 Months ago

Indoor Operations Coordinator - Japan

Trackman

Tokyo, Tokyo, Japan (On-Site)

• 8 Months ago

B2B Account Executive (Cebu)

NinjaVan

Cebu City, Central Visayas, Philippines (Hybrid)

• 11 Months ago

Team Leader - Food & Beverage

Rank group

Croydon, England, United Kingdom (On-Site)

• 9 Months ago

Senior Procurement Manager EMEA

Unity

London, England, United Kingdom (On-Site)

• 10 Months ago

hello@outscal.com

Made in INDIA 💛💙