Site Reliability Engineer (Activate)

4 Months ago • 3-10 Years • Operations • DevOps

Job Summary

Job Description

Site Reliability Engineer needed for large-scale distributed software applications. Must have 3+ years of experience in software application/product support and be able to program in Go and in scripting languages such as Shell or Python. Experience with monitoring tools such as Grafana, Nagios, Influx, and ELK is required.
Must have:
  • Software Support
  • Go Programming
  • Shell/Python
  • Monitoring Tools
Good to have:
  • Technical Engineering
  • MySQL Database
  • Docker Orchestration
  • Zenduty Incident

Job Details

About the job

As a Site Reliability Engineer, you will be responsible for the Activate and Production infrastructure. Your essential duties are to keep large-scale distributed software applications running smoothly and performing well, to contribute to the reliability of our services, and to develop solutions that support 24/7 availability. Your specific responsibilities include:


Role and Responsibilities:


1. Monitoring and Alerting

a. Review existing monitoring tools and systems, and set up new ones as needed, to track system performance and key metrics.

2. Incident Management

a. Monitor alerts and logs to promptly identify incidents or anomalies.

b. Prioritize incidents based on severity and potential impact on stability and reliability.

c. Engage in effective incident resolution, applying necessary fixes and mitigations to restore normal operations.

3. On-Call Responsibilities

a. Organize on-call schedules to ensure 24/7 coverage for incident response.

b. Respond to alerts, troubleshoot issues, and coordinate with NOC and Engineering teams for incident resolution.

c. Conduct post-incident reviews to identify root causes, learn from incidents, and implement preventive measures.

4. Automation and Tooling

a. Review existing automation scripts and tools, and build new ones as needed, to streamline repetitive tasks, enhance efficiency, and reduce manual errors.

b. Regularly update and maintain tools used for monitoring, deployment, and incident management to align with evolving needs.

5. Performance Optimization

a. Analyze application performance using profiling and monitoring tools to identify bottlenecks and areas for improvement.

b. Work on optimizations, infrastructure upgrades, and architectural improvements to enhance system performance and efficiency.

6. Capacity Planning and Scaling

a. Monitor resource utilization and trends to predict capacity needs and plan for scaling.

b. Scale resources, such as servers and databases, based on usage patterns and anticipated growth to maintain performance and reliability. Also, automate the entire sizing process.

7. Disaster Recovery and Redundancy

a. Develop and maintain disaster recovery plans and procedures to ensure business continuity in case of failures or disasters.

b. Implement redundancy and failover strategies to minimize downtime and maintain service availability during failures.

8. Knowledge Sharing and Documentation

a. Create and maintain comprehensive documentation for configurations, procedures, incidents, and best practices.

b. Foster a culture of knowledge sharing within the team, conducting regular knowledge-sharing sessions and training programs.

9. Feedback Loop and Continuous Improvement

a. Collect feedback from incidents, post-mortems, and NOC/Dev team interactions to identify areas for improvement.

b. Iterate on processes, tools, and systems based on feedback and lessons learned.

10. Collaboration and Communication

a. Collaborate closely with Engineering and DC/NOC teams to align goals and priorities.

b. Ensure open and transparent communication within the team and with stakeholders, providing regular updates on incidents, progress, and initiatives.

Required Skills and Qualifications

  • Bachelor's degree in computer science or a related discipline
  • 3+ years of total experience in software application/product support
  • Ability to program in Go and in scripting languages such as Shell or Python
  • Prior experience in technical engineering is good to have
  • A proactive approach to identifying problems, performance bottlenecks, and areas for improvement
  • Must know networking, database (MySQL), and Linux system concepts, including debugging and analyzing core dumps
  • Hands-on experience with monitoring and observability tools like Grafana, Nagios, Influx, ELK, etc.
  • Familiarity with orchestration tools like Docker and Grafana and incident management systems like Zenduty
  • Excellent communication and collaboration skills, with the ability to work effectively across teams
  • A self-motivated, positive mindset when investigating incidents
