Platform Reliability Analyst - BOT

4 Months ago • All levels • Operations

Job Description

The Platform Reliability Analyst is responsible for ensuring the continuous monitoring and overall health of our cloud infrastructure hosted on platforms such as AWS, Rackspace, Expedient, and Heroku. This role involves proactive monitoring of system performance, coordinating incident response efforts, and collaborating with development, cloud, and operations teams to address issues before they impact the business. The ideal candidate will have a process-oriented mindset, strong communication skills, and a foundational understanding of cloud technologies to facilitate rapid resolution of incidents and optimize system performance.

Key Responsibilities:

    • Proactive System Monitoring: Oversee system performance and availability through continuous monitoring of alerts from various APM tools (New Relic, CloudWatch, etc.). Provide feedback on alert tuning, identify patterns in incidents, and pinpoint optimization opportunities (e.g., identifying idle systems that could be shut down).
    • Production Support: Build and maintain a comprehensive understanding of all software systems and their variations. Ensure readiness to support production systems by identifying potential issues before they affect customers.
    • Outage Management: Lead the incident command center during outages with a focus on rapid resolution. Coordinate incident response by:
        • Recording incident start/end times and affected systems.
        • Notifying internal stakeholders and support teams of the incident status.
        • Coordinating the involvement of the correct teams and ensuring all relevant details are shared.
        • Providing and executing runbooks, or coordinating with cloud teams for execution.
        • Running incident bridges, ensuring systems, logs, and traffic are monitored and relevant experts are involved.
        • Documenting facts versus theories in real-time during incident resolution.
    • Incident Communication: Notify the company about incidents and coordinate with support to inform customers. Once a status page is in place, manage its status updates for system transparency.
    • Incident Prevention and Follow-Up: Be the first line of defense by proactively identifying system issues before customers are impacted. Conduct root cause analysis (RCA) after incidents to determine underlying issues and implement preventative measures. Create and update runbooks as needed.
    • Collaboration and Coordination: Regularly set up meetings with cloud and development teams to address and resolve recurring issues. Communicate proactively with leadership about any potential cost increases or system inefficiencies.
    • System Health Metrics: Monitor traffic, system health, security perimeter, and overall performance. Track key metrics such as the percentage of issues identified proactively versus reactively.
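
As an illustration of the proactive-versus-reactive metric mentioned above, the short Python sketch below shows one way such a figure could be computed from recorded incidents. The `Incident` record, its field names, and the sample data are hypothetical and not taken from any actual tooling used by the team; they only assume that each incident has start/end times, an affected system, and a flag for how it was detected.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Sequence


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    system: str                 # affected system
    started_at: datetime        # recorded incident start time
    ended_at: datetime          # recorded incident end time
    detected_proactively: bool  # True if found by monitoring before customers were impacted

    @property
    def duration_minutes(self) -> float:
        """Incident length in minutes, from the recorded start/end times."""
        return (self.ended_at - self.started_at).total_seconds() / 60


def proactive_detection_rate(incidents: Sequence[Incident]) -> float:
    """Percentage of incidents identified proactively (0.0 if there are none)."""
    if not incidents:
        return 0.0
    proactive = sum(1 for incident in incidents if incident.detected_proactively)
    return 100.0 * proactive / len(incidents)


# Made-up sample data, for illustration only.
incidents = [
    Incident("checkout-api", datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 45), True),
    Incident("search", datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 20), True),
    Incident("billing", datetime(2024, 1, 12, 3, 0), datetime(2024, 1, 12, 5, 0), False),
    Incident("checkout-api", datetime(2024, 1, 20, 11, 0), datetime(2024, 1, 20, 11, 10), True),
]

print(f"Proactive detection rate: {proactive_detection_rate(incidents):.1f}%")
print(f"First incident lasted {incidents[0].duration_minutes:.0f} minutes")
```

In practice the same calculation could be done in a spreadsheet or a monitoring dashboard; the point is only that the metric is the share of proactively detected incidents out of all incidents in a period.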

Key Skills and Qualifications:

    • Strong Communication Skills: Ability to communicate in clear, concise English to convey the status of incidents and performance issues to both technical and non-technical stakeholders.
    • Process-Oriented Mindset: Ability to follow, document, and improve processes to ensure smooth incident management and resolution.
    • Attention to Detail: Capability to record key details about system health, performance, and incident facts versus theories in real-time.
    • Familiarity with Monitoring Tools: Experience using monitoring and alerting tools such as New Relic, CloudWatch, or Datadog, and familiarity with logs, traffic monitoring, and system health metrics.
    • Coordination and Leadership Skills: Ability to lead incident response teams, coordinate with various technical experts, and manage communication effectively during outages.
    • Basic Technical Understanding: Although this is not an engineering role, some technical familiarity with cloud environments, system alerts, and security practices is important. Entry-level engineers with an interest in coordination roles are encouraged to apply.
    • Collaboration: Ability to work cross-functionally with development, cloud, and support teams to ensure smooth operations and proactive issue resolution.

About The Company

We co‑innovate with the world's most ambitious brands to create transformative digital experiences.

 


Get notified when new jobs are added by Bounteous
