Senior Cloud Site Reliability Engineer

11 Minutes ago • 4 Years +
Devops

Job Description

The Senior Cloud SRE works to improve the reliability and availability of our solutions. This includes providing on-call support for Major Incidents and helping us reduce the duration and occurrence of outages. A typical day involves creating dashboards for observability, consulting with development teams on SRE services, automating manual activities, participating in solution design, documenting findings, ensuring proper monitoring, identifying improvements, and assisting with root cause analysis for incidents. The role also involves reviewing work of other SREs, supporting services pre-launch, practicing incident response, and creating automated diagnostics.
Good To Have:
  • Experience working with Prometheus, Datadog, Grafana, Splunk, BMC.
  • Experience with Application Performance Monitoring solutions (Dynatrace, AppDynamics, New Relic).
  • Experience working with Kubernetes, Docker, microservices, serverless compute.
  • Experience working with Ansible, Terraform.
  • Experience with C#, C++, Java, Python, Perl, or Ruby.
Must Have:
  • Improve reliability and availability of solutions.
  • Provide on-call support for Major Incidents.
  • Automate activities to reduce toil.
  • Consult with development teams on SRE services.
  • Ensure proper monitoring and observability (SLI/SLO metrics).
  • Participate in solution design and capacity planning.
  • Practice sustainable incident response and blameless post mortems.
  • 4+ years programming/scripting experience.
  • 4+ years experience in public or private cloud environments.
  • 4+ years SRE experience.
  • Experience with Agile, Jira, GitHub, monitoring, automation, dashboarding.
  • Strong troubleshooting skills for complex issues.
  • Effective technical and non-technical communication.
  • Self-driven and able to work with little supervision.


At NiCE, we don’t limit our challenges. We challenge our limits. Always. We’re ambitious. We’re game changers. And we play to win. We set the highest standards and execute beyond them. And if you’re like us, we can offer you the ultimate career opportunity that will light a fire within you.

A Typical Day Might Include the Following:

  • Create a new dashboard that gives a development team observability into the health of their application. This can include SLI/SLO metrics.
  • Consult with development workstreams on SRE services and how we can help them improve their reliability.
  • Automate activities previously done manually to reduce toil.
  • Participate in the design, definition, and scoping of a new solution to meet our internal customers' needs. Document it thoroughly and ensure agreement among the participants.
  • Document findings and share with other SREs.
  • Work with teams to ensure proper monitoring is set up and enabled.
  • Identify evolutionary improvements.
  • Meet with Incident and Problem Management to discuss previous Major Incidents and help identify root causes and permanent fixes. Help identify which of these the SRE team can assist with.
  • Assist other teams in doing data/performance analysis to identify why an issue is occurring.
  • Review work of other SREs and help train them.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
  • Practice sustainable incident response and blameless post mortems.
  • Assist in creation of automated end-to-end diagnostics.
  • Communicate effectively to technical and non-technical peers and customers.
  • Coordinate and work on multiple cross-functional base work initiatives and projects.
  • Participate in planning long- and short-term project efforts.
  • Lead or provide technical direction for the planning, execution, and validation of testing work.
  • Provide technical guidance and coaching/mentoring to team members.
  • Follow established processes when performing work or help document and create processes as necessary.
  • Document troubleshooting steps and results in appropriate locations for historical access.
  • Ensure compliance with policies, procedures, and standards.
  • Implement or coordinate remediation required by audits/assessments, and document as necessary.
  • Provide on-call support for high-priority incidents.
  • Estimate time to complete activities/projects.

To Land This Gig You'll Need:

  • Bachelor's degree in Computer Science, Business Information Systems, or related field (or equivalent work experience) is required.
  • 4+ years programming/scripting experience.
  • 4+ years of experience working within public or private cloud environments.
  • 4+ years of SRE or related experience.
  • Experience with Agile, Jira, GitHub, monitoring, automation, dashboarding.
  • 6+ years communicating in English in a technical field.
  • Can effectively troubleshoot supported applications.
  • Can work on complex issues which may span multiple applications or environments.
  • Proactively engages with peers to discuss issues and keep stakeholders updated.
  • Mentors co-workers with expertise.
  • Coordinates work with peers.
  • Shares discoveries and best practices.
  • Learns from others within the team.
  • Self-Driven. Proactively looks for ways to improve.
  • Able to work with little supervision and complete tasks and projects as directed.

Bonus Experience:

  • Experience working with Prometheus, Datadog, Grafana, Splunk, BMC.
  • Experience with Application Performance Monitoring solutions (Dynatrace, AppDynamics, New Relic).
  • Experience working with Kubernetes, Docker, microservices, serverless compute.
  • Experience working with Ansible, Terraform.
  • Experience with one or more of the following: C#, C++, Java, Python, Perl, or Ruby.
