Cloud SRE

Job Description

As a Site Reliability Engineer (SRE) for our large, regionally distributed SaaS platform, your primary responsibility will be to improve the reliability and availability of our mission-critical cloud-based services. You will create dashboards and metrics for observability, consult with development teams on SRE best practices, and automate tasks to reduce manual intervention. Additionally, you will assist in incident and problem management, share knowledge, mentor other SREs, and ensure compliance with processes and documentation.

Good To Have:

Kubernetes
Kubernetes certification
Grafana
AWS
Azure
DevOps experience

Must Have:

Create new dashboards and metrics for comprehensive observability, including SLI/SLO metrics.
Work with development teams to ensure proper monitoring is set up and enabled.
Identify evolutionary improvements to observability and monitoring solutions.
Consult with development teams on SRE services and best practices.
Create automation and tooling to reduce toil and manual intervention.
Assist other teams in data and performance analysis to identify root causes.
Review the work of other SREs and provide training and guidance.
Communicate effectively with both technical and non-technical peers and customers.
Follow established processes or help document and create new ones as necessary.
Document troubleshooting steps and results in appropriate locations.
Ensure compliance with policies, procedures, and standards.
Implement or coordinate remediation required by audits and assessments.
Estimate the time required to complete activities and projects.

Perks:

Fast-paced, collaborative, and creative environment
Learning and growth opportunities
Endless internal career opportunities across multiple roles, disciplines, domains, and locations
NICE-FLEX hybrid model (2 days office, 3 days remote work each week)

At NiCE, we don’t limit our challenges. We challenge our limits. Always. We’re ambitious. We’re game changers. And we play to win. We set the highest standards and execute beyond them. And if you’re like us, we can offer you the ultimate career opportunity that will light a fire within you.

So, what’s the role all about?

How will you make an impact?

Essential Duties and Responsibilities:

1. Observability and Monitoring:

Create new dashboards and metrics to provide comprehensive observability into the health and performance of development teams' applications, including SLI/SLO metrics.
Work with development teams to ensure proper monitoring is set up and enabled for their services.
Identify evolutionary improvements to the observability and monitoring solutions.

2. Reliability Consulting and Automation:

Consult with development teams on SRE services and best practices to help them improve the reliability of their applications.
Create automation and tooling to reduce toil and manual intervention.

3. Incident and Problem Management:

Assist other teams in data and performance analysis to identify the root causes of issues and recommend automation actions.

4. Knowledge Sharing and Mentoring:

Review the work of other SREs and provide training and guidance to help them improve their skills.
Communicate effectively with both technical and non-technical peers and customers.

5. Process and Documentation:

Follow established processes when performing work or help document and create processes, as necessary.
Document troubleshooting steps and results in appropriate locations for historical access.
Ensure compliance with policies, procedures, and standards.
Implement or coordinate remediation required by audits and assessments, and document, as necessary.

6. Time Estimation:

Estimate the time required to complete activities and projects.

Have you got what it takes?

4+ years of programming/scripting experience with any of the following: Go, Python, .NET (C#), Node.js
4+ years of experience working within public or private cloud environments
4+ years of SRE/DevOps/Observability or related experience
4+ years of AWS experience
Experience with Agile, Jira, GitHub, monitoring, automation, and dashboarding

You will have an advantage if you also have:

Kubernetes experience and certification, Grafana, AWS, Azure, and DevOps experience.

What’s in it for you?

Join an ever-growing, market-disrupting, global company where the teams – composed of the best of the best – work in a fast-paced, collaborative, and creative environment! As the market leader, every day at NICE is a chance to learn and grow, and there are endless internal career opportunities across multiple roles, disciplines, domains, and locations. If you are passionate, innovative, and excited to constantly raise the bar, you may just be our next NICEr!

Enjoy NICE-FLEX!

At NICE, we work according to the NICE-FLEX hybrid model, which enables maximum flexibility: 2 days working from the office and 3 days of remote work, each week. Naturally, office days focus on face-to-face meetings, where teamwork and collaborative thinking generate innovation, new ideas, and a vibrant, interactive atmosphere.

Requisition ID: 7547

Reporting into: Manager, Cloud Operations

Role Type: Individual Contributor
