Site Reliability Engineer / Observability Engineer
Rackspace Technology
Job Summary
Rackspace is seeking a Site Reliability Engineer/Observability Engineer to join its Professional Services Center of Excellence. Responsibilities include implementing observability solutions using Datadog, New Relic, AppDynamics, or Dynatrace; building and maintaining scalable systems; developing monitoring tools and dashboards; performing anomaly detection and performance tuning; collaborating with development teams; and identifying and resolving service issues. The ideal candidate will have senior-level experience in SRE, DevOps, application support, AWS infrastructure, and automation. Experience with observability tools, AWS, scripting (Terraform/CloudFormation), configuration management (Ansible/Chef/Puppet), and agile development is required.
Must Have
- Senior-level SRE/DevOps experience
- AWS infrastructure expertise
- Observability tool experience (Datadog, New Relic, etc.)
- Automation & scripting skills
- Application support and troubleshooting
- Agile development experience
Good to Have
- Experience with Splunk, SignalFx
- Python, PHP, Perl, Ruby, or Linux shell scripting
- Terraform or CloudFormation
- Ansible, Chef, or Puppet
- Understanding of AWS pricing models
Job Description
You Will:
- Work with customers to implement observability solutions
- Build and maintain scalable systems and robust automation that support engineering goals
- Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
- Proactively gather and analyze metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning, and fault isolation
- Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
- Collaborate with team members to document and share solutions
- Maintain a deep understanding of the customer’s business as well as their technical environment
- Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues
You Have:
- Bachelor’s degree in engineering/computer science or equivalent
- Senior-level experience with Site Reliability Engineering; DevOps; code-level application support and troubleshooting; AWS infrastructure design, implementation, and optimization; and automation for deployment, scaling, and reliability
- Experience with observability tools such as Splunk, Datadog, and SignalFx
- Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
- Proactive approach to identifying problems and solutions
- Experience writing code in one or more interpreted languages such as Python, PHP, Perl, Ruby, or Linux shell
- Experience with Terraform or CloudFormation scripting
- Experience with configuration management tools like Ansible, Chef or Puppet
- Experience with standard software development best practices and tools such as code repositories (Git preferred)
- Experience executing in an agile software development environment
- Good understanding of pricing/cost models across AWS services, especially compute, storage, and database offerings
- A clear understanding of network and system management solutions
- Excellent organizational and project management skills
- Excellent communication, critical thinking & analytical skills