SRE - Systems, Networking, Cloud & Development
Thales
Job Summary
Thales is seeking a Site Reliability Engineer (SRE) with expertise in Systems, Networking, Cloud, and Development. The role involves applying SRE principles like measurement, toil elimination, and reliability modeling. You will educate development teams on best practices, architect infrastructure solutions, and troubleshoot operational issues. Responsibilities include proactive data analysis, testing network/system integrity, resolving business-impacting issues, participating in escalations, incident response, RCA, blameless postmortems, and on-call rotations. The ideal candidate will have at least 3 years of experience in cloud/web/CDN scale infrastructure, proficiency in Python and Go, expert knowledge of Linux, networking protocols, and experience with DevOps principles, CI/CD, monitoring tools like Prometheus and Grafana, big data technologies, and container management (Docker, Kubernetes). Experience with C/C++, BGP, Anycast routing, and analyzing data telemetry is a plus.
Must Have
- 3+ years in cloud/web/CDN infrastructure
- Experience with Python and Go
- Expert Linux systems knowledge
- Expert knowledge of network programming and protocols
- Experience with DevOps principles
- Experience with CI/CD tools
- Experience with monitoring tools
- Experience with big data technologies
- Experience with containers and orchestration
- Experience with data telemetry and pipelines
- Experience in software development and monitoring distributed systems
- Experience with Agile methodologies
- Team player, accountable for business urgency
Good to Have
- C/C++ experience
- BGP and Anycast routing experience
- Infrastructure as Code experience
- UI visualization experience
Perks & Benefits
- Career development opportunities
- Global mobility policy
- Flexible working arrangements
Job Description
Responsibilities
- Apply the core SRE tenets of measurement (SLI/SLO/SLA), toil elimination, and reliability modeling
- Enable and educate development teams on industry best-practice design patterns, ways of working, and operational knowledge to ensure platform continuity
- Develop and architect solutions for the infrastructure and operational aspects of new products and feature sets
- Assist with go/no-go preplanning, verification/validation, and review of existing and new products/services
- Proactively analyze data and test the integrity of network/systems to ensure production applications and services are operating optimally
- Work within development teams to troubleshoot and resolve business-affecting issues
- Participate in escalations, incident response, root cause analysis (RCA), and blameless postmortems
- Participate in on-call rotation
Qualifications
- At least 3 years of professional experience with cloud/web/CDN-scale infrastructure
- Experience with Python and Go. C/C++ a plus
- Expert knowledge of Linux systems, network programming, and protocols (TCP, UDP, DNS, TLS/SSL, HTTP)
- Experience with BGP and Anycast routing is a plus
- Experience with DevOps principles and concepts such as Infrastructure as Code (Ansible/SaltStack), CI/CD (GitLab, Jenkins, Git), and monitoring and visualization (Prometheus, Grafana)
- Experience with big data technologies such as NoSQL/RDBMS, Redis, Elasticsearch, Kafka
- Experience with containers and container management (Docker, Kubernetes)
- Experience analyzing and building data telemetry, modeling, pipelines, and UI visualization
- Experience in developing software and in troubleshooting and monitoring large-scale distributed systems
- Ability to implement software engineering best practices and standards across the software development life cycle
- Working knowledge and experience of Agile software development methodologies
- A strong team player who takes accountability for business-urgent issues
- Ability to stay organized in a multi-tasking environment
- Self-starter personality