Purpose of the Role
We are seeking a NOC Level 2 Support Engineer for our Cape Town office who can apply their technical skills in a fast-paced, complex environment. The Level 2 NOC Engineer's duties include monitoring, diagnosing, troubleshooting, tracking, and documenting issues across multiple environments, as well as day-to-day customer support and interaction.
Strong communication skills are particularly important for this role. NOC Level 2 Support Engineers should enjoy working in a fast-paced environment where adaptability and flexibility are key to success. The successful candidate will be able to work both independently and within a team, working in shifts to provide 24/7 coverage.
Responsibilities:
- Monitor dashboards of key metrics, proactively detecting potential incidents before they occur.
- Proactively monitor Slack channels for issues raised both internally and externally.
- Investigate, diagnose, troubleshoot, and resolve incidents where possible.
- Escalate incidents that require additional expertise to SRE, DBAs, Dev Support, etc., and work with them until the incident is resolved.
- Work with incident managers and other teams in war rooms for P1/P2 issues to restore operations as quickly as possible.
- Participate in the root cause analysis of incidents and help with incident reports.
- Add and update the runbook documentation used to troubleshoot and resolve incidents, and share knowledge with the rest of the team.
- Implement and improve processes for monitoring/alerting, systems maintenance, and escalation.
- Help guide the development of tooling used to troubleshoot and resolve issues, so that the NOC works more effectively.
- Develop key dashboards that report uptime and other identified metrics transparently.
Requirements:
- A minimum of 2 years’ experience working in a NOC team offering 24/7 critical support.
- Excellent troubleshooting and creative problem-solving abilities.
- Background in Linux administration.
- Good networking understanding (TCP/IP, DNS, routing, firewalls, etc.).
- Good understanding of technologies such as Apache, Nginx, databases, DNS servers, etc.
- Experience with supporting Cloud-based applications – we use Amazon Web Services (AWS).
- Experience in using monitoring systems and investigating issues at a log level – we use Datadog.
- Experience coordinating and collaborating with multiple teams such as Helpdesk & SRE.
- Excellent communication and interpersonal skills.
- Flexibility to change shift patterns during peak times and critical projects.
- Experience in creating technical documentation and reports.
- Readiness to offer training to colleagues when needed.
Advantageous:
- Experience with Datadog Monitoring and Incident Management.
- Experience maintaining continuous integration and delivery pipelines with tools such as Jenkins and CircleCI.
- Scripting or programming knowledge, at minimum Unix shell scripting.
- Background in Windows Administration.
- Experience with Postgres.
- Experience with Kubernetes.