Monitoring System Engineer

SoftSwiss

Job Summary

SOFTSWISS is seeking an experienced Monitoring System Engineer to join its expanding DevOps team. The role primarily involves responding to events and monitoring alerts, providing on-duty service coverage including day and night on-call shifts, troubleshooting technical problems, and documenting incidents. Additionally, the engineer will be responsible for maintaining and enhancing monitoring systems by collaborating with teams, setting up and adjusting observability tools, refining alerts and dashboards, building integrations, and updating the Knowledge Base. The ideal candidate will have at least 3 years of experience in a relevant role and strong technical skills.

Must Have

  • Offer on-duty service coverage, encompassing day and night on-call shifts.
  • Provide timely and effective solutions to technical problems.
  • Address incidents by troubleshooting and resolving issues.
  • Keep detailed records and documentation of current infrastructure challenges and Root Cause Analyses (RCAs).
  • Collaborate with other teams to understand and define their monitoring needs.
  • Set up and adjust the monitoring/observability systems for various teams.
  • Design and tweak alerts and dashboards to suit specific needs.
  • Refine alerts to reduce irrelevant notifications and increase their significance.
  • Enhance dashboards for better clarity, understanding, and a more comprehensive view.
  • Establish and update a Knowledge Base, covering system configurations, alert processes, troubleshooting guidelines, and user manuals.
  • Minimum of 3 years' experience as a Systems Engineer, SRE, DevOps, or Monitoring Support Engineer.
  • Good understanding of Linux-like operating systems (Debian-based).
  • Experience with containerization, virtualization, and orchestration (LXC/LXD, Docker, Kubernetes).
  • Development experience in any scripting language (Bash, Python, Go, etc.) and familiarity with REST APIs.
  • Knowledge of basic database concepts (experience with PostgreSQL is preferable), including transactions and WAL.
  • English proficiency at an Intermediate (B1) level or higher.
  • Experience with at least two monitoring/observability tools (Zabbix, Grafana, Prometheus/VictoriaMetrics, ELK/Splunk, Site24x7/Pingdom).
  • Strong understanding of Linux concepts: File systems, Process management, Built-in monitoring tools, Networks, Scripting, Troubleshooting.

Good to Have

  • Familiarity with Kafka
  • Familiarity with RabbitMQ
  • Familiarity with GitLab
  • Familiarity with Nginx/Puma
  • Familiarity with ClickHouse
  • Familiarity with MongoDB
  • Familiarity with HashiCorp Vault
  • Familiarity with Microservices
  • Familiarity with IaC / infrastructure automation
  • Familiarity with Provisioning tools (Terraform)
  • Familiarity with Configuration management (Ansible, Salt, Puppet)

Perks & Benefits

  • Full-time remote work opportunities
  • Flexible working hours
  • Private insurance
  • One additional day off per calendar year
  • Sports program compensation
  • Comprehensive Mental Health Programme
  • Free online English lessons with a native speaker
  • Generous referral program
  • Training, internal workshops, and participation in international professional conferences and corporate events

Job Description

SOFTSWISS is growing, and we are seeking a skilled Monitoring System Engineer to join our team. If you are driven by excellence and share our values, we would love to hear from you.

Overview:

SOFTSWISS continues to expand its team and is looking for a Monitoring System Engineer: an experienced, accomplished professional who shares our culture and values.

Key responsibilities:

The two main pillars of our workflow are responding to events and monitoring alerts, and maintaining and enhancing the monitoring systems.

Responding to Events/Monitoring Alerts (L1/L2 tasks for certain system parts):

  • Offering on-duty service coverage, encompassing day and night on-call shifts.
  • Providing timely and effective solutions to technical problems reported by users.
  • Communicating clearly with users to understand their issues and provide updates on resolution status.
  • Addressing incidents by troubleshooting and resolving issues, seeking assistance from third-party or vendor support when necessary.
  • Directing issues or queries to the relevant department as needed.
  • Keeping detailed records and documentation of current infrastructure challenges and Root Cause Analyses (RCAs).
  • Creating detailed reports for all technical support incidents, including descriptions, resolutions, and timelines.

Maintaining and Enhancing the Monitoring Systems:

  • Collaborating with other teams to understand and define their monitoring needs, then implementing the right solutions.
  • Setting up and adjusting the monitoring/observability systems for various teams.
  • Designing and tweaking alerts and dashboards to suit specific needs.
  • Refining alerts to reduce irrelevant notifications and increase their significance.
  • Enhancing dashboards for better clarity, understanding, and a more comprehensive view.
  • Building and sustaining connections between the monitoring systems and other platforms like Jira, Opsgenie, etc. when required.
  • Establishing and updating a Knowledge Base, covering system configurations, alert processes, troubleshooting guidelines, and user manuals.
  • Staying up to date with the newest trends and best practices to continuously improve our organization's monitoring capabilities.

Required Experience:

  • Minimum of 3 years' experience as a Systems Engineer, SRE, DevOps, or Monitoring Support Engineer.
  • Good understanding of Linux-like operating systems (Debian-based).
  • Experience with containerization, virtualization, and orchestration (LXC/LXD, Docker, Kubernetes).
  • Development experience in any scripting language (Bash, Python, Go, etc.) and familiarity with REST APIs.
  • Knowledge of basic database concepts (experience with PostgreSQL is preferable), including transactions and WAL.
  • English proficiency at an Intermediate (B1) level or higher. It's crucial to understand technical terminology related to our specific tech stack and to be able to interpret technical documentation.

Skills & Experience

Monitoring/observability tools (experience with at least two of the following)

  • Zabbix (familiarity with concepts such as LLD, prototypes, dependencies, and preprocessing)
  • Grafana (knowledge of data sources, dashboard creation, and query usage)
  • Prometheus/VictoriaMetrics/etc. (understanding of metrics collection and alerting)
  • ELK/Splunk/etc. (ability to use queries and filters for log analysis)
  • Site24x7/Pingdom/etc. (experience with web monitoring and performance metrics)

Linux-like operating systems

Strong understanding of key concepts, including:

  • File systems
  • Process management
  • Built-in monitoring tools
  • Networks
  • Scripting
  • Troubleshooting

Familiarity with

  • Kafka
  • RabbitMQ
  • GitLab
  • Nginx/Puma
  • ClickHouse
  • PostgreSQL
  • MongoDB
  • HashiCorp Vault
  • Microservices and orchestration (Kubernetes)
  • Any IaC / infrastructure automation:
      • Provisioning tools (Terraform)
      • Configuration management (Ansible, Salt, Puppet)

Our Benefits:

  • Full-time remote work opportunities and flexible working hours
  • Private insurance
  • One additional day off per calendar year
  • Sports program compensation
  • Comprehensive Mental Health Programme
  • Free online English lessons with a native speaker
  • Generous referral program
  • Training, internal workshops, and participation in international professional conferences and corporate events
