Senior Site Reliability Engineer

7+ Years • DevOps

Job Description

The Hyperconnect Platform Department, comprising SRE, DevOps, Platform Development, and Data Engineering teams, provides infrastructure and common platform technologies for all services like Azar and new products, creating business impact. The SRE team aims to maintain the stability of all services developed at Hyperconnect, ensuring users can enjoy special experiences without inconvenience. They manage incident response, post-mortems, prevention, and support development teams in achieving business missions. They also analyze system availability, reliability, and scalability metrics.
Must have:
  • Build and operate high-availability system infrastructure in a public cloud environment (AWS).
  • Manage infrastructure as code using Terraform, Helm, and ArgoCD.
  • Implement stable logging, monitoring, and automation using Zabbix, Prometheus, OpenTelemetry, Elasticsearch, and Grafana Mimir.
  • Respond to service incidents, perform root cause analysis, and plan prevention strategies.
  • Identify and optimize service improvement points based on SLO/SLI, focusing on low-latency/high-performance core systems and global media systems.
  • Conduct PoC for new technologies and apply them to production.
  • 7+ years of related experience or equivalent in large-scale service/infrastructure operations.
  • Broad understanding of computer science fundamentals, particularly Linux and networking.
  • Broad understanding of container technologies.
  • Basic development skills in programming languages such as Python or Golang.
  • Practical experience with Linux-based servers in public cloud environments like AWS.
  • Excellent communication and documentation skills for collaboration.
  • Ability to identify and proactively solve service problems.
  • Enjoys learning new technologies.
Good to have:
  • Basic understanding and practical experience with Kubernetes.
  • Experience with Infrastructure-as-Code tools.
  • Experience troubleshooting issues related to Java/Kotlin, Spring Framework.
  • Experience operating real-time/highly scalable systems.
  • Experience troubleshooting various incidents in a production environment.

Job Details

Introduction to Platform Department

The Hyperconnect Platform Department, composed of SRE, DevOps, Platform Development, and Data Engineering teams, provides infrastructure and common platform technologies for all services, including Azar and new products, creating business impact. We also help prevent technical silos and foster an excellent engineering culture across the company.

How the Platform Department Works

  • We don't just create the infrastructure that development teams ask for; we provide system designs that meet both business and technical requirements. To do so, we lead Q&A sessions with stakeholders from various departments, including development teams and, if necessary, other job functions.
  • We proactively dig into technical metrics, logs, and source code to identify, define, and solve problems that lie in the blind spots of development teams or SRE/DevOps teams.
  • We create documentation on basic design methods and best practices to help development teams or other SRE/DevOps colleagues solve problems independently.
  • We do not settle for the current technology stack; we actively explore and adopt new technologies that solve problems better.

Introduction to SRE Team

The SRE team aims to maintain the stability of all services developed at Hyperconnect, ensuring users can enjoy the special experiences provided by Hyperconnect without inconvenience.

  • We manage incident response, post-mortem analysis, prevention activities, and improvements to incident response manuals from a company-wide perspective, doing whatever is needed to keep service delivery stable.
  • We gather the difficulties that development teams face through various channels and actively support them in achieving their business missions.
  • We analyze and monitor metrics related to system availability, reliability, and scalability, and improve them step by step together with the service teams.
  • We provide training and guidance to help developers use the systems provided by DevOps/SRE effectively.
  • We help every developer deploy without fear, and we manage and improve the platforms that make this possible together with the DevOps team.

If you join our team, you will:

  • Work hands-on with modern computing and network infrastructure such as AWS, Kubernetes, and service mesh across all services and systems.
  • Contribute deeply to backend engineering, going beyond simple infrastructure management and provisioning support.
  • Think deeply about high-performance, low-latency systems, as the real-time nature of our business demands.
  • Gain know-how and best practices for managing large-scale infrastructure in a global, multi-product, and complex production environment spanning B2B and B2C.

Check out how the SRE team works and what problems they solve in HyperLink sessions!

  • [\[HyperLink_DevOps\] Session 3. Serving ML and Media Services that Sustain Hyperconnect to 230 Countries](https://youtu.be/6l8a3FoyFM)
  • [\[HyperLink_DevOps\] Q&A Session](https://youtu.be/ZD60a4bTTAo)

Responsibilities

Building and Operating High-Availability System Infrastructure in Public Cloud Environments

  • Build and operate server system infrastructure in an AWS cloud environment.
  • We prefer managing infrastructure as code over working in the cloud provider's console, so we use tools such as Terraform, Helm, and ArgoCD. Applications are deployed to Kubernetes-based container environments through a Spinnaker-based IDP (internal developer platform).

System/Application Logging, Monitoring, and Automation

  • Stable logging and monitoring are essential for service stability. We use Zabbix and Prometheus to automate monitoring as much as possible, configuring the system so that infrastructure resources are auto-discovered and appropriate alerts are set without human intervention.
  • We actively use OpenTelemetry and Elasticsearch to monitor applications across 300+ microservices.
  • When this work calls for an automation tool, we develop it ourselves and share it internally.
  • We also reliably collect and manage enterprise-scale time-series metrics using Grafana Mimir.

Leading Service Incident Response and Post-Mortem Culture

  • Together with service development teams, we take appropriate actions when incidents occur, perform root cause analysis, and plan and execute strategies to prevent recurrence.
  • We also develop and operate the processes and tools that ensure these activities are carried out well company-wide, and we continuously improve them.

Discovering Service Improvement Points and Problems, Optimization based on SLO/SLI

  • We continuously monitor problems that arise during service operation, and we identify and address weaknesses in service performance, stability, and scalability.
  • In particular, we contribute significantly to Hyperconnect's most critical low-latency/high-performance core systems and global media systems.
  • These improvement tasks span many technical areas, including cloud infrastructure, CDN/network, application optimization, and the introduction of new solutions.

PoC for New Technologies and Production Application

  • We research or develop tools that improve reliability and bring them into production. New tools are first applied in development environments, where we discuss their pros and cons; they are rolled out to production only after thorough verification so that stability is maintained.

Requirements

  • 7+ years of related experience or equivalent experience in large-scale service/infrastructure operations.
  • Broad understanding of computer science fundamentals, particularly Linux and networking.
  • Broad understanding of container technologies.
  • Basic development skills in programming languages such as Python or Golang.
  • Practical experience with Linux-based servers in public cloud environments like AWS.
  • Excellent communication and documentation skills for collaborating with various organizations.
  • Ability to identify problems occurring in services and proactively propose solutions.
  • Enjoys following technology trends and learning new technologies.

Preferred Qualifications

  • Basic understanding and practical experience with Kubernetes.
  • Experience with Infrastructure-as-Code tools.
  • Experience troubleshooting issues related to Java/Kotlin, Spring Framework.
  • Experience operating real-time/highly scalable systems.
  • Experience troubleshooting various incidents in a production environment.

Hiring Process

  • Employment Type: Full-time
  • Hiring Process: Document Screening > Coding Test/Assignment > 1st Interview > Recruiter Call > 2nd Interview > Final Offer (*Process may be added or changed if necessary.)
  • Only candidates who pass document screening will be notified individually.
  • Application Documents: Detailed, career-focused resume (PDF) in any format, written in Korean or English.
  • This position is open to Professional Research Personnel (new enrollment or transfer, for both active-duty and supplementary-service tracks). For personnel under special military service arrangements, service is managed in accordance with the applicable military service laws.

If any false information is found in the submitted content, or if there are grounds for disqualification from employment under relevant laws, employment may be canceled. If necessary, additional screening and document verification may be conducted beyond the recruitment process announced in advance.

Persons of national merit receive preferential treatment in accordance with relevant laws; if this applies to you, please notify us when applying and submit supporting documents upon employment.

When applying for a position at Hyperconnect, this privacy policy applies to the processing of personal information: https://career.hyperconnect.com/privacy
