Senior Machine Learning Software Engineer (ML Platform)

Job Description

The ML Platform Team at Hyperconnect's AI Lab is seeking a Senior Machine Learning Software Engineer. This role focuses on automating and stabilizing the entire ML production process to drive business impact and maximize R&D productivity. Responsibilities include building cloud-based ML Ops infrastructure, developing tools, managing high-performance GPU clusters, optimizing model performance and operational costs, and developing mobile inference engines. The team aims to create a seamless research and development environment for ML Engineers.

HYPERCONNECT AI - AI Lab Introduction

Hyperconnect AI Lab identifies and solves problems in services that connect people: problems that are difficult to approach with existing technologies but can be solved through machine learning. In doing so, we innovate the user experience. To achieve this, we develop numerous models across domains including video, audio, natural language, and recommendation. Our goal is to contribute to the growth of real services by serving these models reliably on mobile devices and cloud servers and by resolving any challenges encountered along the way. With this objective, Hyperconnect AI Lab has been advancing machine learning technology for several years and contributing to Hyperconnect's products, including Azar.

ML Platform Team Introduction

The ML Platform Team, part of the AI Lab, automates and stabilizes the entire ML production process to ensure that AI technology quickly translates into business impact. Our aim is to maximize the research and development productivity of the entire organization through a sustainable platform.

Currently, we address complex technical challenges arising from operating over 50 models in production. To successfully accomplish this mission, we are responsible for the following core tasks:

Building Cloud-based ML Ops Infrastructure and Tool Development

We develop and operate ML Ops components to establish an automated virtuous cycle (AI Flywheel) that utilizes product data to retrain, evaluate, and deploy models, thereby continuously improving products. Key components include:

  • Providing a unified serving platform using ArgoCD and NVIDIA Triton to rapidly deploy models trained with various deep learning frameworks (TensorFlow, PyTorch) across different domains to production (a client-side sketch follows this list).
  • Offering an Argo Workflows-based training workflow platform that enables users to easily create and execute necessary workflows.
  • Delivering data pipelines to facilitate the processing of raw data into training-ready formats.
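
For illustration, here is a minimal sketch of what calling such a unified serving platform can look like from the client side, using the open-source tritonclient package. The endpoint URL, model name, and tensor names/shapes are placeholders, not Hyperconnect's actual services.

```python
# Minimal sketch: querying a model on a Triton Inference Server via the
# open-source HTTP client. Endpoint, model name, and tensor names/shapes
# are illustrative placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="triton.internal:8000")  # hypothetical endpoint

# Prepare a single FP32 input tensor (e.g., a batch of one 512-dim embedding).
batch = np.random.rand(1, 512).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# Request one named output and run inference.
response = client.infer(
    model_name="example_recommendation_model",  # placeholder name
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT__0")],
)
scores = response.as_numpy("OUTPUT__0")
print(scores.shape)
```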

Furthermore, we provide developer portals, SDKs, and CLI tools to control and leverage the aforementioned ML Ops components and platforms, making it easy to build continuous learning pipelines. We also run proofs of concept (PoCs) on rapidly evolving new MLOps technologies and apply them to production when warranted, continuously improving the system.
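
The internal CLIs mentioned above are not public. Purely as an illustration of the kind of tool involved, here is a tiny argparse-based sketch of a hypothetical `mlctl` command for submitting a retraining workflow; the command name, flags, and `submit_workflow()` helper are all invented.

```python
# Hypothetical sketch only: the shape of an internal CLI ("mlctl") that an ML
# Engineer might use to kick off a retraining workflow. The command, flags,
# and submit_workflow() helper are invented for illustration.
import argparse

def submit_workflow(name: str, dataset: str) -> str:
    # A real tool would call the Argo Workflows API here; this stub just
    # returns a fake workflow ID.
    return f"wf-{name}-{abs(hash(dataset)) % 10_000}"

def main() -> None:
    parser = argparse.ArgumentParser(prog="mlctl", description="Submit ML workflows (sketch)")
    sub = parser.add_subparsers(dest="command", required=True)
    train = sub.add_parser("train", help="submit a training workflow")
    train.add_argument("--model", required=True)
    train.add_argument("--dataset", required=True)
    args = parser.parse_args()

    if args.command == "train":
        wf_id = submit_workflow(args.model, args.dataset)
        print(f"submitted workflow {wf_id}")

if __name__ == "__main__":
    main()
```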

Building and Operating High-Performance GPU Clusters

To support seamless ML research and large-scale model training, we design and build a Slurm-based HPC (High-Performance Computing) GPU cluster optimized for business requirements. This includes the latest GPU resources such as A100/H100, as well as high-speed interconnects like InfiniBand (EDR/HDR/NDR) to minimize bottlenecks between nodes.
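
As a concrete illustration, here is a hedged sketch of submitting a multi-node training job to a Slurm cluster from Python. The partition name, GPU count, and training entry point are illustrative; actual values depend on how a given cluster is laid out.

```python
# Sketch: submitting a multi-node GPU training job to Slurm from Python.
# Partition and resource values (a100, gpu:8) are illustrative.
import subprocess

job_script = """#!/bin/bash
#SBATCH --job-name=example-train
#SBATCH --partition=a100
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
srun python train.py  # placeholder training entry point
"""

# sbatch accepts the job script on stdin and prints the assigned job ID.
result = subprocess.run(["sbatch"], input=job_script, text=True,
                        capture_output=True, check=True)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```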

We meticulously tune scheduling policies so that limited computing resources can be shared cost-effectively across the research organization. We separate partitions by workload characteristics and manage job priorities. Additionally, we monitor key metrics by integrating Prometheus and Grafana with Slurm's accounting data, continuously optimizing resource allocation.
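
The sketch below shows the kind of accounting analysis this enables: aggregating GPU-hours per partition from `sacct` output. It assumes Slurm's default TRES formatting (e.g. `gres/gpu=8`); the start date is arbitrary.

```python
# Sketch: aggregating GPU-hours per partition from Slurm accounting data,
# the kind of metric fed into Prometheus/Grafana dashboards. Assumes the
# default TRES formatting (e.g. "gres/gpu=8").
import subprocess
from collections import defaultdict

def elapsed_to_hours(elapsed: str) -> float:
    # Slurm elapsed time looks like "HH:MM:SS" or "D-HH:MM:SS".
    days, _, rest = elapsed.rpartition("-")
    h, m, s = (int(x) for x in rest.split(":"))
    return int(days or 0) * 24 + h + m / 60 + s / 3600

out = subprocess.run(
    ["sacct", "-a", "-n", "-P", "-S", "2024-01-01",
     "--format=Partition,Elapsed,AllocTRES"],
    capture_output=True, text=True, check=True,
).stdout

gpu_hours: dict[str, float] = defaultdict(float)
for line in out.splitlines():
    if not line:
        continue
    partition, elapsed, tres = line.split("|")
    gpus = 0
    for item in tres.split(","):
        if item.startswith("gres/gpu="):
            gpus = int(item.split("=")[1])
    if partition and gpus:  # skip job steps, which have no partition
        gpu_hours[partition] += gpus * elapsed_to_hours(elapsed)

for part, hours in sorted(gpu_hours.items()):
    print(f"{part}: {hours:,.1f} GPU-hours")
```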

To ensure cluster stability and reproducibility, we manage various configurations using IaC (Infrastructure as Code) tools such as Ansible and Terraform. We also integrate parallel/network file systems like Lustre and NFS for large-capacity training data.

We develop and operate automation tools for cluster management, monitoring, disaster recovery, and handling user requests.
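
A minimal sketch of one building block of such automation follows: a per-node health check that verifies shared filesystem mounts and GPU visibility. The mount points and GPU count are illustrative assumptions.

```python
# Sketch of a per-node health check: verify that shared filesystems are
# mounted and that all expected GPUs are visible. Mount points and GPU
# count are illustrative.
import subprocess

EXPECTED_MOUNTS = ["/lustre", "/nfs/datasets"]  # illustrative paths
EXPECTED_GPUS = 8                               # illustrative count

def mounted(path: str) -> bool:
    # /proc/mounts lists "device mountpoint fstype options dump pass".
    with open("/proc/mounts") as f:
        return any(line.split()[1] == path for line in f)

def visible_gpus() -> int:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())

problems = [p for p in EXPECTED_MOUNTS if not mounted(p)]
if visible_gpus() != EXPECTED_GPUS:
    problems.append("gpu count mismatch")
print("healthy" if not problems else f"unhealthy: {problems}")
```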

Model Performance and Operational Cost Optimization

For large-scale model training, we adopt cutting-edge distributed training technologies such as FSDP (Fully Sharded Data Parallel) and DeepSpeed to accelerate training. For serving, we apply model compilation using NVIDIA TensorRT and ONNX Runtime to meet business requirements (e.g., latency vs. throughput). We also apply model-compression techniques such as INT8/FP16 quantization to reduce response times.
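
As one concrete example of the compression step, here is a sketch of post-training dynamic INT8 quantization with ONNX Runtime. The file paths are placeholders; in practice this would be followed by accuracy and latency regression tests.

```python
# Sketch: post-training dynamic INT8 quantization with ONNX Runtime.
# File paths are placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model_fp32.onnx",    # exported FP32 model (placeholder path)
    model_output="model_int8.onnx",   # quantized output (placeholder path)
    weight_type=QuantType.QInt8,      # store weights as signed 8-bit integers
)
```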

We maximize throughput with dynamic batching using Triton Inference Server and significantly reduce cost per query by leveraging high-efficiency computing resources like AWS Inferentia. Through performance profiling, we monitor key metrics such as resource utilization, P99 Latency, and RPS (Requests Per Second), and continuously improve cost-effectiveness by implementing efficient auto-scaling policies using KEDA (Kubernetes Event-driven Autoscaling).
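
For readers unfamiliar with these metrics, the sketch below shows the arithmetic behind P99 latency and RPS over a measurement window; the latency samples are fabricated for illustration.

```python
# Sketch: computing P99 latency and RPS from per-request latencies (seconds)
# collected over a fixed window. Sample data is fabricated.
import statistics

window_seconds = 60.0
latencies = sorted(0.010 + 0.002 * (i % 50) for i in range(6_000))  # fake data

p99_index = int(len(latencies) * 0.99) - 1
p99_latency = latencies[p99_index]
rps = len(latencies) / window_seconds

print(f"P99 latency: {p99_latency * 1000:.1f} ms")
print(f"RPS: {rps:.0f}")
print(f"mean latency: {statistics.mean(latencies) * 1000:.1f} ms")
```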

Developing Inference Engines for Mobile Devices

We research and develop an inference engine SDK that enables Hyperconnect's on-device AI models to operate stably and efficiently in mobile environments using various frameworks such as TFLite and PyTorch Mobile. Beyond simple model conversion, we apply the latest techniques such as quantization, pruning, SIMD optimization, and GPU/NNAPI acceleration to minimize latency and optimize battery and memory usage.
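
As an illustration of one such technique, here is a sketch of converting a trained model to TFLite with post-training quantization enabled. The SavedModel path is a placeholder.

```python
# Sketch: converting a trained model to TFLite with post-training
# quantization, one on-device optimization technique. Paths are placeholders.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("exported/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable weight quantization
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```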

We also establish mobile model build and deployment pipelines, automated test environments, and profiling and debugging workflows to ensure consistent performance across diverse device environments such as iOS and Android. This allows us to deliver a commercial-grade mobile AI platform that brings models developed at the research stage to a large user base.
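
The sketch below shows the basic measurement pattern behind such profiling: a desktop-side latency microbenchmark for a TFLite model. It is a simplified stand-in; real measurements run on physical iOS/Android devices.

```python
# Sketch: a desktop-side latency microbenchmark for a TFLite model, a
# simplified stand-in for on-device profiling. Model path is a placeholder.
import time

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model_quantized.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

# Warm up, then time repeated invocations.
for _ in range(10):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()

t0 = time.perf_counter()
runs = 100
for _ in range(runs):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
print(f"mean latency: {(time.perf_counter() - t0) / runs * 1000:.2f} ms")
```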

In this process, beyond pure engineering, we collaborate with research teams to explore optimization strategies suitable for model structures and make balanced decisions between model performance and user experience. Consequently, the mobile inference engine we develop ensures seamless and rapid responsiveness with a positive user experience even in resource-constrained environments, delivering AI-based user experience innovation to global users.

Engineering for Organizational Productivity Improvement

We identify and eliminate inefficiencies across the entire ML model lifecycle, from data collection and preprocessing to model deployment and monitoring, automating wherever possible. Beyond merely providing platforms and tools, we quantitatively measure the development experience of ML Engineers. We define and monitor key productivity metrics such as time to first experiment and model deployment lead time. By thoroughly analyzing and fixing the bottlenecks and root causes we identify, we foster a research and development environment where ML Engineers can focus solely on solving core business problems without spending time on infrastructure setup or debugging.
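
To make one of these metrics concrete, here is a small sketch of computing model deployment lead time (merge to production) from event timestamps. The events are fabricated; in practice they would come from CI/CD metadata.

```python
# Sketch: computing "model deployment lead time" from (merged_at, deployed_at)
# event pairs. The events are fabricated for illustration.
from datetime import datetime
from statistics import median

events = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 15, 30)),
    (datetime(2024, 5, 2, 11, 0), datetime(2024, 5, 3, 10, 0)),
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 18, 45)),
]

lead_times_h = [(dep - merge).total_seconds() / 3600 for merge, dep in events]
print(f"median deployment lead time: {median(lead_times_h):.1f} h")
```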

Requirements

  • 5+ years of experience in the design, implementation, and operation of software (backend, data engineering, distributed systems, etc.).
  • Solid knowledge of CS fundamentals (operating systems, networks, computer system architecture, data structures, and algorithms).
  • Deep understanding of two or more languages among Java (Kotlin), Python, Golang, JavaScript, Swift, C++, and Rust, and no difficulty adopting new languages.
  • Enjoys navigating various tech stacks through diverse development experiences and can quickly adapt to unfamiliar environments.
  • Experience in designing, building, and operating infrastructure in Public Cloud environments such as AWS, GCP, Azure.
  • Deep understanding and operational experience with container technologies such as Kubernetes and Docker.
  • Basic understanding and interest in ML/AI technologies and the MLOps ecosystem.
  • Excellent communication and collaboration skills with related departments (ML Engineer, Data Scientist, etc.).
  • Comfortable listening to and speaking English, and fluent in Korean.

Preferred Qualifications

  • Experience utilizing MLOps open-source tools such as Argo Workflows, ArgoCD, Kubeflow, MLflow.
  • Experience operating ML model serving platforms such as NVIDIA Triton, KServe/KFServing.
  • Experience operating large-scale Managed Kubernetes services such as EKS, GKE.
  • Experience using IaC (Infrastructure as Code) tools such as Terraform, CloudFormation.
  • Experience building and automating CI/CD pipelines using GitHub Actions, ArgoCD.
  • Experience building and operating monitoring systems using Prometheus, Grafana.
  • Deep understanding of Linux systems and networks.
  • Experience operating and scheduling large-scale GPU clusters (On-premise or Cloud).
  • Experience building and operating large-scale data processing pipelines (e.g., Spark, Airflow).
  • Experience developing internal tools such as SDKs, CLIs, developer portals for ML Engineers.
  • Experience utilizing AI accelerators such as AWS Inferentia/Trainium for cost optimization.
  • Experience serving or optimizing ML models operating in on-device environments.

Hiring Process

  • Employment Type: Full-time
  • Recruitment Process: Document Screening > Coding Test > 1st Interview > 2nd Interview > 3rd Interview (if applicable) > Final Offer
  • Only candidates who pass the document screening will be notified individually.
  • Application Documents: Free-form detailed resume based on career experience in Korean or English (PDF)

If any false information is found in the submitted documents, or if there are grounds for disqualification under relevant laws, the hiring may be canceled. If necessary, additional screening and document verification may be conducted beyond the process announced above.

National meritorious persons are given preferential treatment according to relevant laws; if applicable, please notify us when applying and submit supporting documents upon hiring.

When applying for a position at Hyperconnect, this privacy policy applies to the processing of personal information: https://career.hyperconnect.com/privacy

#HPCNT
