Lead Data Engineer

Guardian

Job Summary

As a Lead Data Engineer, you will lead the technical design and implementation of data engineering and MLOps solutions. The role involves mentoring junior engineers, analyzing raw data to create curated assets for ML and BI, and building scalable data pipelines. You will also develop and maintain the end-to-end MLOps lifecycle, implement monitoring frameworks, and contribute to advanced data and ML engineering strategy, ensuring high-quality deliverables and continuous improvement.

Job Description

You will

  • Lead technical design and implementation of data engineering and MLOps solutions, ensuring best practices and high-quality deliverables.
  • Mentor and guide junior engineers, conducting code reviews and technical sessions to foster team growth.
  • Perform detailed analysis of raw data sources, applying business context, and collaborate with cross-functional teams to transform raw data into curated and certified data assets for ML and BI use cases.
  • Create scalable and trusted data pipelines that generate curated data assets in centralized data lake/data warehouse ecosystems.
  • Monitor and troubleshoot data pipeline performance, identifying and resolving bottlenecks and issues.
  • Extract text data from a variety of sources (documents, logs, databases, web scraping) to support development of NLP/LLM solutions.
  • Collaborate with data science and data engineering teams to build scalable and reproducible machine learning pipelines for training and inference.
  • Lead development and maintenance of the end-to-end MLOps lifecycle to automate the development and delivery of machine learning solutions (a brief MLflow sketch follows this list).
  • Implement robust data drift and model monitoring frameworks across pipelines.
  • Develop real-time data solutions by creating new API endpoints or streaming frameworks.
  • Develop, test, and maintain robust tools, frameworks, and libraries that standardize and streamline the data & machine learning lifecycle.
  • Leverage public/private APIs to extract data and invoke functionality as required for use cases.
  • Collaborate with cross-functional teams across Data Science, Data Engineering, business units, and IT.
  • Create and maintain effective documentation for projects and practices, ensuring transparency and effective team communication.
  • Provide technical leadership and mentorship on continuous improvement in building reusable and scalable solutions.
  • Contribute to enhancing strategy for advanced data & ML engineering practices and lead execution of key initiatives of technical strategy.
  • Stay up to date with the latest trends in modern data engineering, machine learning, and AI.
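To make the MLOps expectations above concrete, here is a minimal, hypothetical sketch of logging and registering a model with MLflow (the experiment path, dataset, and model name are invented for illustration, and it assumes MLflow and scikit-learn are installed; it is a sketch, not Guardian's actual workflow):

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # Hypothetical experiment path, for illustration only.
    mlflow.set_experiment("/Shared/demo-experiment")

    X, y = load_iris(return_X_y=True)

    with mlflow.start_run():
        model = LogisticRegression(max_iter=200).fit(X, y)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        # Passing registered_model_name also registers the logged model,
        # so it can be promoted through the Model Registry.
        mlflow.sklearn.log_model(model, artifact_path="model",
                                 registered_model_name="demo_model")

On Databricks, a run like this would typically be scheduled as a Workflow on a Job Cluster, with the registered model promoted through Model Registry stages before serving.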

You have

  • Bachelor’s or Master’s degree in Computer Science, Data Science, Engineering, or a related field, with 8+ years of experience.
  • 5+ years of experience working with Python, SQL, PySpark, and Bash scripting. Proficient in the software development lifecycle and software engineering practices.
  • 3+ years of experience developing and maintaining robust data pipelines for both structured and unstructured data, used by data scientists to build ML models.
  • 3+ years of experience working with Cloud Data Warehousing platforms (Redshift, Snowflake, Databricks SQL, or equivalent) and distributed frameworks like Spark.
  • 2+ years of hands-on experience using the Databricks platform for data engineering and MLOps, including MLflow, Model Registry, Databricks Workflows, Job Clusters, the Databricks CLI, and Workspace.
  • 2+ years of experience leading a team of engineers, with a track record of delivering robust and scalable data solutions of the highest quality.
  • Solid understanding of machine learning lifecycle, data mining, and ETL techniques.
  • Experience with machine learning frameworks (scikit-learn, xgboost, Keras, PyTorch) and operationalizing models in production.
  • Proficiency with REST APIs and experience using different types of APIs to extract data or invoke functionality.
  • Familiarity with Pythonic API development frameworks like Flask/FastAPI and containerization frameworks like Docker/Kubernetes (a brief FastAPI sketch follows this list).
  • Hands-on experience building and maintaining tools and libraries used by multiple teams across the organization (e.g., Data Engineering utility libraries, DQ Libraries).
  • Proficient in understanding and incorporating software engineering principles into the design and development process.
  • Hands-on experience with CI/CD tools (e.g., Jenkins or equivalent), version control (GitHub, Bitbucket), and orchestration (Airflow, Prefect, or equivalent).
  • Excellent communication skills and ability to work and collaborate with cross-functional teams across technology and business.
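For the API development skills above, a minimal FastAPI scoring endpoint might look like the following (the route, request schema, and stand-in scoring logic are hypothetical; a real service would load a registered model rather than sum the inputs):

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ScoringRequest(BaseModel):
        # Hypothetical request schema, for illustration only.
        features: list[float]

    @app.post("/score")
    def score(req: ScoringRequest) -> dict:
        # Stand-in for model.predict(); a real endpoint would call a loaded model.
        return {"prediction": sum(req.features)}

A service like this is typically run with uvicorn (for example, "uvicorn main:app"), containerized with Docker, and deployed to Kubernetes.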

Good to have

  • Understanding of Large Language Models (LLMs) and the MLOps lifecycle for operationalizing them.
  • Familiarity with GPU compute for model training or inference.
  • Familiarity with deep learning frameworks and deploying deep learning models for production use cases.

Location:

This position can be based in any of the following locations:

Chennai

Current Guardian Colleagues: Please apply through the internal Jobs Hub in Workday
