Senior Data Engineer (PySpark)

5-8 Years • Data Analysis

Job Description

We are looking for a highly skilled and experienced Senior Data Engineer to join our team in Bangalore. The ideal candidate will have a strong background in building scalable and high-performance data pipelines using PySpark and the Apache ecosystem. This role involves close collaboration with Data Scientists, Analysts, and cross-functional teams to drive robust data solutions.

Must have:

Design, develop, and optimize distributed data pipelines using PySpark.
Work with Apache tools such as Hadoop, Hive, and HDFS for large-scale data processing.
Ensure performance, reliability, and scalability of ETL workflows.
Collaborate with stakeholders to gather requirements and deliver scalable data solutions.
Implement robust data quality checks and lineage tracking.
Handle data integration from diverse structured and unstructured sources.
Write clean and maintainable code primarily in Python, with working knowledge of Java.
Participate in architectural discussions and performance tuning.
5–7 years of experience in data engineering roles.
Expertise in PySpark for distributed computing and data transformation.
Strong understanding of the Apache ecosystem (Hadoop, Hive, Spark, HDFS).
Knowledge of ETL principles, data modeling, and data warehousing concepts.
Experience working with large-scale datasets and optimizing performance.
Hands-on proficiency with SQL and exposure to NoSQL databases.
Solid coding skills in Python, with working knowledge of Java.
Experience with version control (Git) and working in CI/CD environments.

Good to have:

Experience using Apache NiFi for automated data flow orchestration.

Job Details

Key Responsibilities:

Design, develop, and optimize distributed data pipelines using PySpark.
Work with Apache tools such as Hadoop, Hive, HDFS, and others for large-scale data ingestion, transformation, and processing.
Ensure the performance, reliability, and scalability of ETL workflows in production environments.
Collaborate with stakeholders to gather requirements and deliver scalable data solutions.
Implement robust data quality checks and lineage tracking for auditability and transparency.
Handle data integration from diverse structured and unstructured sources.
Utilize Apache NiFi (if applicable) for automated data flow orchestration.
Write clean and maintainable code primarily in Python, with working knowledge of Java.
Participate in architectural discussions and performance tuning initiatives.