Dice is seeking a Big Data Dev/Spark Scala Engineer. The role involves designing and maintaining large-scale Spark applications, building streaming data pipelines, ensuring data quality, and collaborating with other teams on production readiness.
Responsibilities:
- Design, develop, and maintain large-scale Spark applications using Scala and PySpark
- Build and operate streaming-heavy data pipelines using Kafka and Spark Structured Streaming
- Implement stateful streaming patterns including windowing, watermarking, late data handling, and checkpointing
- Develop robust event replay and reprocessing workflows using Kafka offsets and partitions
- Build ingestion and routing flows using Apache NiFi, including Kafka-based ingestion patterns
- Implement end-to-end ETL/ELT pipelines with strong emphasis on low latency, fault tolerance, and scalability
- Optimize Spark jobs through partitioning strategies, memory tuning, shuffle optimization, and efficient data formats
- Integrate Spark workloads with distributed object storage systems such as Apache Ozone and Ceph
- Ensure data quality, consistency, and auditability through validation, reconciliation, and metadata capture
- Collaborate with platform, infrastructure, and operations teams on production readiness and capacity planning
- Support production systems, including monitoring, incident analysis, and root cause resolution
- Contribute to reusable frameworks, coding standards, and engineering best practices
- Participate in architecture reviews, code reviews, and technical documentation
Requirements:
- 7+ years of experience required
- Experience with Apache Ozone and/or Ceph as storage backends for analytics workloads
- Experience implementing exactly-once / at-least-once streaming semantics
- Strong background in Spark performance tuning (CPU, memory, I/O, shuffle)
- Experience supporting mission-critical production systems with strict SLAs
- Familiarity with CI/CD pipelines and automated testing for data applications
- Experience designing observability for streaming systems (lag, throughput, backpressure)
- Languages: Scala, Python (PySpark), SQL
- Big Data: Apache Spark (Core, SQL, Structured Streaming)
- Streaming: Kafka
- Ingestion / Orchestration: Apache NiFi
- Storage: Apache Ozone, Ceph, object storage concepts
- OS & Tooling: Linux, Git, CI/CD, monitoring and logging tools
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
- Strong hands-on experience with Apache Spark in production environments
- Advanced proficiency in Scala and PySpark
- Solid understanding of distributed systems and data processing at scale
- Strong experience with Kafka-based streaming architectures
- Hands-on experience with Spark Structured Streaming
- Experience building batch and real-time pipelines
- Hands-on experience with Apache NiFi for data ingestion and flow management
- Strong SQL skills and experience working with structured and semi-structured data
- Experience working with object storage or distributed storage platforms
- Proficiency with Linux, shell scripting, and Git-based version control