Databricks is a leading data and AI company that empowers teams to tackle complex challenges through innovative solutions. The Staff Backline Engineer will troubleshoot and optimise the Data and AI infrastructure, ensuring the stability and reliability of production workloads while driving product improvements and operational excellence.
Responsibilities:
- Conduct deep-dive forensics into Spark core internals and the broader Databricks Data and AI ecosystem to resolve high-priority architectural failures and complex system anomalies
- Perform advanced code-level analysis and resource profiling to identify and mitigate systemic root causes, ensuring the stability and reliability of high-scale production workloads
- Optimise architectural performance across the Data and AI stack by refining execution parameters and enforcing best-practice strategies to maximise resource efficiency and throughput
- Analyse global issue trends and patterns to partner directly with Product Engineering, influencing the product roadmap and driving initiatives that enhance long-term supportability
- Develop reproduction frameworks, automated workflows, and AI-driven diagnostic tools that translate complex backline findings into standardised resolution paths to empower and scale the broader organisation
Requirements:
- 10+ years of relevant experience
- Deep expertise in one of the following three specialised tracks: Data Engineering, Product Supportability, or AI
- Proven experience managing both customers and technical stakeholders
- For Data Engineering Track:
  - Expertise in large-scale big data solutions and ETL pipelines using Spark, Delta Lake, or Hive
  - Strong experience troubleshooting failures, diagnosing performance issues, and identifying root causes
  - Demonstrated problem-solving ability and understanding of data engineering best practices
  - Solid hands-on programming skills in Python, SQL, or Scala
- For Product Supportability Track:
  - Deep understanding of distributed system internals
  - Ability to perform code-level root-cause analysis and profiling in Java, Scala, or Python
  - Proven record of contributing to bug fixes and mentoring other engineers
- For AI Track:
  - Experience with large-scale machine learning and generative AI systems
  - Strong grasp of model training, evaluation, and deployment in distributed environments
  - Experience managing the ML lifecycle, including governance and operationalisation
  - Skilled in diagnosing and optimising distributed ML workloads for performance and scalability