Celestica is seeking an experienced full-stack software developer to design, develop, and test AI-enabled automation infrastructure for data center racks. The role involves leading the development of a comprehensive control center for managing global network automation test infrastructure, collaborating with various teams to ensure high-quality software delivery.
Responsibilities:
- Lead the design, development and implementation of technical solutions for complex projects involving multiple domains. Participate in project planning and scheduling
- Serve as a global SME (subject-matter expert) with comprehensive knowledge and industry recognition. Provide technical leadership and direction to a global team of engineers
- Take responsibility for non-technical elements of an engineering project (people, financials, etc.)
- Review and interpret customer specifications; may act as the primary customer contact
- Analyze trade-offs in complex systems and recommend solutions. Develop deployment strategies and plans
- Lead the deployment of strategic complex programs and coordinate site-wide deployment efforts
- May manage relationships with key vendors/partners
- Analyze, design and develop tests and test-automation suites
- Design and develop a processing platform using various configuration management technologies
- Test software in line with the team's development methodology (may be done in an Agile environment)
- Provide ongoing maintenance, support and enhancements in existing systems and platforms
- Collaborate cross-functionally with customers, users, project managers and other engineers, including through peer reviews, to achieve elegant solutions
- Provide recommendations for continuous improvement
- Work alongside other engineers on the team to elevate technology and consistently apply best practices
- Keep up to date with relevant industry knowledge and regulations
- Architect a CI/CD Pipeline: Design the integration between Git-based workflows and physical hardware labs, ensuring code changes trigger automated builds and deployments to SONiC-based switches
- Cloud-to-On-Prem Connectivity: Lead the development of a cloud-hosted GUI and backend services that securely manage and command on-premise physical test beds
- Hardware Abstraction: Oversee the management of physical test beds, ensuring consistent state and availability for automated testing
- Framework Leadership: Standardize automated testing using SPyTest, ensuring robust coverage for NOS (Network Operating System) features
- Traffic Emulation: Integrate IXIA traffic generators into the automated suite to perform high-scale performance, stress, and regression testing
- Regression Management: Own the final validation gate, ensuring that no code reaches production without passing a rigorous, automated battery of tests on physical hardware
- Failure Analysis Agents: Build and deploy AI/LLM-based agents to parse complex log files and SPyTest results to identify the "root cause" of test failures automatically
- Self-Healing Test Beds: Develop agents capable of test bed failure recovery (e.g., automatically power-cycling hung PDUs, re-flashing corrupted ONIE images, or re-seating virtual links)
- Quality Insights: Leverage AI to analyze long-term software quality trends and predict potential regressions before they occur
Requirements:
- 12 to 18 years of experience
- Bachelor's degree, or an equivalent combination of education and experience
- Deep expertise in SONiC, SAI (Switch Abstraction Interface), and standard protocols (BGP, EVPN, VXLAN)
- Expert-level knowledge of SPyTest and Python-based automation
- Experience with IXIA (IxNetwork/IxLoad) and physical switch hardware (Mellanox/NVIDIA, Broadcom-based whitebox)
- Strong proficiency in Python, C/C++, Rust, or Java; experience building RESTful APIs and cloud-native backends (GCP/Azure)
- Familiarity with integrating LLM APIs (like Google Gemini) for text/log analysis
- Advanced experience with GitHub Actions, Azure DevOps or Jenkins, and containerization (Docker/Kubernetes)
- Active contributor to the Azure/SONiC open-source community
- Experience building custom dashboards using React or Vue.js
- Knowledge of deploying and operating software within GCP
- Background in developing 'Self-Healing' infrastructure or AIOps