Nscale is the GPU cloud engineered for AI, providing infrastructure for AI start-ups and large enterprises. The Infrastructure Software Engineer for Fleet & Automation will ensure performance and scalability of AI and HPC environments, focusing on building and maintaining automation and control systems.

Responsibilities:

Own the technical architecture, roadmap, and implementation of workflow automation systems, driving architecture decisions that balance automation complexity, reliability, and maintainability. Identify and resolve performance and scalability issues. Establish technology and product direction in collaboration with other tech leads, managers, and senior leadership
Own end-to-end delivery of device provisioning, validation, testing, and remediation workflows at scale
Design and build workflow orchestration systems for hardware lifecycle management, including GPU nodes and network switches
Partner with Infrastructure, Platform, and SRE teams to translate operational needs into robust, scalable automation
Establish engineering standards for reliability, observability, and operational excellence across all services. Help set up engineering best practices in collaboration with the broader engineering team
Build production-grade Python systems for hardware lifecycle automation, leveraging AI tools to accelerate delivery. Assess the impact of new hardware product programs on the team's software stack, and explore AI-driven process improvement and automation
Collaborate with cross-functional teams (product, design, operations, infrastructure) to build efficient, interoperable, and maintainable automated systems

Requirements:

Education: Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Experience: 5+ years of relevant experience building large-scale infrastructure applications, or similar experience
Programming: Experience with languages such as C, C++, and Java, and with scripting languages such as Python, including API design and unit testing techniques
Systems Expertise: Deep understanding of Linux operating systems, networking fundamentals (TCP/IP, BGP), and familiarity with configuration management tools (e.g., Ansible, Terraform)
Distributed Systems: Experience building, running, and debugging large-scale infrastructure and stateful and stateless services for distributed systems or networks, plus experience with compute technologies, storage, or hardware architecture. Experience integrating with infrastructure tooling such as DCIM systems, NetBox, OpenStack, and bare metal APIs (MAAS, Ironic, IPMI)
Preferred qualifications:

Master's degree or PhD in Engineering, Computer Science, or a related technical field
Experience designing, analyzing and improving efficiency, scalability, and performance of various system resources
Direct experience with AI/HPC infrastructure, including NVIDIA GPUs, InfiniBand or high-speed Ethernet fabrics, and related management software (e.g., NCCL, SLURM)
Experience with advanced observability and monitoring systems (Prometheus, Grafana, OpenTelemetry) for complex, high-cardinality telemetry data
Familiarity with cloud-native technologies (Kubernetes, Docker) and infrastructure-as-code principles
Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
Familiarity with SLO and metrics measurement, and with integrating logs, telemetry, and metrics into tooling that improves the operator experience

Infrastructure Software Engineer, Fleet & Automation