DataDirect Networks (DDN) is a global market leader in AI and high-performance data storage innovation. DDN is seeking a Staff Replication Development Engineer to lead the design and development of the replication engine for the Infinia AI Data Platform, with a focus on building enterprise-grade asynchronous replication capabilities for large-scale data systems.
Responsibilities:
- Design and develop multi-threaded asynchronous replication systems with parallel streaming capabilities
- Build object-level delta replication with checkpointing and resume functionality
- Develop replication engines supporting bucket/share-level replication controls
- Implement secure data transfer mechanisms using TLS 1.3 with mutual authentication
- Ensure end-to-end data integrity through checksum validation and verification pipelines
- Design and implement manual failover workflows for disaster recovery scenarios
- Build and maintain REST APIs for replication configuration, control, and automation
- Develop metadata tracking and change detection systems to enable efficient replication
- Implement RPO visibility, alerting, and operational insights for replication status
- Contribute to monitoring dashboards focused on replication health and performance
- Ensure systems are designed for high availability, fault tolerance, and scalability
- Partner with QA teams to drive performance, resiliency, and scale validation
- Collaborate with backend, security, and platform teams to deliver end-to-end replication workflows
- Participate in debugging, production issue resolution, and continuous improvement of replication reliability
- Provide technical leadership, architectural guidance, and mentorship to the engineering team
Requirements:
- 8+ years of experience in distributed systems, storage systems, or backend software engineering
- Strong programming skills in one or more of the following languages: C++, Go, Java, or Rust
- Experience designing and building data replication systems, data pipelines, or distributed data services
- Deep understanding of distributed systems concepts (consistency, availability, scalability, fault tolerance)
- Strong expertise in multi-threading, concurrency, and parallel processing
- Knowledge of networking protocols and secure communication (TCP/IP, HTTP/HTTPS, TLS)
- Experience implementing data integrity mechanisms (checksums, validation, consistency checks)
- Experience designing and building REST APIs and service-based architectures
- Familiarity with checkpointing, failure recovery, and retry mechanisms in distributed systems
- Basic understanding of observability concepts (metrics, logging, alerting)
- Strong debugging, problem-solving, and system design skills
- Experience with asynchronous replication, disaster recovery (DR), or backup systems
- Familiarity with object storage or large-scale data storage systems
- Knowledge of delta encoding, change data capture, or incremental data synchronization techniques
- Experience building high-throughput, low-latency data movement systems
- Exposure to security practices including mutual TLS, encryption, and authentication
- Experience working on enterprise-scale data platforms or storage products
- Familiarity with performance optimization and large-scale system tuning