DataDirect Networks (DDN) is a global market leader in AI and high-performance data storage innovation. DDN is seeking a Staff Replication Development Engineer to lead the design and development of the replication engine for the Infinia AI Data Platform, with a focus on building enterprise-grade asynchronous replication for large-scale data systems.

Responsibilities:

Design and develop multi-threaded asynchronous replication systems with parallel streaming capabilities
Build object-level delta replication with checkpointing and resume functionality
Develop replication engines supporting bucket/share-level replication controls
Implement secure data transfer mechanisms using TLS 1.3 with mutual authentication
Ensure end-to-end data integrity through checksum validation and verification pipelines
Design and implement manual failover workflows for disaster recovery scenarios
Build and maintain REST APIs for replication configuration, control, and automation
Develop metadata tracking and change detection systems to enable efficient replication
Implement RPO visibility, alerting, and operational insights for replication status
Contribute to monitoring dashboards focused on replication health and performance
Ensure systems are designed for high availability, fault tolerance, and scalability
Partner with QA teams to drive performance, resiliency, and scale validation
Collaborate with backend, security, and platform teams to deliver end-to-end replication workflows
Participate in debugging, production issue resolution, and continuous improvement of replication reliability
Provide technical leadership, architectural guidance, and mentorship to the engineering team

Requirements:

8+ years of experience in distributed systems, storage systems, or backend software engineering
Strong programming skills in one or more languages: C++, Go, Java, or Rust
Experience designing and building data replication systems, data pipelines, or distributed data services
Deep understanding of distributed systems concepts (consistency, availability, scalability, fault tolerance)
Strong expertise in multi-threading, concurrency, and parallel processing
Knowledge of networking protocols and secure communication (TCP/IP, HTTP/HTTPS, TLS)
Experience implementing data integrity mechanisms (checksums, validation, consistency checks)
Experience designing and building REST APIs and service-based architectures
Familiarity with checkpointing, failure recovery, and retry mechanisms in distributed systems
Basic understanding of observability concepts (metrics, logging, alerting)
Strong debugging, problem-solving, and system design skills
Experience with asynchronous replication, disaster recovery (DR), or backup systems
Familiarity with object storage or large-scale data storage systems
Knowledge of delta encoding, change data capture, or incremental data synchronization techniques
Experience building high-throughput, low-latency data movement systems
Exposure to security practices including mutual TLS, encryption, and authentication
Experience working on enterprise-scale data platforms or storage products
Familiarity with performance optimization and large-scale system tuning
