Position Title: Lead Site Reliability Architect, OSS/BSS & Mainframe

Location: TX, NJ, NC, and FL, USA

Work Arrangement: Hybrid/Onsite

Interview Type: Video

Must have:

15+ years of progressive experience in enterprise IT and telecommunications environments, with extensive expertise in designing, implementing, and supporting complex OSS/BSS ecosystems that enable large-scale business and network operations.
8+ years of hands-on architecture experience across IBM Mainframe z/OS and midrange platforms (Linux/Solaris), delivering scalable, secure, and highly available enterprise solutions.
Demonstrated expertise in Site Reliability Engineering (SRE) principles, including defining and managing Service Level Objectives (SLOs), Service Level Indicators (SLIs), Error Budgets, reliability governance, and continuous service improvement.
Deep functional and technical knowledge of Telcordia OSS applications, including SWITCH, TIRKS, FACS, WFA, and SOAC, with experience integrating and optimizing telecom operational support systems.
Proven ability to design and implement highly available, fault-tolerant, and resilient architectures, including disaster recovery, to ensure business continuity and the reliability of mission-critical systems.
Strong hands-on expertise with IBM Mainframe technologies, including z/OS internals, JCL, IMS, VSAM, DB2, CICS, system utilities, workload management, performance tuning, and production diagnostics.
Extensive experience implementing observability and monitoring solutions using industry-leading tools such as Splunk, Dynatrace, Instana, IBM Netcool, Grafana, and AppDynamics to improve operational visibility and proactive incident detection.
Proven success in driving automation, self-healing capabilities, infrastructure as code, CI/CD reliability practices, and DevOps/SRE transformation across hybrid cloud and on-premises enterprise environments.
Strong understanding of end-to-end telecommunications business processes, including service provisioning, inventory management, order management, activation, network fulfillment, service assurance, and lifecycle management.
Extensive experience leading major incident management and problem management, conducting Root Cause Analysis (RCA), and implementing preventive measures that significantly reduce MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolve) and improve system stability and operational excellence.
Proven ability to collaborate with cross-functional teams including Enterprise Architecture, Infrastructure, Development, Operations, Network Engineering, and business stakeholders to deliver highly reliable, business-critical technology solutions.
Excellent leadership, stakeholder management, and communication skills, with a strong track record of mentoring technical teams, driving reliability engineering best practices, and supporting large-scale enterprise transformation initiatives.
