
Job Title: Senior SRE RunOps Engineer
Location: Irving, TX (Onsite)
Job Summary
We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to join a Production Support team responsible for ensuring the reliability, performance, and stability of enterprise production systems across cloud infrastructure, databases, and application services. The ideal candidate will possess strong operational expertise, advanced troubleshooting capabilities, and a passion for building resilient and scalable systems.
Key Responsibilities
Production Support & Incident Management
Serve as a primary responder for production incidents, ensuring rapid triage, mitigation, and resolution.
Lead Root Cause Analysis (RCA) efforts and drive long-term corrective actions.
Maintain and improve incident response processes, runbooks, and escalation procedures.
Collaborate with Engineering, QA, and Product teams to prevent recurring issues.
Participate in on-call rotations and provide release support and production verification.
Cloud Infrastructure Operations
Support and optimize cloud services including compute, container platforms, serverless functions, storage, monitoring, identity management, databases, and networking.
Monitor system health, performance, and capacity across cloud environments.
Implement best practices around reliability, scalability, security, and cost optimization.
Assist with deployments, environment configurations, and CI/CD processes.
Database & Storage Support
Manage and troubleshoot MongoDB environments, including replication, backups, failover, and performance tuning.
Diagnose query performance issues and collaborate with development teams on optimization.
Ensure data availability, integrity, and disaster recovery readiness.
Monitoring & Observability
Build and maintain dashboards, alerts, and monitoring solutions.
Leverage observability tools to monitor application and infrastructure performance.
Continuously refine alerting strategies to reduce noise and improve operational efficiency.
Required Skills
10+ years of experience in Site Reliability Engineering, DevOps, Production Support, or related operational roles.
Strong hands-on experience with AWS services and cloud-native architectures.
Experience supporting 24x7 production environments and participating in on-call rotations.
Strong MongoDB administration and troubleshooting experience.
Experience with monitoring, observability, and incident management tools such as New Relic, CloudWatch, MongoDB Charts, and ServiceNow.
Strong understanding of Linux, networking, and distributed systems.
Proficiency in scripting using Python, Bash, or similar languages.
Experience managing high-severity incidents and conducting Root Cause Analysis.
Familiarity with CI/CD tools such as Jenkins, GitHub Actions, or GitLab CI.
Experience with Postman, Firebase, and Microsoft Intune.
Preferred Skills
Experience with container orchestration platforms such as Kubernetes, ECS, or EKS.
Knowledge of messaging technologies such as Kafka, SQS, or RabbitMQ.
Experience working with microservices-based architectures.
Exposure to IoT devices and endpoint management solutions.
AWS or MongoDB certifications are a plus.