Leidos Australia is an industry and technology leader serving government and commercial customers. The company is seeking a Senior Cloud Engineer to support the SEC ISS contract by engineering, operating, and continuously improving the enterprise observability platform across hybrid cloud and containerized environments.
Responsibilities:
- Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring
- Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise
- Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C TraceContext propagation, and unified service tagging across the estate
- Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows
- Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled
- Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services
- Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health -- correlating database spans with upstream APM traces
- Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM
- Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD
- Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry
- Lead data-driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate
- Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies
- Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence
- Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes
- Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps
- Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency
- Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders
- Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation
- Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations
Requirements:
- Citizenship/Work Authorization: Must meet contract requirements
- Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required)
- Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering
- Demonstrated experience engineering and operating an enterprise observability platform (Datadog strongly preferred; equivalent experience with Dynatrace, New Relic, Splunk Observability, or Grafana/Prometheus stacks considered)
- Proven experience building APM and distributed tracing coverage for production multi-tier applications -- including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation -- across cloud and containerized workloads
- Proven experience leading complex production performance and reliability problem-solving from telemetry to remediation
- Hands-on experience monitoring Kubernetes or OpenShift clusters and containerized workloads in production
- Enterprise observability platforms (Datadog or comparable): metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring (NPM)
- Instrumentation with OpenTelemetry, Datadog agents/SDKs, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) including custom spans, trace sampling strategies, W3C TraceContext propagation, and continuous profiling
- Microsoft Azure and AWS monitoring services and integrations (Azure Monitor, Log Analytics, CloudWatch, AWS X-Ray)
- Container and Kubernetes/OpenShift observability, including cluster, workload, and service mesh telemetry
- Cloud database monitoring: AWS RDS/Aurora (including Performance Insights), Azure SQL/PostgreSQL/MySQL (Query Performance Insight), and NoSQL/cache (DynamoDB, Cosmos DB, ElastiCache/Redis); query-level performance tuning, execution-plan analysis, and Datadog DBM or equivalent deep database APM
- Infrastructure-as-code for monitoring (Terraform, Bicep, ARM) and CI/CD-driven monitor/dashboard deployment
- APM and distributed tracing: service/dependency maps, trace analytics, RUM-to-backend correlation, exception/error tracking, deployment tracking, and trace-based SLOs
- Unified tagging strategy and cardinality governance across metrics/logs/traces (environment, service, version, ownership, data classification, cost center), including custom tag enrichment and tag-driven access/cost controls
- Alert engineering, SLO/SLI design, error budget management, and alert-noise reduction
- Performance engineering, capacity analysis, and telemetry-driven root-cause analysis
- Integration of observability with ITSM (ServiceNow) and on-call/paging workflows
- Experience supporting federal agency IT environments under FISMA/FedRAMP/NIST-aligned security and compliance requirements
- Datadog certification (Fundamentals and/or Administrator) or comparable enterprise observability certification
- Hands-on experience with Red Hat OpenShift Virtualization (CNV/KubeVirt) or other KubeVirt-based container virtualization observability
- Experience with eBPF-based observability tooling and service mesh telemetry (Istio, Linkerd)
- Experience implementing SLOs and error budgets at enterprise scale and integrating them into operational governance
- Experience with cost-aware observability practices, including telemetry volume optimization and retention tuning
- Experience integrating observability outputs with executive reporting, SLA/KPI dashboards, and capacity forecasting
- ITIL 4 Foundation
- AWS Certified Solutions Architect - Associate (or higher)
- Microsoft Certified: Azure Administrator Associate (or higher)
- Red Hat Certified Specialist in OpenShift Administration (or equivalent)
- HashiCorp Terraform Associate