Leidos Australia is an industry and technology leader serving government and commercial customers. The company is seeking a Senior Cloud Engineer to support the SEC ISS contract by engineering, operating, and continuously improving the enterprise observability platform across hybrid cloud and containerized environments.

Responsibilities:

Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring
Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise
Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C TraceContext propagation, and unified service tagging across the estate
Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows
Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled
Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services
Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health -- correlating database spans with upstream APM traces
Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM
Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD
Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry
Lead data-driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate
Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies
Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence
Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes
Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps
Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency
Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders
Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation
Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations

Requirements:

Citizenship/Work Authorization: Must meet contract requirements
Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required)
Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering
Demonstrated experience engineering and operating an enterprise observability platform (Datadog strongly preferred; equivalent experience with Dynatrace, New Relic, Splunk Observability, or Grafana/Prometheus stacks considered)
Proven experience building APM and distributed tracing coverage for production multi-tier applications -- including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation -- across cloud and containerized workloads
Proven experience leading complex production performance and reliability problem-solving from telemetry to remediation
Hands-on experience monitoring Kubernetes or OpenShift clusters and containerized workloads in production
Enterprise observability platforms (Datadog or comparable): metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring (NPM)
Instrumentation with OpenTelemetry, Datadog agents/SDKs, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) including custom spans, trace sampling strategies, W3C TraceContext propagation, and continuous profiling
Microsoft Azure and AWS monitoring services and integrations (Azure Monitor, Log Analytics, CloudWatch, AWS X-Ray)
Container and Kubernetes/OpenShift observability, including cluster, workload, and service mesh telemetry
Cloud database monitoring: AWS RDS/Aurora (including Performance Insights), Azure SQL/PostgreSQL/MySQL (Query Performance Insight), and NoSQL/cache (DynamoDB, Cosmos DB, ElastiCache/Redis); query-level performance tuning, execution-plan analysis, and Datadog DBM or equivalent deep database APM
Infrastructure-as-code for monitoring (Terraform, Bicep, ARM) and CI/CD-driven monitor/dashboard deployment
APM and distributed tracing: service/dependency maps, trace analytics, RUM-to-backend correlation, exception/error tracking, deployment tracking, and trace-based SLOs
Unified tagging strategy and cardinality governance across metrics/logs/traces (environment, service, version, ownership, data classification, cost center), including custom tag enrichment and tag-driven access/cost controls
Alert engineering, SLO/SLI design, error budget management, and alert-noise reduction
Performance engineering, capacity analysis, and telemetry-driven root-cause analysis
Integration of observability with ITSM (ServiceNow) and on-call/paging workflows
Experience supporting federal agency IT environments under FISMA/FedRAMP/NIST-aligned security and compliance requirements
Datadog certification (Fundamentals and/or Administrator) or comparable enterprise observability certification
Hands-on experience with Red Hat OpenShift Virtualization (CNV/KubeVirt) or other KubeVirt-based container virtualization observability
Experience with eBPF-based observability tooling and service mesh telemetry (Istio, Linkerd)
Experience implementing SLOs and error budgets at enterprise scale and integrating them into operational governance
Experience with cost-aware observability practices, including telemetry volume optimization and retention tuning
Experience integrating observability outputs with executive reporting, SLA/KPI dashboards, and capacity forecasting
ITIL 4 Foundation
AWS Certified Solutions Architect - Associate (or higher)
Microsoft Certified: Azure Administrator Associate (or higher)
Red Hat Certified Specialist in OpenShift Administration (or equivalent)
HashiCorp Terraform Associate

Cloud Engineer - Senior (Observability - Datadog)
