Role Overview

Manage, administer, and deploy Kubernetes and Spark cluster environments on bare-metal and container infrastructure, including service allocation and configuration for the cluster, capacity planning, performance tuning, and ongoing monitoring
Define and refine processes and procedures for the site reliability engineering practice
Set up, manage, and maintain scalable Kubernetes-based environments for high availability, and work with vendors to ensure smooth, continuous operations
Work closely with data scientists, data architects, data engineers, ETL developers, cybersecurity, network, Linux, and other IT counterparts, as well as business partners, to design and set up environments that manage the datasets ingested and processed from external sources, internal systems, and the data warehouse, and that support extracting features of interest
Evaluate, research, and experiment with data processing, management, and scalability technologies in a lab to keep pace with industry innovation, while assessing business impact and viability for the use cases at hand
Design, setup, test, deploy, monitor, document, and troubleshoot data processing and associated automation issues from the operations perspective
Work with IT Operations and Information Security Operations on monitoring and troubleshooting incidents to maintain service levels
Work with Information Security Vulnerability Management and vendors to remediate known impacting vulnerabilities
Contribute to the evolving distributed systems architecture to meet changing requirements for scaling, reliability, performance, manageability, and cost
Report utilization and performance metrics to user communities
Contribute to planning and implementation of new/upgraded hardware and software releases
Responsible for monitoring Linux, Kubernetes, Object Storage (MinIO), Feature Store, and Spark
Research and recommend innovative and, where possible, automated approaches for administration tasks
Identify approaches that improve efficiency in resource utilization, provide economies of scale, and simplify support
Responsible for administration of machine learning platforms and operations (MLOps), such as Kubeflow, JupyterHub, and Python
This role will support GMF international operations and will closely align with our GMF IT NorthStar architecture and operating principles

Requirements

5-7 years of hands-on experience supporting Linux production environments required
5-7 years of hands-on administration experience with Spark required
3-5 years of hands-on scripting experience with Bash, Perl, Ruby, or Python required
3-5 years of experience with Docker Datacenter required
2-4 years of hands-on administration experience with machine learning platforms required
Minimum of 1 year of experience with Mesos, Kubernetes, OpenShift, Deis, or a similar container orchestrator or platform-as-a-service required
Minimum of 1 year of hands-on experience with CI/CD tools and technologies required
Minimum of 1 year of experience leading a site reliability engineering team required
Hands-on experience with Microsoft Azure cloud technologies required
High School Diploma or equivalent required
Bachelor’s Degree in related field or equivalent experience required
Master’s Degree preferred

Tech Stack

Azure
Cloud
Cyber Security
Distributed Systems
Docker
ETL
Kubernetes
Linux
OpenShift
Perl
Python
Ruby
Spark

Benefits

401(k) matching
Bonding leave for new parents (12 weeks, 100% paid)
Training
GM employee auto discount
Community service pay
Nine company holidays

Lead Site Reliability Engineer
