Beacon AI is a fast-moving team building an AI platform to enhance aviation safety and efficiency. They are seeking skilled Cloud and ML Infrastructure Engineers to lead the development of their AWS foundation and LLM platform, focusing on scalable and secure service design and implementation.
Responsibilities:
- Cloud Infrastructure Setup and Maintenance
  - Design, provision, and maintain AWS infrastructure using IaC tools such as AWS CDK or Terraform
  - Build CI/CD and testing for apps, infra, and ML pipelines using GitHub Actions, CodeBuild, and CodePipeline
  - Operate secure networking with VPCs, PrivateLink, and VPC endpoints. Manage IAM, KMS, Secrets Manager, and audit logging
- LLM Platform and Runtime
  - Stand up and operate model endpoints using AWS Bedrock and/or SageMaker; evaluate when to use ECS/EKS, Lambda, or Batch for inference jobs
  - Build and maintain application services that call LLMs through clean APIs, with streaming, batching, and backoff strategies
  - Implement prompt and tool execution flows with LangChain or similar, including agent tools and function calling
- RAG Data Systems and Vector Search
  - Design chunking and embedding pipelines for documents, time series, and multimedia. Orchestrate with Step Functions or Airflow
  - Operate vector search using OpenSearch Serverless, Aurora PostgreSQL with pgvector, or Pinecone. Tune recall, latency, and cost
  - Build and maintain knowledge bases and data syncs from S3, Aurora, DynamoDB, and external sources
- Evaluation, Observability, and Cost Governance
  - Create offline and online eval harnesses for prompts, retrievers, and chains. Track quality, latency, and regression risk
  - Instrument model and app telemetry with CloudWatch and OpenTelemetry. Build token usage and cost dashboards with budgets and alerts
  - Add guardrails, rate limits, fallbacks, and provider routing for resilience
- Safety, Privacy, and Compliance
  - Implement PII detection and redaction, access controls, content filters, and human-in-the-loop review where needed
  - Use Bedrock Guardrails or policy services to enforce safety standards. Maintain audit trails for regulated environments
- Data Pipeline Construction
  - Build ingestion and processing pipelines for structured, unstructured, and multimedia data. Ensure integrity, lineage, and cataloging with Glue and Lake Formation
  - Optimize bulk data movement and storage in S3, Glacier, and tiered storage. Use Athena for ad-hoc analysis
- IoT Deployment Management
  - Manage infrastructure that deploys to and communicates with edge devices. Support secure messaging, identity, and over-the-air updates
- Analytics and Application Support
  - Partner with product and application teams to integrate retrieval services, embeddings, and LLM chains into user-facing features
  - Provide expert troubleshooting for cloud and ML services with an emphasis on uptime and performance
- Performance Optimization
  - Tune retrieval quality, context window use, and caching with Redis or Bedrock Knowledge Bases
  - Optimize inference with model selection, quantization where applicable, GPU/CPU instance choices, and autoscaling strategies