Lead Site Reliability Engineer - AI Platforms

Optum

Posted 2 Hours Ago

In-Office

Bengaluru, Bengaluru Urban, Karnataka

Senior level

The Lead Site Reliability Engineer for AI platforms builds and operates production-scale AI/ML and LLM infrastructure. The role collaborates with research and product teams to deploy inference services, RAG pipelines, and scalable cloud-native platforms; implements CI/CD, observability, autoscaling, disaster recovery, security controls, and cost optimizations; and mentors engineers while leading incident response and reliability initiatives.

The summary above was generated by AI

Requisition Number: 2369769
Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.
Primary Responsibilities:

Collaborate with research, engineering, and product teams to translate cutting-edge AI advancements into production-ready capabilities
Uphold ethical AI principles by embedding fairness, transparency, and accountability throughout the model development lifecycle
Comply with the terms and conditions of the employment contract, company policies and procedures, and any and all directives (such as, but not limited to, transfer and/or re-assignment to different work locations, change in teams and/or work shifts, policies regarding flexibility of work benefits and/or work environment, alternative work arrangements, and other decisions that may arise due to the changing business environment). The Company may adopt, vary or rescind these policies and directives in its absolute discretion and without any limitation (implied or otherwise) on its ability to do so

Required Qualifications:

8+ years of experience in SRE, DevOps, or Platform Engineering with large-scale systems
Hands-on experience with observability, monitoring, logging, tracing, alerting, and production operations
Experience deploying and operating AI inference services, RAG pipelines, vector databases, and AI serving platforms
Experience building and supporting CI/CD pipelines, deployment automation, and platform operational workflows
Experience implementing auto-scaling, load balancing, disaster recovery, failover, backup, and business continuity solutions
Experience supporting multi-region, multi-cluster, and distributed cloud environments
Experience working with event-driven architectures, messaging systems, and real-time processing workloads
Experience optimizing platform performance, resource utilization, AI inference workloads, and operational costs
Experience mentoring junior engineers and contributing to engineering best practices
Experience supporting production AI/ML, Generative AI, LLM, or data-intensive platforms
Experience with Kubernetes, containerization, and cloud-native deployment practices
Experience deploying and supporting AI services, APIs, inference endpoints, and RAG-based solutions
Experience with Infrastructure as Code (Terraform, CloudFormation, ARM, Pulumi, or equivalent)
Experience with monitoring, logging, tracing, observability, and alerting platforms
Experience implementing operational controls for backup, recovery, failover, and disaster recovery processes
Experience with AWS, Azure, or GCP environments
Experience supporting production incidents, troubleshooting, root cause analysis, and operational excellence initiatives
Experience optimizing platform reliability, performance, resource utilization, and operational costs
Proven experience in SRE, DevOps, Platform Engineering, Cloud Infrastructure, or Production Operations
Proven experience supporting and operating production-scale AI/ML, Generative AI, and LLM-based platforms
Solid experience implementing MLOps, LLMOps, model deployment, monitoring, and lifecycle management practices
Solid experience with cloud-native technologies, Kubernetes, container orchestration, and Infrastructure as Code
Knowledge of data security, governance, and compliance requirements for enterprise AI platforms
Knowledge of cloud security and security best practices, including IAM, RBAC, encryption, secrets management, and Zero Trust principles
Good understanding of distributed systems, scalability, reliability, fault tolerance, high availability, and reliability engineering principles
Familiarity with MLOps, LLMOps, model deployment, monitoring, and AI application lifecycle management
Familiarity with event-driven architectures, messaging systems, and streaming platforms
Solid scripting and automation skills using Python, Bash, PowerShell, or equivalent technologies
Proven troubleshooting, incident management, root cause analysis (RCA), and production support experience
Proven ability to independently own platform services and reliability initiatives from implementation through operations
Proven collaboration and stakeholder management skills across AI/ML, Data Engineering, Security, and Platform teams

Technical Stack:

Cloud Platforms: AWS, Azure, GCP
Containers & Orchestration: Docker, Kubernetes (AKS, EKS, GKE), Helm
Infrastructure as Code: Terraform, CloudFormation, ARM, Pulumi
CI/CD & GitOps: Jenkins, GitHub Actions, GitLab CI, ArgoCD
MLOps / LLMOps: MLflow, Kubeflow, SageMaker, Azure ML, Vertex AI
AI Platforms: LangChain, LangGraph, RAG Frameworks, AI Agents
Model Serving: KServe, Triton, Seldon, Ray Serve, FastAPI
API & Platform Gateway: Kong, NGINX, Envoy, API Gateway
Service Mesh: Istio, Linkerd
Observability: Prometheus, Grafana, ELK Stack, Datadog, OpenTelemetry
Streaming & Messaging: Kafka, Event Hub, Pub/Sub
Data & Storage: S3, ADLS, GCS, Databricks, Snowflake, BigQuery
Security & Governance: IAM, RBAC, Vault, KMS, Encryption, Secrets Management
Networking & Reliability: DNS, CDN, Load Balancers, Traffic Routing, Failover Systems

Preferred Qualifications:

Experience with AI model serving platforms such as KServe, Triton, Seldon, or Ray Serve
Experience with LangChain, LangGraph, RAG orchestration, and Agentic AI workflows
Experience configuring API gateways, model gateways, and service mesh technologies
Experience with Istio, Linkerd, or enterprise service mesh platforms
Experience supporting multi-region and multi-cluster deployments
Experience in Banking, Healthcare, Financial Services, or other regulated industries
Knowledge of governance, compliance, and regulatory standards such as GDPR, HIPAA, SOC2, or ISO 27001
Exposure to GPU-based AI infrastructure and inference workloads
Exposure to FinOps, cloud cost optimization, and AI infrastructure cost management
Exposure to Platform Engineering and Internal Developer Platforms (IDP)

At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health, which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.
