
Optum

Lead Site Reliability Engineer - AI Platforms

Posted 2 Hours Ago
In-Office
Bengaluru, Bengaluru Urban, Karnataka
Senior level
Lead Site Reliability Engineer for AI platforms who builds and operates production-scale AI/ML and LLM infrastructure. Collaborates with research and product teams to deploy inference services, RAG pipelines, and scalable cloud-native platforms; implements CI/CD, observability, autoscaling, disaster recovery, security controls, and cost optimizations; mentors engineers and leads incident response and reliability initiatives.
The summary above was generated by AI
Requisition Number: 2369769
Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.
Primary Responsibilities:
  • Collaborate with research, engineering, and product teams to translate cutting-edge AI advancements into production-ready capabilities. Uphold ethical AI principles by embedding fairness, transparency, and accountability throughout the model development lifecycle
  • Comply with the terms and conditions of the employment contract, company policies and procedures, and any and all directives (such as, but not limited to, transfer and/or re-assignment to different work locations, change in teams and/or work shifts, policies with regard to flexibility of work benefits and/or work environment, alternative work arrangements, and other decisions that may arise due to the changing business environment). The Company may adopt, vary or rescind these policies and directives in its absolute discretion and without any limitation (implied or otherwise) on its ability to do so

Required Qualifications:
  • 8+ years of experience in SRE, DevOps, or Platform Engineering with large-scale systems
  • Hands-on experience with observability, monitoring, logging, tracing, alerting, and production operations
  • Experience deploying and operating AI inference services, RAG pipelines, vector databases, and AI serving platforms
  • Experience building and supporting CI/CD pipelines, deployment automation, and platform operational workflows
  • Experience implementing auto-scaling, load balancing, disaster recovery, failover, backup, and business continuity solutions
  • Experience supporting multi-region, multi-cluster, and distributed cloud environments
  • Experience working with event-driven architectures, messaging systems, and real-time processing workloads
  • Experience optimizing platform performance, resource utilization, AI inference workloads, and operational costs
  • Experience mentoring junior engineers and contributing to engineering best practices
  • Experience supporting production AI/ML, Generative AI, LLM, or data-intensive platforms
  • Experience with Kubernetes, containerization, and cloud-native deployment practices
  • Experience deploying and supporting AI services, APIs, inference endpoints, and RAG-based solutions
  • Experience with Infrastructure as Code (Terraform, CloudFormation, ARM, Pulumi, or equivalent)
  • Experience with monitoring, logging, tracing, observability, and alerting platforms
  • Experience implementing operational controls for backup, recovery, failover, and disaster recovery processes
  • Experience with AWS, Azure, or GCP environments
  • Experience supporting production incidents, troubleshooting, root cause analysis, and operational excellence initiatives
  • Experience optimizing platform reliability, performance, resource utilization, and operational costs
  • Proven experience in SRE, DevOps, Platform Engineering, Cloud Infrastructure, or Production Operations
  • Proven experience supporting and operating production-scale AI/ML, Generative AI, and LLM-based platforms
  • Solid experience implementing MLOps, LLMOps, model deployment, monitoring, and lifecycle management practices
  • Solid experience with cloud-native technologies, Kubernetes, container orchestration, and Infrastructure as Code
  • Knowledge of data security, governance, and compliance requirements for enterprise AI platforms
  • Knowledge of cloud security, IAM, RBAC, encryption, secrets management, and security best practices
  • Good understanding of distributed systems, high availability, scalability, fault tolerance, and reliability engineering principles
  • Good understanding of security best practices including IAM, RBAC, encryption, secrets management, and Zero Trust principles
  • Familiarity with MLOps, LLMOps, model deployment, monitoring, and AI application lifecycle management
  • Familiarity with event-driven architectures, messaging systems, and streaming platforms
  • Solid scripting and automation skills using Python, Bash, PowerShell, or equivalent technologies
  • Proven troubleshooting, incident management, root cause analysis (RCA), and production support experience
  • Proven ability to independently own platform services and reliability initiatives from implementation through operations
  • Proven collaboration and stakeholder management skills across AI/ML, Data Engineering, Security, and Platform teams

Technical Stack:
  • Cloud Platforms: AWS, Azure, GCP
  • Containers & Orchestration: Docker, Kubernetes (AKS, EKS, GKE), Helm
  • Infrastructure as Code: Terraform, CloudFormation, ARM, Pulumi
  • CI/CD & GitOps: Jenkins, GitHub Actions, GitLab CI, ArgoCD
  • MLOps / LLMOps: MLflow, Kubeflow, SageMaker, Azure ML, Vertex AI
  • AI Platforms: LangChain, LangGraph, RAG Frameworks, AI Agents
  • Model Serving: KServe, Triton, Seldon, Ray Serve, FastAPI
  • API & Platform Gateway: Kong, NGINX, Envoy, API Gateway
  • Service Mesh: Istio, Linkerd
  • Observability: Prometheus, Grafana, ELK Stack, Datadog, OpenTelemetry
  • Streaming & Messaging: Kafka, Event Hub, Pub/Sub
  • Data & Storage: S3, ADLS, GCS, Databricks, Snowflake, BigQuery
  • Security & Governance: IAM, RBAC, Vault, KMS, Encryption, Secrets Management
  • Networking & Reliability: DNS, CDN, Load Balancers, Traffic Routing, Failover Systems

Preferred Qualifications:
  • Experience with AI model serving platforms such as KServe, Triton, Seldon, or Ray Serve
  • Experience with LangChain, LangGraph, RAG orchestration, and Agentic AI workflows
  • Experience configuring API gateways, model gateways, and service mesh technologies
  • Experience with Istio, Linkerd, or enterprise service mesh platforms
  • Experience supporting multi-region and multi-cluster deployments
  • Experience in Banking, Healthcare, Financial Services, or other regulated industries
  • Knowledge of governance, compliance, and regulatory standards such as GDPR, HIPAA, SOC2, or ISO 27001
  • Exposure to GPU-based AI infrastructure and inference workloads
  • Exposure to FinOps, cloud cost optimization, and AI infrastructure cost management
  • Exposure to Platform Engineering and Internal Developer Platforms (IDP)

At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.

