AI Platform – Cloud/Site Reliability Engineer

Attention Job Applicants:

At Amgen, we are relentless in applying the highest ethical standards to our products, services and communications. Consistent with this, we expect all job applicants to act with honesty and integrity. Providing any false or misleading information, or omitting material information during the hiring process, may result in immediate disqualification from the hiring process or termination if already employed. We appreciate your cooperation in helping us uphold these standards.

India - Hyderabad

JOB ID: R-225945
ADDITIONAL LOCATIONS: India - Hyderabad
WORK LOCATION TYPE: On Site
DATE POSTED: Sep. 29, 2025
CATEGORY: Information Systems

Role Description:

We are looking for a Cloud/Site Reliability Engineer (SRE) to join our AI Platform team, focused on building and maintaining highly available, scalable, and secure infrastructure for AI/ML workloads. This role is critical to ensuring the reliability and performance of our AI services and platform components across cloud environments.

You will work closely with software engineers, ML engineers, and platform architects to design and implement robust monitoring, alerting, and incident response systems. You’ll also help automate infrastructure provisioning and deployment pipelines, and tune the performance of AI workloads in production.

Roles & Responsibilities:

Design and implement scalable, resilient cloud infrastructure to support AI/ML workloads.
Develop and maintain observability tools including monitoring, logging, and alerting systems for AI platform services.
Automate infrastructure provisioning and deployment using Infrastructure-as-Code (IaC) tools.
Collaborate with engineering teams to ensure high availability and performance of AI services.
Lead incident response and root cause analysis for platform outages or performance degradation.
Implement security best practices and compliance controls across cloud environments.
Optimize resource usage and cost efficiency of AI workloads in cloud environments.
Participate in sprint planning and contribute to platform architecture and reliability strategy.

Must-Have Skills:

Strong experience with cloud platforms (AWS, GCP, Azure) and cloud-native services.
Proficiency in scripting and automation (Python, Bash, Terraform, etc.).
Experience with containerization and orchestration (Docker, Kubernetes).
Familiarity with monitoring and logging tools (Prometheus, Grafana, ELK, Datadog).
Understanding of CI/CD pipelines and DevOps practices.
Experience with incident management, root cause analysis, and reliability engineering.
Knowledge of security principles and cloud compliance frameworks.
Ability to learn quickly and to stay organized and detail-oriented.

Good-to-Have Skills:

Exposure to AI/ML workloads and performance tuning for model inference and training.
Experience with MLOps tools (MLflow, Kubeflow, Airflow).
Familiarity with service mesh technologies (Istio, Linkerd).
Experience with cost optimization strategies in cloud environments.
Knowledge of distributed systems and fault-tolerant architecture.

Education and Professional Certifications:

Bachelor’s degree in Computer Science, Engineering, or a related field.
5–9 years of experience in cloud infrastructure, DevOps, or SRE roles.
Certifications in cloud platforms (AWS Solutions Architect, Azure Administrator, Google Cloud SRE) are a plus.

Soft Skills:

Excellent analytical and troubleshooting skills.
Strong verbal and written communication skills.
Ability to work effectively with global, virtual teams.
High degree of initiative and self-motivation.
Ability to manage multiple priorities successfully.
Team-oriented, with a focus on achieving team goals.
Strong presentation and public speaking skills.
