Site Reliability Engineer III
India - Hyderabad
Job ID: R-236583
Country: India - Hyderabad
Status: On Site
Date posted: Feb. 11, 2026
Job category: Engineering
Position Overview
The GCF5 Site Reliability Engineer is the senior technical leader for the HPC Enablement pillar. They define and socialize operational standards and patterns, lead multi-team delivery, mentor GCF4 engineers, and translate researcher needs into scalable compute enablement designs. They own pillar-level reliability, performance, cost efficiency, and SLA/SLO outcomes, and influence cross-team engineering quality.
This role reports to the GCF7 leader and partners closely with peer GCF5 domain leads across SCIP to ensure cohesive, scalable platform evolution.
Core Responsibilities
- Own the compute reliability and enablement roadmap within SCIP.
- Define onboarding playbooks and golden paths for HPC workloads.
- Establish containerization and reproducible runtime standards.
- Optimize scheduler configuration and resource allocation policies.
- Conduct workload profiling and performance tuning.
- Define and manage SLOs, reliability standards, and operational guardrails.
- Lead incident response and reduce recurring failures.
- Mentor engineers and elevate reliability practices.
- Partner with scientific teams to translate compute requirements into scalable infrastructure patterns.
Core Competencies
- Deep expertise in HPC enablement, with evidence of standard‑setting and reuse.
- Systems design at scale (HPC); performance, security, and observability fundamentals.
- Product/engineering thinking: roadmapping, prioritization, and outcome‑oriented delivery.
- Stakeholder influence across science, engineering, and governance forums; crisp written/verbal communication.
Core Success Measures
- HPC job success rate improvement.
- Reduction in MTTR for compute incidents.
- Performance improvements relative to baseline.
- Time-to-onboard new scientific workloads.
- Improvement in cost-per-compute-hour efficiency.
- Reduction in operational toil via automation.
Key Relationships
- Collaborates with GCF6 Group Lead and cross‑functional leaders (R&D/PD/Dev).
- Mentors and develops GCF4 Data and Software Engineers; partners with platform, data, ML, and research teams.
- Interfaces with governance (architecture, security, compliance) and vendor/partner teams.
Decision Authority
- Approve designs within the pillar; define and waive standards/patterns with rationale.
- Recommend buy‑vs‑build; commit pillar resources to meet SLAs/SLOs; escalate risks.
- Prioritize pillar backlog and roadmap in alignment with strategy and OKRs.
Qualifications
Basic Qualifications:
- BS+8 / MS+6 / PhD in CS/Engineering/Data disciplines.
- Demonstrated production delivery experience in HPC at scale.
- Demonstrated literacy in a relevant scientific domain (e.g., biology, chemistry, therapeutic discovery).
Preferred Qualifications:
- Depth in HPC enablement.
- Kubernetes and continuous integration/continuous delivery (CI/CD) at scale; observability, performance tuning, and security-by-design.
- Evidence of standard‑setting and cross‑team influence; mentoring experience.