Site Reliability Engineer

Alibaba Group

US | Onsite | Posted Jun 3, 2026

Skills

Kubernetes, Prometheus, Go, Python, Istio, C++, Java

About the role

We are the AI Inference Platform team at Alibaba Group, committed to delivering a cutting-edge Model-as-a-Service (MaaS) platform and toolkits for application development through technological innovation and engineering practice. Our team focuses on fundamental R&D in model services and also provides full-stack development, from architecture design to model applications. Our goal is to build the industry's largest model service platform, with excellent cost-efficiency, high performance, and enterprise-level reliability, so that numerous enterprise clients can accelerate the development of model applications.

We are seeking a passionate and technically skilled Site Reliability Engineer (SRE) to join our team. You will play a critical role in building and maintaining a highly available, high-performance model service platform. Your responsibilities will include optimizing monitoring and alerting, responding to incidents, troubleshooting customer issues, and developing automation systems to ensure the stability and reliability of the AI Inference Platform and its system applications.

Key Responsibilities

1. Oversee the deployment, operation, maintenance, and continuous improvement of the standalone website and platform, including its initial construction and subsequent operational changes.

2. System Reliability: Oversee monitoring and alerting for our platform and system applications, rapidly diagnosing and resolving network, service, and hardware-level failures to meet SLA targets. Design and optimize monitoring metrics, log collection, and alerting strategies to enhance system observability. Participate in emergency response to online incidents, conduct root cause analysis (RCA), and drive long-term solutions to prevent recurrence.

3. Customer Issue Resolution: Investigate and resolve customer-reported issues related to the quality of service (QoS) of API services (e.g., latency, performance, optimization), collaborating with development teams to identify flaws in application clusters, edge networks, or infrastructure.

4. Automation & Continuous Improvement: Develop tools and scripts (Python/Go) to automate deployment, scaling, fault recovery, and other operational workflows. Build automated diagnostic toolchains to accelerate issue resolution and improve customer satisfaction.

Position Requirements

Minimum qualifications:

- 3+ years of experience in SRE, DevOps, or backend development, with expertise in distributed system operations. Experience with cloud computing, AI infrastructure, or Alibaba Cloud is a plus.

- Programming experience in at least one modern language, such as Python, Go, Java, or C++.

- Strong ability to work under pressure, manage critical incidents, and participate in an on-call rotation.

- Fluency in both Chinese and English for daily communication.

Preferred qualifications:

- Familiarity with MaaS or related knowledge.

- Deep knowledge of Linux systems, network protocols (TCP/HTTP), and databases, plus a deep understanding of cloud-native architecture design.

- Experience operating and maintaining large-scale container and Kubernetes clusters, with strong professional knowledge of cloud-native components (e.g., Prometheus, Istio, Calico).

- Extensive experience in building large-scale monitoring systems and utilizing them for in-depth analysis and operations.

Alibaba U.S.-based full-time regular employees have access to medical, dental, and vision insurance, a 401(k) plan, basic life insurance, and wellbeing benefits such as an FSA, subject to the terms and conditions of the applicable plans then in effect. U.S.-based employees are also eligible to receive up to 12 paid holidays, accrue up to 15 paid vacation days for this position, and receive up to 72 hours of paid sick time (front-loaded) per calendar year.
