Location
Beijing, Beijing, China
Salary
Not specified
Type
Full-time
Posted
Today
Job Description
This position is sourced from Liepin.

About the Project & Our Company
We are one of the biggest accounting companies in the world, with 25,000 staff in China. We are executing a high-visibility, strategic project to implement a separate, dedicated cloud-based instance of our global accounting system within China. This project is crucial for our future. To adhere to regulatory requirements, the new system must exclusively use Chinese AI technologies and Large Language Models (LLMs), e.g., Baidu ERNIE and Alibaba Tongyi Qianwen. This role joins a supportive global team of over 200 technology professionals.

Role Overview
We are seeking a proactive Site Reliability Engineer (SRE) to apply a specialized engineering approach to keeping our systems running. This role bridges the gap between the Development teams (who build the accounting system) and the Operations teams (who maintain it). You will ensure the system is reliable, scalable, and efficient while operating on local Chinese cloud infrastructure. We are looking for individuals with a background in software engineering and systems. We prioritize candidates with a strong academic background and technical aptitude, including those with limited experience (0-3 years).

Key Responsibilities
- Define and Maintain Service Reliability: Establish and monitor key Service Level Indicators (SLIs), set robust Service Level Objectives (SLOs), and manage the Error Budget to balance feature velocity against system stability.
- AI Reliability Monitoring: Monitor the operational health and performance of integrated Chinese LLMs, specifically focusing on Inference Latency to prevent the AI from slowing down the core accounting software.
- Local Infrastructure Operations: Manage the operational requirements of the China instance, specifically overseeing the connection between global code and local Chinese cloud providers (e.g., 21Vianet, Alibaba Cloud or Huawei Cloud).
- Automation and Toil Reduction: Proactively identify manual operations ("Toil") and write code, scripts, and automation tools to eliminate them, improving the system's "Self-Healing" capabilities.
- Engineering Excellence: Drive the adoption of modern DevOps practices, including automated testing, CI/CD practices, and infrastructure as code, ensuring reliable and scalable delivery.
- Compliance Integration: Work with security and development teams to ensure all infrastructure configurations and operational procedures comply with China's data security laws and internal governance policies.
- Collaboration and Communication: Document technical specifications clearly in both English and Chinese, and participate in agile ceremonies with the global team, which requires proactive communication in English.

Qualifications & Experience
Essential Requirements:
- Bachelor’s or Master’s degree in Computer Science, Software Engineering, Information Technology, or a related STEM field from a leading university.
- 0-3 years of relevant work experience; strong academic record and demonstrated technical aptitude are required.
- Proficiency in programming, particularly Python.
- Familiarity with systems administration, network fundamentals, and cloud environments (such as AWS, Azure, or GCP, or local Chinese cloud services).
- Understanding of the software development lifecycle and methodologies.
- Strong verbal and written English communication skills for effective collaboration in a multinational team environment.

Highly Desirable (Bonus):
- Familiarity with DevOps practices and tools, including CI/CD pipelines.
- Experience with containerization technologies (e.g., Docker, Kubernetes).
- Knowledge of MLOps practices relevant to the Chinese ecosystem.
- Understanding of the Chinese AI technology landscape and local data center operations.

SRE vs. Traditional Support: The SRE role is an engineering function, so its focus is proactive: building systems so that they do not break. In contrast, traditional IT support is reactive: it primarily fixes issues after they happen, relying on manual checklists and tickets. For example, an SRE writes code that detects a hung server and restarts it automatically, maximizing the system's ability to heal itself.
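
To make the self-healing idea concrete, here is a minimal Python sketch of a watchdog that restarts a service after several consecutive failed health checks. It is an illustration only, not part of the role's actual tooling: the `Watchdog` class, the `is_healthy` and `restart` callables, and the threshold of three failures are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Watchdog:
    """Restart a service after several consecutive failed health checks."""

    is_healthy: Callable[[], bool]  # hypothetical health probe supplied by the caller
    restart: Callable[[], None]     # hypothetical restart action supplied by the caller
    max_failures: int = 3           # assumed threshold, tune per service
    failures: int = 0               # consecutive failures seen so far
    restarts: int = 0               # total restarts triggered

    def tick(self) -> bool:
        """Run one health check; return True if a restart was triggered."""
        try:
            healthy = self.is_healthy()
        except Exception:
            healthy = False  # a probe that raises counts as a failed check
        if healthy:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.restart()
            self.restarts += 1
            self.failures = 0  # start counting again after the restart
            return True
        return False


# Simulated probe results: one healthy check, three failures, then recovery.
probe_results = iter([True, False, False, False, True])
actions = []
dog = Watchdog(
    is_healthy=lambda: next(probe_results),
    restart=lambda: actions.append("restart"),
)
outcomes = [dog.tick() for _ in range(5)]
```

In a real deployment the probe might be an HTTP health endpoint and the restart a call to an orchestrator, with `tick` run on a timer. Requiring several consecutive failures before acting is one common way to avoid restarting on a single transient blip.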