Posted: 5/28/2026 - Expires: 7/2/2026
Job ID: 293447539
Platform Stability & High Availability: Conduct health checks, risk assessments, and preventive maintenance for database platform components. Design and implement HA solutions (e.g., automated fault recovery, adaptive disaster resilience) and cloud-native technologies. Optimize network architecture and Kubernetes (k8s) cluster operations for database services.
Operational Tooling & Automation: Develop platforms/tools for large-scale distributed systems management, including automated deployment, monitoring, and diagnostics. Enhance observability through metrics, logging, tracing, and alerting systems (e.g., Prometheus, Grafana, OpenTelemetry).
Incident Management & Optimization: Resolve live-site issues, including performance bottlenecks, capacity scaling, and security threats. Collaborate with product teams to refine architectures, reduce latency, and improve availability.
Cross-Functional Collaboration: Drive standardization of control-plane components (e.g., microservice frameworks, metadata services) across database engines.
1. Research and Development of Database Platform Infrastructure
Systems & Products: The employee will design and support Database-as-a-Service (DBaaS) platforms. This includes cloud-native database engines (such as PolarDB, RDS, or similar distributed SQL/NoSQL databases) and their control-plane orchestration systems.
Research Areas: Conduct research on Distributed Consensus Protocols (e.g., Paxos, Raft) to ensure data consistency and high availability. Research Adaptive Disaster Resilience algorithms to automate failover across multi-region cloud architectures.
Process: Lead the end-to-end lifecycle of high-availability solutions, from architectural design and prototyping to automated stress testing and chaos engineering to validate system robustness under extreme failure modes.
2. Large-Scale Distributed Systems Management & Tooling
Equipment & Systems: Work extensively with Kubernetes (K8s) orchestration, focusing on Custom Resource Definitions (CRDs) and Operators to manage stateful database workloads.
Tools & Technologies: Develop and maintain internal automation platforms using languages such as Go (Golang), Java, or Python. Utilize Prometheus, Grafana, and OpenTelemetry to build advanced observability frameworks that provide real-time telemetry and predictive diagnostics for thousands of database nodes.
Specific Projects: Development of an automated Database Fleet Management System that handles seamless patching, scaling, and migration of large-scale distributed clusters without service interruption.
3. Network Architecture and Cloud-Native Optimization
Technical Focus: Optimize the networking stack within virtualized environments (e.g., Service Mesh, VPC configurations, Load Balancers) to minimize tail latency and maximize throughput for database traffic.
Industry Application: These duties are situated within the Cloud Computing and Information Technology Services industry, specifically focusing on Infrastructure-as-Software and Large-Scale Data Management.
4. Incident Management and Security Performance
Process: Implement a systematic approach to Root Cause Analysis (RCA) for complex live-site incidents involving performance bottlenecks, such as CPU saturation, I/O wait times, or memory leaks in distributed environments.
Security: Design and implement automated security auditing tools to ensure database components comply with industry standards (e.g., encryption at rest/in transit, identity and access management).
Telecommuting may be permitted. When not telecommuting, must report to worksite.
Requirements:
- Bachelor's degree or foreign degree equivalent in Computer Science, Information Science, or related field.
- 2 years of experience as a Site Reliability Engineer II, or in any other related occupation or job title/position.
Worksite Address:
205 108th Ave NE, Suite 400, Bellevue, WA, 98004
Job Summary
Company Details
Company: Alibaba Cloud US LLC
Industry: All Other Professional, Scientific, and Technical Services
Contact method: Email (Apply by Email)
Job Information
Location: Bellevue, WA
Job Type: Full Time Employee
Education Level: Bachelor's degree
Job Position: 1 Position(s) Open
Salary/Wage: $144,000.00 - $172,800.00 /year
Duration: Over 150 Days
Additional Information
Reference Code: 9849968
Federal Contractor: No
Affirmative Action Plan: No