- Job Responsibilities
As a Site Reliability Engineer, you will oversee and manage the servers, infrastructure, and installed systems that support our business, ensuring their reliability, availability, and performance.
What you’ll be doing
● Develop and maintain the large-scale infrastructure that powers our services
● Build and maintain monitoring, alerting, and trending operational tools in cloud environments
● Investigate, diagnose, and resolve performance and reliability problems in a wide range of large-scale and high-throughput services
● Contribute to the handbook, runbooks, and general documentation
- Job Requirements
Who you are
● At least 3 years of experience working as a software engineer
● Strong coding skills, preferably in Python
● Experience operating a high-scale production environment with an emphasis on availability, latency, and a healthy customer experience
● Experience with infrastructure as code and configuration management tools such as Chef, Salt or Ansible
● Knowledge of Linux systems internals
● Knowledge of computer networking
● Knowledge of container and orchestration tools such as Docker and Kubernetes is a plus
● Experience with AWS or other cloud environments
● Excellent technical writing and documentation skills
● B.S. degree or equivalent experience