Date Posted: August 22, 2019
As a Site Reliability Engineer, you will use your software and systems engineering background to build and run large-scale, distributed, fault-tolerant systems. Your role is to ensure that our systems, both internal and external-facing, are reliable and achieve maximum uptime.
Our current team focuses on optimizing existing systems, building infrastructure, and eliminating work through automation. You are responsible for the big picture of how our systems relate to each other, and we use a breadth of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on manual operational work, running postmortems, and proactively identifying potential outages feed the iterative improvement that is key to both product quality and technical standards.
- Build scalable systems using automation best practices, and push changes that improve reliability and velocity
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, planning and reviews
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
- Provide mentorship and training to other team members on technologies and processes; drive education and knowledge transfer of design patterns, technical practices, and relevant technologies and tools
- Drive high standards around incident response practices and policies
- 4+ years of experience in an operations, DevOps, SRE, or software engineering role
- In-depth experience with cloud computing and solid experience setting up and managing cloud infrastructure
- You can write code, in any language, and you have shipped your work to production
- Extensive experience with configuration management and infrastructure automation tools, e.g. Ansible, Terraform, SaltStack, Puppet, or Chef
- Experience with large-scale distributed systems in the cloud and with concerns such as load balancing and disaster recovery
- Experience with the operational aspects of software systems such as monitoring, centralized logging, and alerting
- Bachelor's degree in Computer Science or Computer Engineering