Date Posted: September 4, 2019
Site Reliability Engineer (contract)
- Engage in and improve the whole lifecycle of software development services—from inception and design, through deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Work closely with development and operations teams to build highly available, cost-effective systems with extremely high uptime metrics. Work with teams across the organization to ensure the reliability of core services and keep an eye on capacity and performance.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health in a 24x7 environment.
- Participate in 24x7x365 on-call support for multiple core platforms globally. Using a “Follow the Sun” model, we expect working patterns to include on-call duty as well as weekend and holiday-season cover.
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems. Influence and create new designs, architecture, standards, and methods for large-scale systems.
- Bind and orchestrate the system infrastructure with the application layer to enable high availability, clustering, load balancing, and integration.
- Provide technical guidance or support for the development or troubleshooting of systems.
- Establish end-to-end monitoring and alerting on all critical aspects of every system to ensure that SLOs, SLIs, and SLAs are met and to get proactive notification of possible issues.
- Develop automated solutions that address potential problems before they result in a service interruption. Demonstrate a passion for automation, including CI/CD automation.
- Establish performance baselines and capacity thresholds, correlate events, and define monitoring/alerting criteria.