Date Posted: June 5, 2019
System Reliability Engineer DevOps
As a System Reliability Engineer, you are a full-stack engineer who can help support teams from the front end to the back end. You have experience building fully automated, highly elastic, cloud-orchestrated platforms on IaaS providers such as AWS, GCE, and/or Azure. You are a driven individual who seeks to make the application, and everything around it, better. You see containers as the future of CI/CD and are familiar with orchestrating them with tools like Docker and Kubernetes in either AWS or GCE.
Site Reliability Engineering (SRE) is a discipline that combines software and systems engineering for building and running large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to company engineering principles.
SRE is also an engineering approach to building and running production systems: we engineer solutions to operational problems. As SREs are responsible for overall system operation, we use a breadth of tools and approaches to solve a broad set of problems. Our practices include limiting time spent on operational work, holding blameless postmortems, and proactively identifying and preventing potential outages.
SRE's culture of diversity, intellectual curiosity, problem solving, and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn, grow, and take pride in our work.
- Highly capable developer, able to interact with other engineering groups and automate system tasks. You must have proven ability to code in one or more of Ansible, Bash, Python, Ruby, Go, or JavaScript.
- Strong Linux fundamentals: We use Linux from local development all the way to production, and you will need hands-on experience across our stack.
- Demonstrated initiative, flexibility and ability to concurrently manage multiple deadline-driven tasks and projects; self-starter
- Experience with open-source solutions like Grafana and Prometheus, or commercial solutions like Splunk, for log analysis. Splunk experience is required.
- Fast starter: Experience at a startup or nimble technology organization would match our fast-paced environment.
- Create mechanisms and architectures that enable rapid recovery, repair, and cleanup of faulty migrations, with a good understanding of fault tolerance and failure domains
- Identify opportunities to deliver self service capability for the most common infrastructure and application management tasks
- Provide detailed levels of monitoring across the application stack
- Write custom code or scripts to automate infrastructure, monitoring services, and test cases
- Write custom code or scripts to do "destructive testing" to ensure adequate resiliency in production
- Configure commercial off-the-shelf solutions to align with evolving business needs
- Create meaningful dashboards, logging, alerting, and responses to ensure that issues are captured and addressed proactively
- Provide application support for software running in production
- Proactively monitor production Service Level Objectives for products
- Integrate systems using a wide variety of protocols and formats such as REST, SOAP, MQ, JSON, and others
- Exhibit a deep understanding of server virtualization, networking, and storage, ensuring that the solution scales and performs with high availability and uptime
- Experience in either Terraform or Ansible, including experience with setting up similar tools from scratch
- Experience working with bots (Hubot, Turbot) for deployment automation
- Experience building and maintaining CI/CD pipelines
- Experience with monitoring systems (e.g. DataDog, PagerDuty, Wily, Splunk)
- Proficient in the use of CloudFormation, CloudWatch and CloudTrail
- Experience with higher level network protocols including HTTP and REST
- Experience with Infrastructure as a Service (e.g., AWS)
- Experience with multiple modern Linux application runtimes such as Node, JVM, Python & C
- Experience working in Agile, Scrum, and/or Kanban
- Experience with Bitbucket, Ansible, Docker, and Kubernetes
- Experience with CI/CD tools such as Jenkins or Spinnaker is a plus
- Experience with Redis, Kafka, Zookeeper
- Experience within a DevOps Culture
- 2+ years DevOps or SRE experience
- 2+ years on a Scrum/Agile team
- 2+ years of cloud experience is a plus