Date Posted: August 29, 2019
Site Reliability Engineer
- Engage in and improve the whole lifecycle of software development services—from inception and design, through deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Work closely with development and operations teams to build highly available, cost-effective systems with extremely high uptime.
- Work with teams across the organization to ensure the reliability of core services, and keep an eye on capacity and performance.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health in a 24x7 environment.
- Participate in 24x7x365 on-call support for multiple core platforms globally. Using a “Follow the Sun” model, we expect working patterns to include on-call duty as well as weekend and holiday-season cover.
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Influence and create new designs, architecture, standards, and methods for large-scale systems.
- Bind and orchestrate the system infrastructure with the application layer to enable High Availability/Clustering, load balancing, and integration;
- Provide technical guidance or support for the development or troubleshooting of systems;
- Establish end-to-end monitoring and alerting on all critical aspects of every system to ensure SLOs, SLIs, and SLAs are met and to provide proactive notification of possible issues;
- Develop automated solutions to address potential problems before they result in a service interruption and demonstrate a passion for automation, including CI/CD automation;
- Establish performance baselines and capacity thresholds, correlate events, and define monitoring/alerting criteria.
- Bachelor of Science degree in Computer Science, Engineering, or equivalent relevant experience.
- Expertise in designing, analyzing and troubleshooting large-scale distributed systems.
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive;
- Ability to debug and optimize code and automate routine tasks;
- Overall 6+ years of experience in one or more of the following:
- Experience building JavaEE applications using tools such as Maven/Ant, Subversion, JIRA, Jenkins, Bitbucket, and Chef;
- Experience with continuous integration tools (Jenkins, SonarQube, JIRA, Nexus, Confluence, Git/Bitbucket, Maven, Gradle, Rundeck) is a plus;
- You've created automation using Chef, Puppet, or another SCM tool; experience with Docker and container scheduler services such as ECS or Kubernetes is desirable;
- You've worked with Nginx, Tomcat, HAProxy, Redis, Elasticsearch, MongoDB, RabbitMQ, Kafka, and ZooKeeper;
- Experience as an SCM/release engineer, or in a position with similar skill sets and responsibilities (Software Engineer, Systems Engineer, Systems Administrator);
- Experience with source code control management (Subversion/Git), including branching, merging, tagging, etc.;
- Experience in configuring and administering JavaEE application servers (Tomcat, WebSphere, WebLogic, etc.);
- Experience with scripting languages (Unix shells such as bash and ksh, Python, Perl);
- Experience in configuring, building, and supporting apps and operations in a public cloud environment (AWS, Azure, GCP);
- Experience with monitoring and logging tools (Elasticsearch, ELK, AppDynamics, Splunk, etc.);
- Collaborate well with team members, developers, QA, and ownership teams to resolve issues;
- Knowledge of Agile / Scrum methodologies and principles;
- Possess excellent written and verbal communication skills with the ability to communicate with team members at various levels, including business leaders;
- A real passion for and the ability to learn new technologies.