Site Reliability Engineer Jobs - IT Site Reliability Engineer, 15105

at Venerable Holdings, Inc.
Location West Chester, PA
Date Posted April 5, 2019
Job Type Full-time


This is a great opportunity for a driven individual with an enterprise-wide technology view and diverse technical skills to establish and incorporate Site Reliability capabilities as part of a new Cloud migration.  The Site Reliability Engineer (SRE) will apply software engineering techniques and disciplines to transform the way IT Operations ensures the availability, scalability, security, and reliability of Venerable IT systems.  This operationally focused individual will leverage their cross-functional collaboration skills and unwavering quality focus to improve processes and automate routine tasks.  Success will be measured via metrics such as mean time to recover (MTTR) and mean time to failure (MTTF).

Principal Responsibilities:

  • Ensure the operational integrity, security, and effectiveness of development, validation, and production platforms.  Measure and manage performance against SLAs.
  • Define and enforce operational hand-off guidelines/frameworks to ensure everything that goes into production is secure, sustainable, and supportable.  Work with domain experts to create & maintain required artifacts.
  • Establish SLAs and means to measure them.
  • Troubleshoot and resolve complex inter-disciplinary issues.  Fix systemic problems.
  • Collaborate with cross-functional teams, providing recommendations through the entire lifecycle.
  • Drive continuous improvement through proactive identification and trend analysis.  Provide the ability to measure the quality of delivered systems.  Use automation to make routine activities efficient, repeatable, secure, and achievable without the need for “hands on keyboards.”
  • Develop infrastructure as code (IaC).  Improve infrastructure reliability by applying Test Driven Development (TDD) principles to infrastructure and platform services.
  • Proactively identify bottlenecks and potential points of failure. Implement impactful recommendations and measure effectiveness.
  • Perform system upgrades, updates, and patching. Streamline using automation.
  • Design effective monitoring / alerting approaches to proactively identify anomalies.  Work with stakeholders to define metrics, triggers, and thresholds.
  • Operationalize commonly used tools/services/platforms, making them reusable, accessible, and supportable across the enterprise.  Facilitate tool selection, instantiation, and adoption for common services such as log aggregation.  Build tools to make AWS services available while abstracting implementation details from service consumers.  Support Continuous Integration/Continuous Delivery (CI/CD).
  • Configure/help build pipelines facilitating automated testing and scalable deployments.
  • Research, evaluate, and implement operational improvements and architectural modifications.
  • Participate in change control, release planning, and other operational planning.
  • Other duties as assigned.

Knowledge, Skills, and Abilities:

  • Proven proficiency across core technical competencies including: DevOps tooling, troubleshooting/debugging, infrastructure engineering, automated pipeline, scripting, infrastructure as code (IaC), and AWS cloud services.
  • 6-10 years of related IT experience working in a 24x7 highly integrated organization with at least 2 years of direct SRE experience.
  • Self-starter with the drive to get things done right and ability to work effectively in a collaborative cross-functional team environment.
  • Expert ability to automate routine tasks.
  • Direct infrastructure as code (IaC) experience in large, highly available, cloud-centric environments.
  • Strong analytical, organizational, and problem-solving skills.  Demonstrated good judgement and decision-making ability; effectively recognizing and escalating issues as appropriate.
  • Proven expert-level ability to debug and optimize systems.
  • Demonstrated ability to define metrics that measure performance against SLAs/objectives.
  • Proven ability to implement design patterns promoting effective reusability, resiliency, and performance.

Proficiency utilizing standard SRE tools:

  • Configuration Automation: e.g. Ansible, Chef, CloudFormation, Puppet, Terraform
  • Build/Orchestration: e.g. Jenkins, GoCD, AWS CodeDeploy
  • Version Control & Repository: e.g. GitHub, Artifactory, SVN
  • Monitoring/Alerting Tools: e.g. AppDynamics, CloudWatch, Grafana, Nagios, New Relic
  • Log Aggregation and Parsing: e.g. ELK, Splunk
  • Production Debugging Tools: e.g. heap, core, and process dumps; Wireshark/network traffic captures
  • Scripting Tools: e.g. Bash, PowerShell
  • AWS Administration and Network experience is a plus.
  • Financial industry experience preferred.