Location: San Diego, CA
Date Posted: April 7, 2019
Looking for someone who is passionate about site reliability processes and working on one of the biggest cloud networks that brings in millions of users daily!
The team is responsible for monitoring and operating customer-facing services and applications. Another team provisions and architects the software in AWS and develops the tools; THIS TEAM provides application-level support for the services. The current team is 5-7 people plus the manager. When the services don't work, this is the team that resolves the issues. Because the platform is now almost fully in AWS, this Site Reliability Engineer must have experience with operations in an AWS environment. The role is 70% production and services (Java) support and 30% automation.
The platform is moving away from siloed infrastructure/apps support, and each team is now expected to own its services end to end. This is where the industry as a whole is heading, and only a few companies in San Diego are moving towards this proven method of efficiency. This would be great for anyone looking to be with a company that is staying on the cutting edge of automation and site reliability processes. Also, the pace and scale at which this team works would make anyone extremely marketable.
AWS - Deploying, Supporting, and managing applications (sysops)
Java (deploying and monitoring Java applications)
Linux - RHEL or CentOS
Enterprise environment - meaning high volume and critical production environments
As a Site Reliability Engineer and member of the Service Platform Operations Team you will closely support engineering teams in the provisioning, integration, configuration, deployment, monitoring, and incident response of the applications and services at the core of the Network handling millions of users and devices. The Service Platform Operations team handles application deployments, configuration, performance tuning and monitoring, capacity management, and production support for services which enable customers to access and enjoy a wide range of digital entertainment content seamlessly and across various devices and user interfaces. The Sr. Systems Engineer will support the team and drive improvements in process and technology of cloud and on-prem hosted services to improve continuous delivery, incident response, application availability, system resiliency and service monitoring.
The Senior Site Reliability Engineer will provide technical leadership to the Service Platform Operations Team as we configure, integrate, deploy, validate, monitor, and support services and applications. Responsibilities include:
§ Hands-on application management and support for AWS cloud and on-prem production environments, including full-stack diagnosis, fault resolution and root cause analysis.
§ Proactively monitor production systems and identify issues before they impact service.
§ Drive and implement monitoring tools/metrics/reports for tracking application/service performance.
§ Collaborate with engineering and system teams to drive changes and ensure optimal application performance and resiliency.
§ Lead service and system performance analysis, service capacity planning, and service continuity validation for multiple applications.
§ Implement scripts/tools to automate operational tasks/activities.
§ Review and influence design, architecture, standards, and methods for deploying, monitoring and operating services and applications.
§ Actively participate in and/or commit to the execution of tasks required to meet milestones and deliverables set by the SCRUM team throughout the release cycle.
§ Provide rotational on-call support.
§ BS degree in Computer Science, Engineering, or related technical discipline.
§ 5 years hands-on Linux experience (RHEL or CentOS preferred).
§ 3 years of relevant work experience in a high-volume and/or critical production environment.
§ 2 years hands-on AWS experience - Deploying, Supporting, and managing applications (sysops).
§ Proficient in using the typical Linux toolbox of open source software and management tools.
§ Experience with log management tools, e.g. Splunk, Logstash, Kibana.
§ Exceptional scripting skills (Python, shell, Go).
§ Hands-on experience in troubleshooting and performance tuning of Java applications.
§ Solid understanding of networking systems and protocols - HTTP, TCP/IP, SSL, DNS.
§ Experience with automation/configuration management using Jenkins, Ansible, Puppet, Chef, or a similar tool.
§ Experience with Agile SCRUM development methodologies, Continuous Integration and Continuous Delivery (CI/CD).
§ Experience in quality control and validating services in a production environment.
We're partners in transformation. We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team of 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company.