Site Reliability Engineer Developer
Long Term Contract
OCC, DC, & Endeavor
Job Overview: Delta IT is on a journey of transformation to become the best IT organization in the airline industry. We are changing the way we do business from top to bottom as we strive to create meaningful and innovative solutions, and we are looking for team members to help us realize our vision. The Sr. Systems Engineer will possess a passion for technology and ensure 24x7 highly available and reliable operations systems for Delta. The successful candidate will be a highly motivated professional who can work in a fast-paced, fluid environment and provide business-facing application support, monitoring, analytics, and SLA management.
- Associate's degree or industry certification in an applicable IT field, in addition to four years of applicable experience in the design/administration/support of one or more platforms; or Bachelor's degree in an IT field, in addition to two years of applicable experience in the design/administration/support of one or more platforms; or five years of equivalent in-depth experience in the above related areas
- 5 or more years of experience as a Systems Engineer or Site Reliability Engineer
- 2 or more years of experience with ops automation using a scripting language such as Python or Ansible
- Experience with Dynatrace APM and synthetic monitoring
- Experience with airline applications and infrastructure technology is a plus
- Execute the Incident, Change Management, and Problem Management processes
- Build and support a reliable application suite for the environment to meet the development and maintenance requirements of systems/platforms
- Ensure platform performance and availability meet enterprise objectives through monitoring, timely service restoration, and tuning
- Continuously improve and implement automation of application tasks
- Provide technical support for systems/platforms according to application SLAs
- Design and develop resiliency in the application code, troubleshoot incidents, engage with squads to address failure patterns, and participate in incident management
- Focus on resolving issues before they become incidents
- Identify and articulate severity of impacts using provided monitoring tools and escalate as needed
- Understand the architecture and design of applications and narrow the focus of an incident based on symptoms
- Perform root cause analysis to quickly recover from service interruptions and to prevent recurring problems
- Monitor, manage, and tune platforms to ensure expected availability and performance levels are achieved
- Identify gaps in monitoring or documentation and reach out to appropriate teams to fill those gaps
- Implement changes to platforms with minimal impact to the business by following enterprise standards and procedures
- Design and document enterprise standards and procedures
- Knowledge of the theories and methodologies of reliability engineering; ability to design, develop and support various tools, services and applications to maintain a reliable site environment.
- Performance Measurement and Tuning: Knowledge of system performance, testing and programming; ability to monitor, measure, and optimize system performance and network communication.
- CI/CD Pipeline: Knowledge of concepts, values and tools applied in building Continuous Integration (CI), Continuous Delivery and Continuous Deployment (CD) pipelines; ability to design, build, implement and maintain CI/CD pipelines to automate the software delivery process.
- Software Release Management: Knowledge of strategies, practices and tools for managing versions and distribution of software products and enhancements; ability to evaluate and improve release management practices and tools
- Application Maintenance: Knowledge of production applications; ability to monitor application functions and resolve issues to maintain optimal conditions for system applications.