We're seeking an experienced Site Reliability Engineer (SRE) to join our IT SRE team.
• Embraces diverse people, thinking, and styles.
• Consistently makes safety and security of self and others the priority.
• 5 or more years of experience as an application developer or SRE.
• 2 or more years of experience with ops automation using a scripting language or automation tool such as Python or Ansible.
• Site Reliability Engineering: Knowledge of the theories and methodologies of reliability engineering; ability to design, develop and support various tools, services and applications to maintain a reliable site environment.
• Performance Measurement and Tuning: Knowledge of system performance, testing and programming; ability to monitor, measure, and optimize system performance and network communication.
• CI/CD Pipeline: Knowledge of the concepts, values, and tools applied in building Continuous Integration (CI) and Continuous Delivery/Continuous Deployment (CD) pipelines; ability to design, build, implement, and maintain CI/CD pipelines to automate the software delivery process.
• Software Release Management: Knowledge of strategies, practices, and tools for managing versions and distribution of software products and enhancements; ability to evaluate and improve release management practices and tools.
• Application Maintenance: Knowledge of production applications; ability to monitor application functions and resolve issues to maintain optimal conditions for system applications.
• Software Engineering: Knowledge of software engineering; ability to deliver new or enhanced software products.
• Agile Development: Knowledge of agile methodologies and the agile development lifecycle; ability to utilize formal agile methodologies, disciplines, practices, and techniques for the delivery of new and enhanced applications.
• Containers: Knowledge of the concepts, functions, and capabilities of container tools and techniques; ability to effectively apply containers in various IT business environments.
• Cloud Platform: Knowledge of the products and services offered by cloud platforms; ability to utilize related tools and technologies to develop cloud solutions and deploy applications on cloud platforms.
• Bachelor’s degree in Computer Science, Information Technology, or a related field is preferred.
• AWS Certified SysOps Administrator or AWS Certified DevOps Engineer certification is preferred.
• Experience with an APM tool such as Dynatrace, New Relic, AppDynamics, or Datadog is preferred.
• Experience with airline applications and infrastructure technology is a plus.
• Experience developing ops automation in Tekton pipelines is a plus.
• Experience developing applications and/or automation running in Red Hat OpenShift is a plus.
• Avionics systems experience is a plus.
The Site Reliability Engineer (SRE) will work with IT development squads to implement best practices for reliability and performance in the applications and services they support.
Our ideal candidate is well-versed in modern cloud-based and on-prem architecture and experienced in designing systems for reliability as well as implementing monitoring, alerting, and ops automation to reliably operate and maintain the services they build.
The SRE works with developers to improve the reliability and resiliency of software solutions so that they meet business requirements, by implementing SRE tools, processes, and best practices. SRE is what happens when you ask a software engineer to design an operations function. The SRE helps to design, develop, test, debug, and automate tasks for applications. They troubleshoot incidents to address failure patterns, automate remediation through runbooks, and document application optimization.
• Building and supporting a reliable application suite for the environment to meet the development and maintenance requirements of systems/platforms.
• Working with development teams to evaluate the health, stability and reliability of applications.
• Utilizing monitoring, alerts, dashboards, and management tools to ensure the availability, reliability and performance of applications and services.
• Constantly working to improve and implement automation of application tasks.
• Providing technical support for systems/platforms according to application SLAs.
• Designing and developing resiliency in the application code, troubleshooting incidents, engaging with squads to address failure patterns, and participating in incident management.
“All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.”