Site Reliability Engineer (The Reliability Guardian)
Job Description
Are you passionate about building and maintaining resilient systems that ensure high availability and performance?
Do you excel at automating processes, troubleshooting complex issues, and creating systems that scale smoothly?
If you’re ready to take on the challenge of ensuring reliable, efficient, and secure system operations, our client has the perfect role for you.
We’re looking for a Site Reliability Engineer (aka The Reliability Guardian) to enhance system reliability, implement automation, and support a seamless user experience.
As a Site Reliability Engineer at our client, you’ll collaborate with developers, DevOps engineers, and IT specialists to build infrastructure that is both resilient and scalable.
Your expertise in monitoring, automation, and performance optimization will be crucial for maintaining system uptime and supporting continuous improvement.
You’ll play a vital role in making sure that services are reliable, efficient, and prepared to handle the demands of the future.
Key Responsibilities:
Ensure System Reliability and Performance: Design and implement strategies to enhance system reliability and performance, focusing on scalability and redundancy.
You’ll ensure high availability across distributed systems and proactively address potential issues.
Automate Processes and Improve Efficiency: Develop automation scripts and tools to reduce manual interventions and improve deployment, monitoring, and maintenance processes.
You’ll leverage tools like Ansible, Puppet, or custom scripts to enhance automation.
Monitor and Respond to System Health: Implement and manage monitoring solutions such as Prometheus, Grafana, or Datadog to track system health.
You’ll set up alerts, dashboards, and automated responses to maintain optimal performance and detect potential failures early.
Incident Management and Troubleshooting: Lead incident response efforts to quickly address and resolve service disruptions.
You’ll document incidents and contribute to post-mortem analysis to prevent future occurrences and refine operational procedures.
Collaborate on System Architecture and Scalability: Work with engineering and development teams to design and scale infrastructure.
You’ll contribute to decisions on architectural improvements and provide input on capacity planning and load testing.
Implement Security and Compliance Standards: Integrate security practices into the reliability workflow, ensuring that all automated processes, monitoring solutions, and operational systems meet compliance and security standards.
Develop and Maintain CI/CD Pipelines: Support and improve continuous integration and deployment pipelines to facilitate smooth code releases.
You’ll ensure that pipelines are optimized for speed, reliability, and scalability.