Site Reliability Engineer
Job Description

We are seeking a proactive Site Reliability Engineer (SRE) to drive reliability, performance, and efficiency across our systems and platforms. You'll work closely with Application Development, QA, Product, and Data Engineering teams to champion a DevOps/SRE culture rooted in automation, observability, and continuous improvement.

Key Responsibilities:
Collaborate cross-functionally to promote SRE and DevSecOps best practices across the organization.
Build and maintain reliable, scalable systems with a focus on availability, performance, and resiliency.
Establish and monitor SLOs/SLIs, and develop comprehensive dashboards to support decision-making from both technical and business perspectives.
Lead efforts to reduce toil through automation, self-healing systems, and advanced monitoring (e.g., synthetic monitoring, RUM).
Apply observability and reliability testing practices from architecture through operations, leveraging Agile and product-based models.
Drive the adoption of cutting-edge tools in observability, automation, platform engineering, AIOps, and MLOps.
Contribute to and lead Communities of Practice (CoP) and SRE Office Hours to foster knowledge sharing and continuous improvement.

Qualifications:

SRE & DevOps Expertise:
Strong experience in observability, toil reduction, incident response, and performance optimization.
Proficient with monitoring tools such as Dynatrace, CloudWatch, and Azure Monitor.
Skilled in IaC, CaC, JSON, and scripting with Python, Node.js, Ruby, PowerShell, and Shell.
Deep understanding of Dynatrace advanced features: DT Guardian, RUM, Synthetic Monitoring, AI-based event correlation.

Cloud & Automation:
Expert in AWS Cloud services: CDK, Lambda, CloudWatch, EKS, EC2, ELB, S3, SSM.
Experience with log ingestion pipelines (AWS Firehose, Dynatrace OpenPipeline) and operational dashboards.
Hands-on experience with Ansible Tower, AWS SSM, Bitbucket/GitHub, and CI/CD workflows.

Orchestration & Data:
Familiarity with orchestration tools like Step Functions, Apache Airflow, and container platforms.
Knowledge of data pipelines, data lakes, and databases (Redshift, RDS, Aurora, PostgreSQL, SQL Server, Oracle).

Leadership & Communication:
Strong problem-solving and knowledge management skills.
Effective communicator who bridges technical and business teams.
Collaborative, inclusive leader who builds high-performing teams and fosters a culture of growth and recognition.