Senior Site Reliability Engineer
About the Role
As a Senior Site Reliability Engineer, you will ensure the reliability, availability and performance of our systems. You will monitor and manage incidents, automate remediation, and implement scalable solutions. You will work with the Development and Infrastructure teams to optimize the application lifecycle from design to operation, use Terraform to manage infrastructure as code on AWS, and build dashboards with Grafana and Prometheus to visualize metrics. You will help improve DevOps and SRE practices by promoting automation and fast feedback, and you will keep security and compliance considerations in mind.
Requirements
- Minimum of 5 years of experience as a software engineer or SRE with a focus on high availability in financial operations
- Proficiency in Kubernetes for container orchestration and cluster management
- Strong knowledge of AWS and its services such as EC2, RDS and S3
- Experience with Infrastructure as Code (IaC), especially Terraform
- Experience with monitoring and observability using Prometheus, Grafana and other metrics tools
- Strong skills in infrastructure automation, CI/CD and DevOps practices
- Experience troubleshooting and resolving complex production issues
- Knowledge of security and compliance practices in cloud environments
- Desirable: Familiarity with Grafana Loki
- Desirable: Experience with Docker and containerized environments
- Desirable: Knowledge of scripting languages such as Python, Go or Bash
- Desirable: Experience with CI/CD tools like GitHub Actions
- Desirable: Knowledge of RabbitMQ or Kafka
- Desirable: FinOps practices for cloud cost optimization
- Desirable: Knowledge of financial markets and/or cryptocurrencies will be a differentiator
- Desirable: Experience with information security applied to operations (SecOps)
Responsibilities
- Ensure reliability, availability and performance of systems by automating processes and implementing scalable solutions
- Monitor and manage infrastructure incidents, ensuring rapid resolution of critical issues and building automation to prevent recurrence
- Collaborate with Development and Infrastructure teams to optimize the lifecycle of applications from design to operation
- Implement and maintain infrastructure as code using Terraform
- Monitor and optimize AWS resource usage and implement cost-efficient practices
- Create dashboards and reports with Grafana and Prometheus to visualize metrics
- Contribute to continuous improvement of DevOps and SRE practices by promoting automation and fast feedback
Benefits
- Health plan (SulAmérica)
- Dental plan (SulAmérica)
- Life insurance (Prudential)
- Swile card – 1168.64 BRL
- Transport allowance or home office stipend
- Learning incentives such as workshops, courses and language programs, available after 3 months
- Payroll loan after 3 months
- One day off in the week of your birthday
- Discounts on trading fees
- Referral program
- PLR – Profit sharing
