Site Reliability Engineer

Company Description

Reloading® is a trademark registered in 2014 (owned by Holding TR105), built on results achieved since 2009 in training and consultancy services, with over 500 trainees and collaborations with more than 30 national and international companies.

Areas of Activity:
  • Business Analysis / Demand Management
  • Project Management
  • Service Management

Portfolio: Reloading's solution portfolio is divided into two areas that follow the established offering model:
  • Training: Every offering begins with an initial assessment, which tailors Reloading's approach to the specific needs and requirements of each client.
  • Consultancy: Consultancy is aimed at clients of an entrepreneurial nature.

Job Description

Required skills and competencies:

- System Monitoring and Alerting: Proficiency in setting up and managing monitoring systems to track the health, performance, and availability of systems and applications, including configuring alerts for proactive incident response. Previous experience with monitoring backends such as Datadog, CloudWatch, or Prometheus.
- Incident Management: Ability to respond promptly to incidents, troubleshoot issues efficiently, and coordinate with relevant teams to resolve incidents with minimal impact on services.
- Cloud Infrastructure: Experience with public cloud platforms such as AWS (preferred), Azure, or Google Cloud Platform (GCP), including proficiency in managing and optimizing cloud resources for availability, scalability, and cost-effectiveness.
- Networking and Infrastructure: Proficiency in networking concepts and infrastructure management, including experience configuring and maintaining network devices, servers, and storage systems using Linux.
- Containerized Workloads: Deploy and manage containerized applications within Kubernetes clusters. More specifically:
  · Implement best practices for containerization, including Docker image creation and container registry management.
  · Scale applications within Kubernetes clusters and implement autoscaling based on resource utilization metrics.
  · Configure Kubernetes Services for service discovery and load balancing to distribute traffic across application instances.
  · Implement monitoring and logging solutions within Kubernetes clusters to track the health and performance of applications.
  · Apply security best practices for securing container images and managing access control within Kubernetes clusters.
  · Manage compute resources effectively within Kubernetes clusters to optimize performance and cost.
  · Handle configuration and secret management for applications deployed in Kubernetes.
  · Troubleshoot and debug issues within Kubernetes clusters, including application failures, networking problems, and performance issues.
  · Integrate Kubernetes with CI/CD pipelines to automate the build, test, and deployment of containerized applications.
- Continuous Integration/Continuous Deployment (CI/CD): Familiarity with CI/CD pipelines and tools such as GitHub Actions (preferred), GitLab CI/CD, or Bitbucket Pipelines to automate software delivery and ensure reliable, efficient deployment workflows.
- Performance Optimization: Skills in identifying performance bottlenecks, analyzing system metrics, and implementing optimizations to improve the overall performance and efficiency of systems.
- Automation and Scripting: Strong scripting skills (e.g., Python, Bash) to automate repetitive tasks, streamline workflows, and improve operational efficiency. Familiarity with configuration management tools such as Ansible is also valuable.
- Security Best Practices: Understanding of security principles and best practices for securing systems and data, and experience implementing security measures in infrastructure and applications.
- Problem-Solving and Troubleshooting: Strong analytical and problem-solving skills to diagnose complex issues, identify root causes, and implement effective solutions that prevent recurrence.
- Communication and Collaboration: Excellent communication skills to collaborate effectively with cross-functional teams, including developers, operations, and business stakeholders, and to document processes, procedures, and incident resolutions.
- On-Call Support: Willingness and ability to participate in an on-call schedule to provide support during working hours for critical incidents and emergencies, ensuring the availability and scalability of systems and services.

Location

  • Lisbon, Portugal