Job Overview:
As a Cloud Engineer – Operations, you will be responsible for maintaining cloud infrastructure to ensure the reliability, availability, and performance of cloud-based services. You will work closely with cross-functional teams to support, troubleshoot, and optimize cloud environments, leveraging Azure DevOps, automation, and best practices for efficient operations.
Key Responsibilities:
Cloud Infrastructure Management:
- Follow Standard Operating Procedures (SOPs) to manage cloud infrastructure on Azure.
- Ensure the scalability, availability, and performance of cloud-based services.
- Implement Infrastructure as Code (IaC) practices using Terraform and Azure Resource Manager (ARM) templates.
Monitoring & Service Management:
- Implement and manage monitoring solutions (e.g., Prometheus, Grafana, Azure Monitoring Service).
- Proactively identify and address performance issues using logs and metrics.
- Handle incident response, troubleshooting, and problem resolution within SLA timelines.
- Perform change management, problem management, and represent changes in CAB (Change Advisory Board).
- Gather and analyze system metrics to assist in performance tuning and fault resolution.
Automation & Scripting:
- Develop and maintain automation scripts for cloud operations using Python, Bash, PowerShell, and Go.
- Implement configuration management tools (e.g., Ansible, Puppet, Terraform) to ensure consistent and automated system configurations.
- Leverage Azure DevOps pipelines for continuous deployment and automation.
Security & Compliance:
- Implement and enforce security best practices for cloud environments.
- Ensure compliance with industry standards, regulatory requirements, and security frameworks (e.g., CIS Benchmarks, ITIL, NIST).
- Manage and secure Public Key Infrastructure (PKI) for cloud-based authentication and encryption.
Collaboration & Communication:
- Work closely with DevOps, security, and development teams to align cloud infrastructure with business requirements.
- Provide technical guidance to support teams and cross-functional stakeholders.
- Participate in cross-functional team discussions to solve technical challenges.
Performance Optimization:
- Identify and implement cost and performance optimizations for cloud resources.
- Conduct capacity planning, performance assessments, and system benchmarking.
- Proactively address performance bottlenecks and infrastructure improvements.
Disaster Recovery & Backup:
- Design and implement disaster recovery (DR) strategies for cloud environments.
- Conduct regular DR testing to validate backup and recovery processes.
- Implement backup policies ensuring business continuity in case of failures.
Documentation & Reporting:
- Create and maintain comprehensive documentation for cloud configurations, processes, and procedures.
- Generate operational reports on system health, performance metrics, and incidents.
Qualifications & Requirements:
- Bachelor’s degree in Computer Science, Information Technology, or related field.
- 6+ years of experience in Cloud Operations Engineering or a similar role.
- Strong expertise in Azure Cloud and experience with other platforms like AWS or GCP is a plus.
- Proficiency in scripting & automation (Python, Bash, PowerShell, Go).
- Experience working with Azure DevOps, GitLab CI/CD pipelines, and automation workflows.
- Hands-on experience with Infrastructure as Code (IaC) tools such as Terraform, ARM templates.
- Familiarity with containerization & orchestration (Docker, Kubernetes, Helm, ArgoCD).
- Experience with monitoring & logging tools (Prometheus, Grafana, Azure Monitoring Service, ELK Stack, Graylog/Loki).
- Strong knowledge of cloud security best practices.
- Problem-solving mindset with strong troubleshooting skills.
- Excellent communication & collaboration skills to work effectively in cross-functional teams.
- Azure Certifications: Microsoft Certified: Azure Administrator, Azure Solutions Architect, ITIL (preferred).