Remote Job: Lead Site Reliability Engineer (GCP – Google Cloud) in South Africa
Hi,
I'm excited to share that one of our clients in the UK is hiring a Lead Site Reliability Engineer (GCP – Google Cloud) in South Africa. It's a fully remote job, and the details are below. If you're interested, please send your CV to apply.
Title: Lead Site Reliability Engineer (GCP – Google Cloud)
Location: South Africa
Duration: Permanent, full-time
Job Type: Fully Remote
As the Lead SRE (GCP – Google Cloud), you will drive reliability and scalability across production environments by leading a high-performing SRE team and implementing robust monitoring, automation, and DevOps practices on Google Cloud Platform.
Business Value
- Infrastructure Performance, Scaling & Optimization
- Observability & Incident Management
- Zero-Downtime Deployments & Rollback Reliability
- Secret Management & IAM Risk Mitigation
- Configuration Drift & Environment Parity
- Application-Level Performance & Engineering Quality
Key Responsibilities
- Own end-to-end system reliability, from cloud resource planning to code-level instrumentation.
- Review and improve backend code for performance, resiliency, and observability (e.g., retries, timeouts, connection pools, logging).
- Architect and scale multi-environment Kubernetes deployments (GKE preferred) for high availability and low drift.
- Define and enforce zero-downtime deployment strategies (canary, blue-green, progressive delivery).
- Collaborate with full-stack teams on release readiness, CI/CD quality gates, and infra-aware feature rollout.
- Harden secret management, IAM policies, and privilege boundaries across apps and services.
- Serve as a hands-on lead in incidents, root cause analysis, and long-term reliability improvements.
- Write and review Terraform modules, Helm charts, or platform tooling (Bash/Python/Go) as needed.
- Lead design reviews and cross-functional decisions that impact both product and platform reliability.
Requirements
- 6+ years of experience across full-stack development, SRE, or platform engineering.
- Proficiency in one or more backend stacks (e.g., Python/Django, Node/NestJS, Go, Java/Spring) and ability to review or contribute code.
- Strong expertise in Kubernetes (GKE preferred) and Helm, with the ability to optimize, secure, and debug real-world workloads.
- Strong command of Terraform and IaC workflows, ideally with Terraform Cloud and remote state strategy.
- Solid understanding of GCP or a similar cloud provider (IAM, VPCs, Cloud SQL, networking, Secret Manager, monitoring).
- Experience implementing or enforcing progressive delivery practices (ArgoCD, Flux, GitOps, CI/CD patterns).
- Proven ability to improve system observability using tools such as Datadog, Prometheus, and OpenTelemetry.
- Ability to “go deep” into an application repo, identify architectural flaws or infra misuse, and fix or guide others.
- Calm under pressure and experienced in incident management and postmortem culture.
Tools and Expectations
- Datadog: Monitor infrastructure health, capture service-level metrics, and reduce alert fatigue through high-signal thresholds.
- PagerDuty: Own the incident management pipeline. Route alerts by severity and align with business SLAs.
- GKE / Kubernetes: Improve cluster stability and workload isolation. Define auto-scaling configurations and tune for efficiency.
- Helm / GitOps (ArgoCD/Flux): Validate release consistency across clusters. Monitor sync status and rollout safety.
- Terraform Cloud: Support DR planning and detect infrastructure changes through state comparisons.
- Cloud SQL / Cloudflare: Diagnose DB and networking issues. Monitor latency, enforce access patterns, and validate WAF usage.
- Secret Management: Audit access to secrets, apply short-lived credentials, and define alerts for abnormal usage.