Site Reliability Engineer

Location:

Contract Type:

Full Time

Sector:

Infrastructure & Telecommunications

Salary:

£48,000.00 - £50,000.00 Annual

Reference No.

BBBH479980

Remote Job: Lead Site Reliability Engineer (GCP – Google Cloud) in South Africa

Hi,

I'm excited to share that one of our clients in the UK is hiring a Lead Site Reliability Engineer (GCP – Google Cloud) in South Africa. It's a fully remote job. The job details are below. If you're interested, please send your CV to apply.

Title: Lead Site Reliability Engineer (GCP – Google Cloud)
Location: South Africa
Duration: Permanent, full-time
Job Type: Fully Remote

As the Lead SRE (GCP – Google Cloud), you will drive reliability and scalability across production environments by leading a high-performing SRE team and implementing robust monitoring, automation, and DevOps practices on Google Cloud Platform.

Business Value

Infrastructure Performance, Scaling & Optimization
Observability & Incident Management
Zero-Downtime Deployments & Rollback Reliability
Secret Management & IAM Risk Mitigation
Configuration Drift & Environment Parity
Application-Level Performance & Engineering Quality

Key Responsibilities

Own end-to-end system reliability, from cloud resource planning to code-level instrumentation.
Review and improve backend code for performance, resiliency, and observability (e.g., retries, timeouts, connection pools, logging).
Architect and scale multi-environment Kubernetes deployments (GKE preferred) for high availability and low drift.
Define and enforce zero-downtime deployment strategies (canary, blue-green, progressive delivery).
Collaborate with full-stack teams on release readiness, CI/CD quality gates, and infrastructure-aware feature rollouts.
Harden secret management, IAM policies, and privilege boundaries across apps and services.
Serve as a hands-on lead in incidents, root cause analysis, and long-term reliability improvements.
Write and review Terraform modules, Helm charts, or platform tooling (Bash/Python/Go) as needed.
Lead design reviews and cross-functional decisions that impact both product and platform reliability.

Requirements

6+ years of experience across full-stack development, SRE, or platform engineering.
Proficiency in one or more backend stacks (e.g., Python/Django, Node/NestJS, Go, Java/Spring) and ability to review or contribute code.
Strong expertise in Kubernetes (GKE preferred) and Helm, including the ability to optimize, secure, and debug real-world workloads.
Strong command of Terraform and IaC workflows, ideally with Terraform Cloud and remote state strategy.
Solid understanding of GCP or a similar cloud provider (IAM, VPCs, CloudSQL, networking, Secret Manager, monitoring).
Experience implementing or enforcing progressive delivery practices (ArgoCD, Flux, GitOps, CI/CD patterns).
Proven ability to improve system observability using tools like Datadog, Prometheus, OpenTelemetry.
Ability to “go deep” into an application repo, identify architectural flaws or infrastructure misuse, and fix them or guide others in fixing them.
Calm under pressure and experienced in incident management and postmortem culture.

Tools and Expectations

Datadog: Monitor infrastructure health, capture service-level metrics, and reduce alert fatigue through high-signal thresholds.
PagerDuty: Own the incident management pipeline. Route alerts by severity and align with business SLAs.
GKE / Kubernetes: Improve cluster stability and workload isolation. Define auto-scaling configurations and tune for efficiency.
Helm / GitOps (ArgoCD/Flux): Validate release consistency across clusters. Monitor sync status and rollout safety.
Terraform Cloud: Support DR planning and detect infrastructure changes through state comparisons.
CloudSQL / Cloudflare: Diagnose DB and networking issues. Monitor latency, enforce access patterns, and validate WAF usage.
Secret Management: Audit access to secrets, apply short-lived credentials, and define alerts for abnormal usage.
