JOB DETAILS
Requirements
- 5+ years of hands-on experience in DevOps, SRE, or Cloud Engineering
- Extensive expertise in AWS cloud platforms and services
- Practical experience with Kubernetes and containerisation technologies
- Strong scripting and automation skills with Bash, Python, or Go
- In-depth knowledge of CI/CD tools including Jenkins, GitHub Actions, GitLab CI/CD, and Argo CD
- Solid experience with Infrastructure as Code tools including Terraform and CloudFormation
- Comprehensive understanding of Linux administration and networking fundamentals
- Experience implementing security best practices including IAM and SSL/TLS, and working with compliance standards and regulations such as SOC 2, ISO 27001, and GDPR
- Proficiency in monitoring and logging tools such as the ELK Stack, Prometheus, Grafana, or Datadog
- Exceptional problem-solving skills and the ability to operate in a fast-moving, ambiguous environment
- Strong communication and collaboration skills to work effectively with cross-functional teams, including client stakeholders
Responsibilities
- Architect, build, and continuously enhance CI/CD pipelines to automate and accelerate software delivery across the team
- Lead the management and optimisation of cloud infrastructure (AWS), ensuring scalability, security, and reliability while championing best practices
- Design, implement, and maintain Infrastructure as Code (IaC) with tools such as Terraform and CloudFormation, enabling the team to deploy with confidence and agility
- Proactively monitor, troubleshoot, and enhance system performance, availability, and security, ensuring operational excellence across client environments
- Drive the adoption of containerisation and orchestration technologies like Docker and Kubernetes to enable scalable, high-performance solutions
- Improve system observability by implementing advanced logging, monitoring, and alerting with tools such as Prometheus, Grafana, Datadog, CloudWatch, and the ELK Stack
- Lead the implementation of security best practices, including IAM, secrets management, and vulnerability assessments
- Collaborate closely with developers to continuously optimise build, deployment, and scaling strategies for seamless integration and continuous delivery
- Automate key operational tasks and apply SRE principles to enhance system reliability, uptime, and overall performance
- Take ownership of incident response and lead root cause analysis for production issues, ensuring swift resolution and ongoing improvement
- Practise LLMOps: implement prompt versioning, model evaluation pipelines, and controlled promotion gates before anything reaches production
- Instrument beyond standard metrics: design observability for token costs, inference latency, retrieval quality, and model drift detection
- Build agentic resilience: implement rate limiting, circuit breakers, and graceful fallbacks for non-deterministic LLM APIs
- Own inference cost engineering: design throughput management, caching strategy, and cost-per-query alerting to keep AI systems economically viable at scale
- Design AI-native CI/CD pipelines in which evaluation harnesses and golden dataset regression tests run before any model or prompt change reaches production
Desired Qualifications
- Familiarity with serverless architectures such as AWS Lambda
- Experience with database performance tuning and scaling techniques
- Relevant DevOps certifications from AWS, Azure, or GCP
- Prior experience supporting AI or ML workloads in production environments
- Familiarity with LLM observability tooling such as LangSmith, Weave, or similar