Senior DevOps Engineer

Full Time Posted 2 days ago

Closes: August 29, 2026

Employment Info

Industry Developer/Engineer
Job type Full Time
Deadline 29/08/2026
Updated 30/05/2026
Location Sydney NSW

JOB DETAILS

Requirements

5+ years of hands-on experience in DevOps, SRE, or Cloud Engineering
Extensive expertise in AWS cloud platforms and services
Practical experience with Kubernetes and containerisation technologies
Strong scripting and automation skills with Bash, Python, or Go
In-depth knowledge of CI/CD tools including Jenkins, GitHub Actions, GitLab CI/CD, and Argo CD
Solid experience with Infrastructure as Code tools including Terraform and CloudFormation
Comprehensive understanding of Linux administration and networking fundamentals
Experience implementing security best practices, including IAM and SSL/TLS, and working with compliance standards and regulations such as SOC 2, ISO 27001, and GDPR
Proficiency in monitoring and logging tools such as the ELK Stack, Prometheus, Grafana, or Datadog
Exceptional problem-solving skills and the ability to operate in a fast-moving, ambiguous environment
Strong communication and collaboration skills to work effectively with cross-functional teams and client stakeholders

Responsibilities

Architect, build, and continuously enhance CI/CD pipelines to automate and accelerate software delivery across the team
Lead the management and optimisation of AWS cloud infrastructure, ensuring scalability, security, and reliability while championing best practices
Design, implement, and maintain Infrastructure as Code (IaC) with tools such as Terraform and CloudFormation, enabling the team to deploy with confidence and agility
Proactively monitor, troubleshoot, and enhance system performance, availability, and security, ensuring operational excellence across client environments
Drive the adoption of containerisation and orchestration technologies like Docker and Kubernetes to enable scalable, high-performance solutions
Improve system observability by implementing advanced logging, monitoring, and alerting with tools such as Prometheus, Grafana, Datadog, CloudWatch, and the ELK Stack
Lead the implementation of security best practices, including IAM, secrets management, and vulnerability assessments
Collaborate closely with developers to continuously optimise build, deployment, and scaling strategies for seamless integration and continuous delivery
Automate key operational tasks and apply SRE principles to enhance system reliability, uptime, and overall performance
Take ownership of incident response and lead root cause analysis for production issues, ensuring swift resolution and ongoing improvement
Practise LLMOps: implement prompt versioning, model evaluation pipelines, and controlled promotion gates before anything reaches production
Instrument beyond standard metrics: design observability for token costs, inference latency, retrieval quality, and model drift detection
Build agentic resilience: implement rate limiting, circuit breakers, and graceful fallbacks for non-deterministic LLM APIs
Own inference cost engineering: design throughput management, caching strategy, and cost-per-query alerting to keep AI systems economically viable at scale
Design AI-native CI/CD pipelines with built-in evaluation harnesses and golden dataset regression tests that run before any model or prompt change reaches production