SRE (Linux, Firmware & Server Infrastructure)

Lanark
NP Group
£35,000 a year
Posted: 12h ago
Offer description

Contract: Senior Platform Reliability Engineer (Linux, Firmware & Server Infrastructure)

Location: Glasgow (Hybrid - 3 days onsite)

Duration: 6 months

Day Rate: Negotiable (Inside IR35 via umbrella)

Reference: 20460

Overview

We are seeking a Senior Platform Reliability Engineer with deep Linux systems expertise and strong exposure to server hardware, firmware, and low-level infrastructure operations. This role sits within a high-performing enterprise infrastructure team responsible for maintaining and improving the reliability of critical platforms at scale.

The position is heavily focused on resolving complex platform and hardware-related incidents, particularly those escalated from L3 support, with an emphasis on firmware life cycle management, disk encryption, logging, and server configuration (BIOS-level controls) across multi-vendor environments.

This is a hands-off hardware role, requiring strong remote troubleshooting capabilities, excellent communication skills, and the ability to work closely with internal teams and external vendors to drive issues through to resolution.

Key Responsibilities

* Own and manage end-to-end incident resolution for platform and hardware-related issues, including triage, mitigation, escalation, and post-incident review
* Diagnose and troubleshoot Linux OS-level issues arising from hardware faults, firmware changes, or configuration inconsistencies
* Manage and support firmware life cycle processes, including upgrades, validation, and issue remediation
* Work with disk encryption technologies and logging frameworks, ensuring system integrity and auditability
* Maintain and troubleshoot server configuration settings, including BIOS-level parameters across multiple hardware vendors (strong Dell focus)
* Use out-of-band management tools (e.g., iDRAC, iLO, RACADM, Redfish APIs) for remote diagnostics and recovery
* Analyse vendor logs, support bundles, and telemetry data to identify root causes and remediation paths
* Engage directly with hardware vendors and engineering teams, managing escalations and driving timely resolutions
* Contribute to continuous improvement initiatives, reducing incident recurrence and operational toil
* Produce and maintain high-quality documentation, including runbooks, troubleshooting guides, and knowledge base articles
* Participate in post-incident reviews (RCA) and support improvements in reliability metrics (MTTR, MTTD, SLOs)

Essential Skills & Experience

Strong Linux administration and troubleshooting expertise, including:

* Process and service management
* System logs and diagnostics
* Networking fundamentals
* Package and configuration management

Solid understanding of server hardware and infrastructure, including:

* Disks, RAID/HBA controllers
* NICs and firmware interactions
* Hardware failure modes and OS-level symptoms

Proven experience with:

* Firmware management and upgrades
* Disk encryption and secure configurations
* BIOS/server configuration management

Hands-on experience with remote management and lights-out technologies, such as:

* iDRAC, iLO
* RACADM
* Redfish or similar APIs

Strong track record of incident ownership, including:

* Triage and mitigation
* Cross-team coordination
* Stakeholder communication
* Driving issues through to resolution

Experience working with:

* Vendor diagnostics, logs, and support bundles
* Vendor escalation processes and engineering engagement

Communication and documentation:

* Excellent communication skills (written and verbal), with the ability to clearly articulate technical issues to both technical and non-technical stakeholders
* Strong documentation skills, including creation of runbooks, procedures, and RCA reports

Desirable Skills

* Scripting and automation experience (e.g., Python, Bash, Ansible)
* Familiarity with configuration management and automation frameworks
* Exposure to virtualisation and containerisation technologies (VMware, KVM, Docker, Kubernetes)
* Experience with monitoring, observability, and alerting systems, including log analysis and alert tuning
* Understanding of SRE principles and metrics, including SLOs, SLIs, error budgets, MTTR/MTTD

Key Attributes

* Methodical and detail-oriented approach to troubleshooting
* Strong sense of ownership and accountability
* Comfortable working in high-pressure, incident-driven environments
* Collaborative mindset with the ability to work across global teams and vendors
* Proactive approach to continuous improvement and operational excellence

Networking People (UK) is acting as an Employment Business in relation to this vacancy.
