HPC Engineer

Posted Date 8 months ago
Location Saudi Arabia
Discipline Information Technology
Job Reference 31913
Job Title: HPC Engineer
Location: Riyadh, Saudi Arabia
Role Type: Permanent

The HPC Engineer works with research scientists and engineers and collaborates with technical leadership on the design, development, installation, and maintenance of software for High Performance Computing (HPC) systems. The HPC Engineer is responsible for supporting the planning, implementation, availability, performance, security, maintenance, and repair of high-performance computing infrastructure.

• Support day-to-day operations for the ML/AI team by monitoring computing resource performance, managing configurations, and handling security administration.
• Apply revisions to system firmware and software.
• Engage and collaborate with vendors to assist with support activities as required.
• Develop new HPC software deployment plans, custom scripts, and testing procedures to ensure operational reliability for the AI researchers.
• Design, install, configure, and document the cluster infrastructure, including operating systems, job schedulers, resource managers, provisioning managers, configuration managers, network devices, and other components of the HPC environment.
• Explore emerging technologies and technical developments to address expanding ML/AI requirements.
• Identify new services and develop implementation plans.
• Stay current with best practices in the HPC field.

• 3+ years of experience designing and architecting Linux environments (specifically Linux HPC).
• Bachelor's degree in computer science, software engineering, or a related field.
• Experience in managing/administering Linux and Windows server environments for scientific computing.
• Understanding of GPU and accelerator technologies.
• Experience managing large numbers of servers.
• Experience with HPC cluster job schedulers such as Slurm or LSF.
• Working knowledge of cluster configuration management tools such as Ansible, Puppet, or Salt.
• Knowledge of some of the following: Kubernetes, GitLab, CI/CD, Docker, Grafana, Prometheus, etc.