High performance computing

NOTE: We are currently trialing this service with users so that we can make it as accommodating and secure as possible. This means that items concerning the service, including this documentation, are subject to change. We will do our best to keep everyone updated and notified of changes as they come.

Introduction

At the OCF we offer a High Performance Computing (HPC) service for individuals and groups that need to run computationally demanding software. We currently have one main HPC server; however, we have plans to expand the cluster to make full use of the resources at our disposal.

Gaining Access

In order to access the HPC cluster, please send an access request to help@ocf.berkeley.edu. Make sure to include your OCF username or group account name and a detailed technical description of the projects you plan to run on our HPC infrastructure, including the nature of the software you will be running and the amount of computational resources you expect to need.

Connecting

Once you have submitted your request and been approved for access, you will be able to connect to our Slurm master node via SSH by running the following command:

ssh my_ocf_username@hpcctl.ocf.berkeley.edu

If you have trouble connecting, please contact us at help@ocf.berkeley.edu, or come to staff hours when the lab is open and chat with us in person. We also have a #hpc_users channel on Slack and IRC where you can ask questions and talk to us about anything HPC.

The Cluster

As of Fall 2023, the OCF HPC cluster is composed of one server, with the following specifications:

  • 2 Intel Xeon Platinum 8352Y CPUs (32c/64t @ 2.4GHz)
  • 4 NVIDIA RTX A6000 GPUs
  • 256GB ECC DDR4-3200 RAM

The current hardware was funded with our ASUC budget, and the GPUs were gifted by NVIDIA through the [NVIDIA Academic Hardware Grant Program](https://developer.nvidia.com/higher-education-and-research).

Slurm

We currently use Slurm as the workload manager for our cluster. Slurm is a free and open source job scheduler that distributes jobs across an HPC cluster, where each computer in the cluster is referred to as a node. The only way to access our HPC nodes is through Slurm.

Detailed documentation for how to access Slurm is here.
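
As a quick illustration, the command below asks Slurm to run hostname on a compute node in the ocf-hpc partition (the partition used in the examples later on this page); the output tells you which node the job ran on:

srun --partition=ocf-hpc hostname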

Dependencies

For managing application dependencies, you currently have two options:

Virtual Environments

First, if your dependencies are Python packages, you can use a virtual environment. To create one, navigate to your home directory and run the following commands:

virtualenv -p python3 venv
. venv/bin/activate

This will allow you to pip install any Python packages that the OCF does not already have for your program.
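
For example, with the virtual environment activated, you could install a package and run a script that uses it; numpy and my_script.py below are placeholders for your own dependencies and code:

# install a dependency into the active virtual environment
pip install numpy
# run your code with the environment's Python interpreter
python my_script.py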

Singularity

For those who need access to non-Python dependencies or have already integrated their program into Docker, the second option is to use Singularity containers. Singularity is a containerization platform developed at Lawrence Berkeley National Laboratory that is designed specifically for HPC environments. To read more about the benefits of Singularity you can look here. We suggest a particular workflow, which will help simplify deploying your program on our infrastructure.

Installing

We recommend that you do your development on our HPC infrastructure, but you can also develop on your own machine if you would like. If you are running an apt-based Linux distribution such as Debian or Ubuntu, you can install Singularity from the official apt repositories:

sudo apt install singularity-container

If you are not running an apt-based Linux distribution, installation instructions can be found here. Otherwise, if you are running macOS you can look here, or Windows here.
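
Regardless of platform, you can verify the installation by checking the Singularity version:

singularity --version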

Building Your Container

singularity build --sandbox ./my_container docker://ubuntu

This will create a sandboxed Singularity container named my_container. If you are working on our infrastructure, you will not be able to install non-pip packages in your container, because you do not have root privileges.

If you would like to create your own container with new packages, you must create the container on your own machine, using the above command with sudo prepended, and then transfer it over to our infrastructure.
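
As a sketch of what that looks like on your own machine (where you have root), you might build the sandbox and then install a system package into it; ffmpeg below is only an example:

# build a writable sandbox container from the official Ubuntu image
sudo singularity build --sandbox ./my_container docker://ubuntu
# install system packages into the sandbox as root
sudo singularity exec --writable ./my_container apt-get update
sudo singularity exec --writable ./my_container apt-get install -y ffmpeg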

The docker://ubuntu option tells Singularity to bootstrap the container from the official Ubuntu image on Docker Hub. There is also a Singularity Hub, from which you can pull Singularity images directly in a similar fashion. We also have some pre-built containers that you may use to avoid having to build your own. They are currently located at /home/containers on the Slurm master node.
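
To see which pre-built containers are currently available, you can list that directory from the Slurm master node:

ls /home/containers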

Using Your Container

singularity shell my_container

The above command will drop you into a shell inside your container. By default, your home directory inside the container is bind-mounted to your real home directory outside the container environment, which saves you from having to transfer files in and out of the container.
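
If you also need a directory outside your home to be visible inside the container, you can bind-mount it with the --bind option; the paths below are purely illustrative:

singularity shell --bind /path/to/data:/data my_container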

singularity exec --nv my_container ./my_executable.sh

This command will open your container and run the my_executable.sh script in the container environment. The --nv option allows the container to interface with the GPUs. This form is useful with srun, since it lets you run your program in a single command.

Working on HPC Infrastructure

If you were using a sandboxed container for testing, we suggest you convert it to a Singularity image file. This is because images are more portable and easier to interact with than sandboxed containers. You can make this conversion using the following command:

sudo singularity build my_image.simg ./my_sandboxed_container

If you were working on the image on your own computer, you can transfer it over to your home directory on our infrastructure using the following command:

scp my_image.simg my_ocf_username@hpcctl.ocf.berkeley.edu:~/

To actually submit a Slurm job that uses your Singularity container and runs your script my_executable.sh, run the following command:

srun --gres=gpu --partition=ocf-hpc singularity exec --nv my_image.simg ./my_executable.sh

This will submit a Slurm job that runs your executable on the ocf-hpc Slurm partition. The --gres=gpu option requests a GPU allocation for your job; Slurm uses these requests to share a node's GPUs between multiple users, so it is important to include it. Without it, you will not be able to interface with the GPUs.
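
If you would rather submit your job as a batch script than run it interactively with srun, a minimal sbatch script might look like the following sketch; the job name, script name, and output file are placeholders, while the partition and GPU request match the srun example above:

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=ocf-hpc
#SBATCH --gres=gpu
#SBATCH --output=my_job.out

# run the containerized executable, as in the srun example above
singularity exec --nv my_image.simg ./my_executable.sh

Save this as my_job.sh, submit it with sbatch my_job.sh, and check on its status with squeue.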