

Using the W&M Kubernetes (K8s) cluster

NOTE: You must request access to the K8s cluster via hpchelp once you have your HPC account.

Overview

The Research Computing (RC) Kubernetes cluster is a newer resource compared to the traditional HPC/Slurm batch cluster. It was created to support research workflows that benefit from containerized environments, reproducibility, and flexible scaling. Singularity-based containers can be used on the W&M HPC/Slurm batch clusters, but they account for only a minority of jobs there. In Kubernetes, every workload runs in containers.

Both systems let you run scripts to launch workloads, but the style and structure differ:

  • HPC/Slurm: Jobs are typically submitted via shell scripts with directives (e.g., #SBATCH flags) that define resources and runtime behavior.

  • Kubernetes: Workloads are defined in YAML files, which describe the desired state of your application (resources, containers, and policies).

In short: HPC uses directives in scripts, while Kubernetes uses declarative configuration.
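
For a concrete contrast, here is an illustrative fragment of each style (these are sketches, not complete files):

# Slurm: directives embedded at the top of the submission script
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

# Kubernetes: the same kind of resource request, declared in a YAML manifest
resources:
  requests:
    cpu: "4"
    memory: 8Gi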


Why Use Kubernetes?

  • Containerization: Run software in reproducible environments without worrying about module systems or dependencies.

  • Scalability: Spin up multiple pods/jobs easily for parallel workloads.

  • Flexibility: Good for workflows involving services (databases, dashboards, ML model serving) that HPC batch queues don’t handle well.


Namespaces

Namespaces in Kubernetes provide a way to organize and isolate workloads, so different research groups or applications don't interfere with each other. A namespace can be thought of as an abstraction of a username; in fact, most namespaces in our K8s cluster are simply usernames. Our K8s cluster has two types of namespaces:

User namespaces
  • Named after your username (e.g., jdoe).
  • Intended for short-running jobs (maximum walltime is less than 5 days).
  • Restriction: you may only run Jobs here; standalone Pods are not allowed.
Project namespaces
  • Shared by a group of users working on the same project.
  • Suitable for production or collaborative workloads.
  • You can run both Pods and Jobs here, and Pods have no walltime restriction.
This separation helps ensure that testing and one-off jobs don’t interfere with shared project workloads.
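
Most kubectl commands accept a namespace flag, so you can point a command at either kind of namespace explicitly (jdoe and project1 below are placeholder names):

kubectl -n jdoe get jobs        # your user namespace: Jobs only
kubectl -n project1 get pods    # a shared project namespace: Pods allowed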

Pods vs. Jobs

In most Kubernetes clusters:

  • Pods are the fundamental unit. A pod usually runs one container (though it can run multiple tightly coupled ones). Think of a pod as “one compute job” on HPC, but without a built-in runtime limit.

  • Jobs in Kubernetes are a higher-level object that manages pods. They ensure that a task runs to completion—restarting pods if necessary. A Job is closer to an HPC batch job, where the system ensures your work finishes, even if something fails mid-way.

Key differences to note:

  • Like jobs on the Slurm batch clusters, K8s Jobs have a maximum walltime (currently 5 days).

  • In Slurm, the scheduler directly allocates nodes/cores. In Kubernetes, you request resources (CPU, memory, GPU) via YAML, and the scheduler fits your pod/job onto available nodes.

Here are two examples that do the same work: the first defines a Pod, the second a Job. The two objects do nearly the same thing:

  • Creates a Pod named python-pod in the project1 namespace.

  • Runs a single container named python.

  • Pulls the image laudio/pyodbc (default registry: docker.io).

  • Starts the container with /bin/sh -ec "sleep 300" (execute command string; exit on any error).

  • Sets restartPolicy: OnFailure (restart only if the container exits non-zero).

  • Runs processes as UID 1719 and GID 1121 via the Pod securityContext (use the Linux command id to find your UID and GID).

  • Requests at least 8 GiB RAM, 2 CPU cores, and 1 GPU (for scheduling).

  • Limits the container to 16 GiB RAM, 4 CPU cores, and 1 GPU (hard caps).

  • Mounts NFS volume data10 from server lunar path /sciclone/data10/ewalter to /tmp/mydata in the container.

  • Mounts NFS volume scr10 from server scr10 path /sciclone/scr10/ewalter to /tmp/my10 in the container.

  • Declares both NFS volumes under spec.volumes and references them under volumeMounts.

Example pod:

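Here is a sketch of that pod.yml, reconstructed from the summary above; treat it as illustrative (in particular, nvidia.com/gpu as the GPU resource name is an assumption about how GPUs are exposed on this cluster). Saved as webpod.yml, it matches the kubectl commands shown later:

apiVersion: v1
kind: Pod
metadata:
  name: python-pod
  namespace: project1
spec:
  restartPolicy: OnFailure      # restart only on non-zero exit
  securityContext:
    runAsUser: 1719             # your UID (find it with: id)
    runAsGroup: 1121            # your GID
  containers:
    - name: python
      image: laudio/pyodbc      # pulled from docker.io by default
      command: ["/bin/sh", "-ec", "sleep 300"]
      resources:
        requests:               # minimums used for scheduling
          memory: 8Gi
          cpu: "2"
          nvidia.com/gpu: "1"
        limits:                 # hard caps
          memory: 16Gi
          cpu: "4"
          nvidia.com/gpu: "1"
      volumeMounts:
        - name: data10
          mountPath: /tmp/mydata
        - name: scr10
          mountPath: /tmp/my10
  volumes:
    - name: data10
      nfs:
        server: lunar
        path: /sciclone/data10/ewalter
    - name: scr10
      nfs:
        server: scr10
        path: /sciclone/scr10/ewalter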

For the job example, here is a summary of how it differs from a pod definition:

 

  • Uses kind: Job (apiVersion: batch/v1) instead of a Pod; a controller manages Pods to ensure completion.

  • Job wraps the Pod under spec.template (Pod spec lives inside a template).

  • Job adds activeDeadlineSeconds: 60 — kills the Job if it hasn’t finished within 60s.

  • Job adds ttlSecondsAfterFinished: 30 — automatically deletes Job resources ~30s after it finishes.

  • Pod name is generated from the Job (e.g., python-job-xxxxx) rather than being fixed like python-pod.

  • Job defaults to retrying failed Pods (backoff, up to a limit) until success or deadline; a plain Pod does not have Job-style retries.

  • Valid Pod restart policies for a Job are OnFailure or Never (and are set under template.spec), while a standalone Pod can use other patterns but won’t have Job semantics.

  • Job tracks completion status (succeeded/failed counts, conditions), which Pods alone don’t aggregate. 

 

Example job:

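And a sketch of the corresponding job.yml, reconstructed from the differences listed above; the Pod spec sits inside spec.template, and the container, resources, and volumes from the Pod example carry over unchanged. Saved as webjob.yml, it matches the kubectl commands shown later:

apiVersion: batch/v1
kind: Job
metadata:
  name: python-job
  namespace: project1
spec:
  activeDeadlineSeconds: 60     # kill the Job if it has not finished within 60s
  ttlSecondsAfterFinished: 30   # delete the Job ~30s after it finishes
  template:
    spec:
      restartPolicy: OnFailure  # Jobs allow only OnFailure or Never
      securityContext:
        runAsUser: 1719
        runAsGroup: 1121
      containers:
        - name: python
          image: laudio/pyodbc
          command: ["/bin/sh", "-ec", "sleep 300"]
      # resources, volumeMounts, and volumes from the Pod example
      # are added here under template.spec in exactly the same way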

Using kubectl - the main command for k8s 

In Kubernetes, almost everything you do as a user goes through kubectl - it’s the command-line interface that talks to the Kubernetes API server.

To create the Pod:
[36 ewalter@cm ~ ]$ kubectl apply -f webpod.yml
pod/python-pod created

To check the status of the pod:
[37 ewalter@cm ~ ]$ kubectl get pod
NAME       READY STATUS  RESTARTS AGE
python-pod 1/1   Running 0        7s

To open a bash shell in the pod:
kubectl exec -it python-pod -- /bin/bash

 

 

To create the Job:
kubectl apply -f webjob.yml

To check the status of either the Pod or the Job:
kubectl get pods
kubectl get jobs

To get info about (describe) the Pod or Job (remember, python-pod is what we named the Pod in its metadata section, and python-job was used for the Job):
kubectl describe pod python-pod
kubectl describe job python-job

To view the pod logs
kubectl logs python-pod

To view the job logs
kubectl logs job/python-job


Accessing storage directories on K8s


NFS mount - users can mount any directory from the Slurm batch cluster filesystems into their K8s Pods/Jobs (see the sketch below).

Example:
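A minimal sketch of an NFS mount inside a Pod spec, following the same pattern as the Pod example above (the server name and path are placeholders; substitute the filesystem and username that apply to you):

# under spec.volumes:
volumes:
  - name: mydata
    nfs:
      server: lunar                     # NFS server
      path: /sciclone/data10/ewalter    # directory to mount

# under the container's volumeMounts:
volumeMounts:
  - name: mydata
    mountPath: /tmp/mydata              # path inside the container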

PVC/PV - request storage via a PersistentVolumeClaim (PVC). Good for transient/scratch data that will eventually be deleted.

Note that PVC storage is not accessible from the Slurm batch cluster.

Example:
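A minimal sketch of a PVC (the name, namespace, and size are placeholders; storageClassName is omitted here, which assumes the cluster defines a default storage class):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch-pvc
  namespace: project1
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi

Once the claim is bound, reference it from a Pod or Job under spec.volumes:

volumes:
  - name: scratch
    persistentVolumeClaim:
      claimName: scratch-pvc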

Using Python on K8s

  • Use prebuilt Python images from a public registry such as docker.io, the default (see the sketch below).

  • Install Miniconda into an NFS or PVC directory and use it from your Pods.

  • Build your own custom image and push it to a registry: with Docker/Podman to Docker Hub, or with GitHub Actions to GHCR.
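
As a sketch of the first option, a minimal Pod that runs a prebuilt image straight from docker.io (the pod name, namespace, and image tag are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: python-demo
  namespace: project1
spec:
  restartPolicy: Never
  containers:
    - name: python
      image: python:3.12-slim   # official prebuilt Python image on docker.io
      command: ["python", "-c", "print('hello from k8s')"]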

Accessing a Pod from a local browser

Use a proxy plus a tunnel: forward the Pod's port with kubectl on the cluster, then reach it through an SSH tunnel from your workstation, as sketched below.
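
One common pattern, as a sketch: it assumes the Pod serves HTTP on port 8888, and the pod name, port, and hostname below are placeholders.

On the machine where you run kubectl:
kubectl port-forward pod/python-pod 8888:8888

From your local machine, in a second terminal:
ssh -L 8888:localhost:8888 username@cluster-frontend

Then point your local browser at http://localhost:8888.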

 

 

 

