Using the W&M Kubernetes (K8s) cluster
NOTE: You must request access to the k8s cluster via hpchelp once you have your hpc account.
Overview
The Research Computing (RC) Kubernetes cluster is a newer resource compared to the traditional HPC/Slurm batch cluster. It was created to support research workflows that benefit from containerized environments, reproducibility, and flexible scaling. Singularity-based containers can be used on the W&M HPC/Slurm batch clusters, but they account for only a minority of jobs there; in Kubernetes, containers are used in all workloads.
Both systems let you run scripts to launch workloads, but the style and structure differ:
- HPC/Slurm: Jobs are typically submitted via shell scripts with directives (e.g., #SBATCH flags) that define resources and runtime behavior.
- Kubernetes: Workloads are defined in YAML files, which describe the desired state of your application (resources, containers, and policies).
In short: HPC uses directives in scripts, while Kubernetes uses declarative configuration.
Why Use Kubernetes?
- Containerization: Run software in reproducible environments without worrying about module systems or dependencies.
- Scalability: Spin up multiple pods/jobs easily for parallel workloads.
- Flexibility: Good for workflows involving services (databases, dashboards, ML model serving) that HPC batch queues don't handle well.
Namespaces
Namespaces in Kubernetes provide a way to organize and isolate workloads so that different research groups or applications don't interfere with each other. A namespace is essentially an abstraction of a username; in fact, most namespaces in our K8s cluster are simply usernames. Our K8s cluster has two types of namespaces:
User namespaces
- Named after your username (e.g., jdoe).
- Intended for short-running jobs (maximum walltime must be less than 5 days).
- Restriction: you may only run Jobs here; Pods are not allowed.
Project namespaces
- Shared by a group of users working on the same project.
- Suitable for production or collaborative workloads.
- You can run both Pods and Jobs here, and Pods have no restriction on walltime.
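Most kubectl commands operate on a namespace; a manifest can set it under metadata.namespace (as in the examples below), or you can pass it on the command line. A short sketch, where jdoe and project1 are just the example names used above (substitute your own):
kubectl get jobs -n jdoe        # list Jobs in your user namespace
kubectl get pods -n project1    # list Pods in a shared project namespace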
Pods vs. Jobs
In most Kubernetes clusters:
- Pods are the fundamental unit. A pod usually runs one container (though it can run multiple tightly coupled ones). Think of a pod as "one compute job" on HPC, but without a built-in runtime limit.
- Jobs in Kubernetes are a higher-level object that manages pods. They ensure that a task runs to completion, restarting pods if necessary. A Job is closer to an HPC batch job, where the system ensures your work finishes even if something fails midway.
Key differences to note:
- Like the Slurm/batch clusters, Jobs have a maximum walltime (currently 5 days for K8s Jobs).
- In Slurm, the scheduler directly allocates nodes/cores. In Kubernetes, you request resources (CPU, memory, GPU) via YAML, and the scheduler fits your pod/job onto available nodes.
Here are two examples that do nearly the same thing: the first defines a Pod, the second a Job. The Pod example:
- Creates a Pod named python-pod in the project1 namespace.
- Runs a single container named python.
- Pulls the image laudio/pyodbc (default registry: docker.io).
- Starts the container with /bin/sh -ec "sleep 300" (execute command string; exit on any error).
- Sets restartPolicy: OnFailure (restart only if the container exits non-zero).
- Runs processes as UID 1719 and GID 1121 via the Pod securityContext (use the Linux command id to find your UID and GID).
- Requests at least 8 GiB RAM, 2 CPU cores, and 1 GPU (for scheduling).
- Limits the container to 16 GiB RAM, 4 CPU cores, and 1 GPU (hard caps).
- Mounts NFS volume data10 from server lunar, path /sciclone/data10/ewalter, to /tmp/mydata in the container.
- Mounts NFS volume scr10 from server scr10, path /sciclone/scr10/ewalter, to /tmp/my10 in the container.
- Declares both NFS volumes under spec.volumes and references them under volumeMounts.
Example pod:
(click here to download text version)
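A sketch of what this Pod manifest might look like, reconstructed from the bullet points above (the GPU resource name nvidia.com/gpu is an assumption based on the NVIDIA device plugin; substitute your own UID/GID, namespace, and NFS paths):

apiVersion: v1
kind: Pod
metadata:
  name: python-pod
  namespace: project1
spec:
  restartPolicy: OnFailure          # restart only if the container exits non-zero
  securityContext:
    runAsUser: 1719                 # your UID (run `id` to find it)
    runAsGroup: 1121                # your GID
  containers:
  - name: python
    image: laudio/pyodbc            # pulled from docker.io by default
    command: ["/bin/sh", "-ec", "sleep 300"]
    resources:
      requests:                     # minimums used for scheduling
        memory: 8Gi
        cpu: 2
        nvidia.com/gpu: 1
      limits:                       # hard caps
        memory: 16Gi
        cpu: 4
        nvidia.com/gpu: 1
    volumeMounts:
    - name: data10
      mountPath: /tmp/mydata
    - name: scr10
      mountPath: /tmp/my10
  volumes:
  - name: data10
    nfs:
      server: lunar
      path: /sciclone/data10/ewalter
  - name: scr10
    nfs:
      server: scr10
      path: /sciclone/scr10/ewalter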

For the job example, here is a summary of how it differs from a pod definition:
- Uses kind: Job (apiVersion: batch/v1) instead of a Pod; a controller manages Pods to ensure completion.
- The Job wraps the Pod under spec.template (the Pod spec lives inside a template).
- The Job adds activeDeadlineSeconds: 60, which kills the Job if it hasn't finished within 60 seconds.
- The Job adds ttlSecondsAfterFinished: 30, which automatically deletes Job resources ~30 seconds after it finishes.
- The Pod name is generated from the Job (e.g., python-job-xxxxx) rather than being fixed like python-pod.
- A Job defaults to retrying failed Pods (backoff, up to a limit) until success or deadline; a plain Pod does not have Job-style retries.
- Valid Pod restart policies for a Job are OnFailure or Never (and are set under template.spec), while a standalone Pod can use other patterns but won't have Job semantics.
- A Job tracks completion status (succeeded/failed counts, conditions), which Pods alone don't aggregate.
Example job:
(click here to download text version)
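Similarly, a sketch of the Job manifest reconstructed from the points above (same assumptions as the Pod sketch):

apiVersion: batch/v1
kind: Job
metadata:
  name: python-job
  namespace: project1
spec:
  activeDeadlineSeconds: 60         # kill the Job if it has not finished within 60s
  ttlSecondsAfterFinished: 30       # delete the Job's resources ~30s after it finishes
  template:                         # the Pod spec lives inside this template
    spec:
      restartPolicy: OnFailure      # OnFailure or Never are the valid choices for a Job
      securityContext:
        runAsUser: 1719
        runAsGroup: 1121
      containers:
      - name: python
        image: laudio/pyodbc
        command: ["/bin/sh", "-ec", "sleep 300"]
        resources:
          requests:
            memory: 8Gi
            cpu: 2
            nvidia.com/gpu: 1
          limits:
            memory: 16Gi
            cpu: 4
            nvidia.com/gpu: 1
        volumeMounts:
        - name: data10
          mountPath: /tmp/mydata
        - name: scr10
          mountPath: /tmp/my10
      volumes:
      - name: data10
        nfs:
          server: lunar
          path: /sciclone/data10/ewalter
      - name: scr10
        nfs:
          server: scr10
          path: /sciclone/scr10/ewalter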

Using kubectl - the main command for k8s
In Kubernetes, almost everything you do as a user goes through kubectl - it’s the command-line interface that talks to the Kubernetes API server.
To create the Pod:
[36 ewalter@cm ~ ]$ kubectl apply -f webpod.yml
pod/python-pod created
To check the status of the pod:
[37 ewalter@cm ~ ]$ kubectl get pod
NAME READY STATUS RESTARTS AGE
python-pod 1/1 Running 0 7s
To open a bash shell in the pod:
kubectl exec -it python-pod -- /bin/bash
To create the Job:
kubectl apply -f webjob.yml
To check the status of either:
kubectl get pods
or:
kubectl get jobs
To get info (describe) about the pod or job (remember, python-pod is what we named the pod in the pod metadata section, and python-job was used for the job)
kubectl describe pod python-pod
kubectl describe job python-job
To view the pod logs
kubectl logs python-pod
To view the job logs
kubectl logs job/python-job
Special keywords
Accessing storage directories on K8s
NFS mount - users are able to mount any directory from the Slurm batch cluster into their K8s pods/jobs.
Example:
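A minimal sketch of mounting an NFS directory into a Pod, reusing the server and path from the Pod example above (the pod name nfs-test-pod and the busybox image are placeholders; substitute your own directory):

apiVersion: v1
kind: Pod
metadata:
  name: nfs-test-pod               # placeholder name
  namespace: project1
spec:
  restartPolicy: Never
  containers:
  - name: shell
    image: busybox                 # assumed small test image
    command: ["/bin/sh", "-ec", "ls /tmp/mydata && sleep 60"]
    volumeMounts:
    - name: data10
      mountPath: /tmp/mydata
  volumes:
  - name: data10
    nfs:
      server: lunar                     # NFS server from the Pod example above
      path: /sciclone/data10/ewalter    # replace with your own directory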
PVC/PV - request storage via a PersistentVolumeClaim. Good for transient/scratch data that will eventually be deleted. (how long does it last?)
Not accessible from the Slurm batch cluster.
Example:
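A minimal sketch of a PersistentVolumeClaim and a Pod that mounts it (the claim name scratch-pvc, the 50Gi size, and the default storage class are assumptions; check with hpchelp for the cluster's actual storage classes and retention policy):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch-pvc                # placeholder name
  namespace: project1
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi                # assumed size; request what you need
---
apiVersion: v1
kind: Pod
metadata:
  name: pvc-test-pod               # placeholder name
  namespace: project1
spec:
  restartPolicy: Never
  containers:
  - name: shell
    image: busybox                 # assumed small test image
    command: ["/bin/sh", "-ec", "df -h /scratch && sleep 60"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    persistentVolumeClaim:
      claimName: scratch-pvc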
Using Python on K8s
- Use prebuilt Python images (where)
- Install Miniconda into an NFS or PVC directory and use it in pods
- Build your own custom image (see the sketch after this list):
  - Docker/Podman push to Docker Hub
  - GitHub Actions push to GHCR
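A sketch of building an image with Podman and pushing it to Docker Hub (assumes a Dockerfile in the current directory and a Docker Hub account; <dockerhub-user> and myimage are placeholders):

# Build an image from the Dockerfile in the current directory
podman build -t docker.io/<dockerhub-user>/myimage:latest .
# Log in to Docker Hub and push the image
podman login docker.io
podman push docker.io/<dockerhub-user>/myimage:latest

The pushed image can then be referenced in a Pod or Job manifest with image: docker.io/<dockerhub-user>/myimage:latest.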
Accessing a pod from a local browser
proxy + tunnel
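One possible approach, sketched below, combines kubectl port-forward on the cluster with an SSH tunnel from your workstation (port 8888, <user>, and <login-host> are placeholders; use whatever port your service actually listens on):

# On the cluster login node: forward local port 8888 to port 8888 inside the pod
kubectl port-forward pod/python-pod 8888:8888
# On your own machine: tunnel the same port through SSH to the login node
ssh -N -L 8888:localhost:8888 <user>@<login-host>
# Then open http://localhost:8888 in your local browser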