When Assistant Professor of Applied Science Dan Runfola and his geoLab (Geospatial Evaluation and Observation Lab) team asked the Research Computing group (previously called High Performance Computing) for 40 TB of storage space, Executive Director Eric Walter knew they were going to need some extra attention. 40 TB is equivalent to storing 20,000 hours of movies or 4 times the amount of data that the Hubble Space Telescope produces per year.
Luckily, the Research Computing group has a lot of space available. Currently, they have a cluster called SciClone consisting of over 11,000 processor cores located in the Integrated Science Center. Another cluster called Chesapeake is housed at Gloucester Point at William & Mary’s Virginia Institute of Marine Science. Together the interoperable clusters have a theoretical peak performance of 360 teraflops. That is 360 trillion floating-point operations per second, or roughly the computational power of more than 10,000 laptops. Clients of W&M’s Research Computing facilities have included physicists, climate modelers, and now the geoLab team.
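As a quick sanity check on that comparison, the two figures quoted above imply a per-laptop throughput. The snippet below is only back-of-the-envelope arithmetic on the article's own numbers; the variable names are illustrative.

```python
# Figures taken from the article; nothing here is measured.
PEAK_FLOPS = 360 * 10**12   # 360 teraflops theoretical peak
LAPTOPS = 10_000            # "more than 10,000 laptops"

# Because the article says "more than" 10,000, this is an upper bound per laptop.
per_laptop_gflops = PEAK_FLOPS / LAPTOPS / 10**9
print(f"Implied per-laptop throughput: at most {per_laptop_gflops:.0f} gigaflops")
```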
Wanted: Lots of Storage Space
“Initially the geoLab team explored basic issues, like storage,” says Walter. “They moved to high performance computing (HPC) servers to enable large-scale satellite image processing, like global estimation of forest cover for 100,000 locations across the world.” The geoLab team works with massive data sets using satellite imagery and census data to produce highly accurate maps and quantify global issues such as climate vulnerability. Their work draws heavily on convolutional neural network technology, which helps in sorting and identifying satellite images. This means the geoLab needs space to store these images, and a lot of it.
The geoLab used the HPC servers during its first few years, but the lab recently transitioned to its own cluster. According to Walter, this transition was made because the previous computing resources weren’t ideal for the geoLab’s needs. “The work done by the geoLab is much more data and I/O [input/output] intensive than traditional scientific research,” he explains. HPC servers are built for heavy computation, moderate storage, and multi-day jobs. In contrast, geoLab projects require little computation, massive storage, and many small jobs. The new cluster makes querying and moving data much easier than before.
“The geoLab’s cluster was a gift from the Cloudera Foundation, which started a collaboration with William & Mary in late 2018,” Walter says. This cluster uses the Hadoop Distributed File System (HDFS) as well as Apache Spark software. The Research Computing group worked with Cloudera to find a reasonable solution “based on price and performance,” according to Walter. Unlike traditional HPC servers, the geoLab cluster moves the code to the data instead of the other way around. This is essential for the geoLab’s large data sets because the code is much smaller and takes fewer resources to move.
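To get a feel for why moving the code is cheaper, here is a rough, illustrative estimate in Python. Only the 40 TB figure comes from the article; the 10 MB job size and the 10 Gbit/s link are hypothetical assumptions chosen for the example, and the calculation ignores latency, protocol overhead, and the fact that a real cluster spreads data across many nodes.

```python
# Back-of-the-envelope comparison; all figures except the 40 TB are hypothetical.
DATA_BYTES = 40 * 10**12        # 40 TB, the storage request mentioned in the article
CODE_BYTES = 10 * 10**6         # assumed 10 MB job package (hypothetical)
LINK_BITS_PER_SEC = 10 * 10**9  # assumed 10 Gbit/s network link (hypothetical)


def transfer_seconds(num_bytes: int, link_bits_per_sec: int) -> float:
    """Ideal time to push num_bytes over a link, ignoring latency and overhead."""
    return num_bytes * 8 / link_bits_per_sec


data_to_code = transfer_seconds(DATA_BYTES, LINK_BITS_PER_SEC)
code_to_data = transfer_seconds(CODE_BYTES, LINK_BITS_PER_SEC)

print(f"Move data to code: {data_to_code / 3600:.1f} hours")
print(f"Move code to data: {code_to_data * 1000:.0f} milliseconds")
```

Under these assumptions, shipping the full data set takes on the order of hours, while shipping the job takes milliseconds. The exact numbers matter less than the gap between them, which is several orders of magnitude.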
An Ongoing Collaboration
Transitioning to the Hadoop/Spark cluster did not mean that the Research Computing group was off the hook. “This new cluster represented a number of challenges to us,” Walter recalls. “First, both the hardware and software configuration had a few significant differences compared to our usual Research Computing offerings. For instance, this cluster has a different application network, it had different security requirements, it required interfacing with the campus active directory, and the software stack was completely different.” But Walter says the geoLab team has full support from Cloudera, which fixes some issues itself and gives excellent guidance on how to fix the rest.
Today, the geoLab still uses about 1-2% of HPC processing hours per year. Queries to their data download page draw on HPC servers, and they have several ongoing projects through geoData, geoBoundaries, geoDev, and geoParsing, all of which use Research Computing resources as well. These projects include an investigation into the relationship between crops and conflict in northeastern Nigeria, a program to collect road roughness data, and improved data exploration and mapping tools. With the geoLab’s own Hadoop/Spark cluster, Walter says, “they will be able to work with much larger data sets than before.”