You might have noticed that the network was down a couple of weeks ago. Okay, let's be honest, of course you noticed. When the network is down, there is always a huge impact on campus. This time around, a major piece of equipment called the load balancer failed. The failure disrupted Outlook, the W&M website, Banner, Blackboard and anything else that requires your WMuserid to access.
So what is a load balancer? Essentially, the load balancer directs network traffic to servers. Clarke Morledge, IT's Lead Network Architect, explains by comparing the load balancer to a waiting line for an amusement park ride. Imagine the Curse of DarKastle ride at Busch Gardens. The waiting line starts in the garden and then snakes through the castle. When you are getting close to boarding the ride, a ride operator separates passengers into one of four smaller lines. You will hear "Go to line 1" or "Go to line 3" depending on how many people are in your party and how many people are already waiting in the various lines. The ride operator is trying to distribute the passengers evenly and in a way that maximizes the load that each car will carry.
A load balancer is like the ride operator at DarKastle. "The load balancer is the first thing that network traffic hits as it enters the William & Mary network. It takes all the information waiting in line and divvies it up between the various servers for purposes of efficiency – and it performs this task at a very fast speed," Morledge explains. "Without the direction of the load balancer (aka ride operator) the information doesn't know where to go. It just gets stuck."
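To make the ride-operator idea concrete, here is a minimal sketch in Python of one common way to split up traffic: send each new request to whichever server is currently handling the fewest. This is only an illustration. The names `Server`, `pick_server` and `dispatch` are made up for this example, and nothing here describes how the campus load balancers are actually configured.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """A hypothetical server sitting behind the load balancer."""
    name: str
    active: int = 0  # requests this server is currently handling


def pick_server(servers):
    """Return the server with the fewest active requests.

    Ties go to the server listed first, like an operator who
    fills line 1 before line 2 when both are equally short.
    """
    return min(servers, key=lambda s: s.active)


def dispatch(servers, request_count):
    """Assign each incoming request to the least busy server.

    Returns the server names in the order requests were assigned.
    """
    assignments = []
    for _ in range(request_count):
        server = pick_server(servers)
        server.active += 1  # the chosen server is now one request busier
        assignments.append(server.name)
    return assignments


if __name__ == "__main__":
    # Three servers with different starting loads (assumed values).
    pool = [Server("web1", 2), Server("web2", 0), Server("web3", 1)]
    print(dispatch(pool, 3))
```

In this sketch the idle server takes new requests until it catches up with the others, which is the same evening-out the ride operator does with the four boarding lines.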
Sometimes Issues Arise
At William & Mary we have two load-balancing units, a primary and a back-up. In theory, that makes the system fault-tolerant: if one goes down, the other should pick up the slack. The week prior to the outages, IT Engineers noticed a problem with the back-up unit and replaced it without incident. Then something unforeseen happened...
The following week, intermittent problems occurred with the primary load balancer, and we expected the newly replaced back-up unit to support the network. However, this was not the case. Despite previous testing that would suggest otherwise, the back-up unit didn't work... at all. We were in the unfortunate position of having both load balancers in a degraded state: one was only partially operational and the other was non-operational. The network was officially down.
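The primary and back-up arrangement can be sketched in a few lines of Python. This is a simplified illustration, not the logic inside the actual units: the `Unit` class and `choose_active` function are hypothetical, and each unit's health is reduced to a simple yes or no, whereas a real unit can be partially operational.

```python
from typing import Optional


class Unit:
    """A hypothetical load-balancing unit with a yes/no health flag."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy


def choose_active(primary: Unit, backup: Unit) -> Optional[Unit]:
    """Pick which unit should carry traffic.

    Prefer the primary, fall back to the back-up, and return None
    when neither unit is healthy (that is, the network is down).
    """
    if primary.healthy:
        return primary
    if backup.healthy:
        return backup
    return None


if __name__ == "__main__":
    primary = Unit("primary", healthy=False)
    backup = Unit("back-up", healthy=False)
    active = choose_active(primary, backup)
    print(active.name if active else "no unit available")
```

With two units, the design survives any single failure. It has no answer when both units fail at once, which is the case the last branch represents.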
IT Engineers played a vital role in restoring campus connectivity. They reconstructed the primary load balancer and got it working again. This provided a temporary reprieve while we awaited support and new equipment from Citrix, the vendor of the load balancers. Without that repair, the outage could have lasted several days.
We are now deciding how to address this problem so that this situation doesn't occur again. Both units, including the replacement, are about four years old and getting close to end of life anyway. The standard life span is about five to seven years. "We were already actively looking at replacing the load balancers and having the new ones in place by August of this year," says Chief Information Officer Courtney Carpenter. "We haven't yet decided whether or not to use the same vendor or to buy from another. We may also look at redesigning the network architecture to create a third layer of fault-tolerance in our load balancing system."