Two site failover / load distribution

Post by Can200 » Sat, 19 Mar 2005 06:49:19

I'm trying to put together a low-cost HA solution based around two sites,
each with an xDSL connection. At each site I have a web server and a DNS
server running BIND 9. My goal is a solution that distributes users across
the two sites and copes, as seamlessly as possible, with either site failing.

I had originally hoped to achieve this using round-robin DNS, but searches
on Usenet indicate that it satisfies the load distribution requirement and
not the failure requirement. An alternative approach would be to make the
DNS servers at both sites masters, each holding an A record only for its own
site, combined with a low TTL (e.g. the A record on the DNS server at site
one points only at the web server at site one; similarly for site two). This
addresses failures, but not load distribution.

Having searched further, it sounds as though something like lbnamed may be
the solution, but I wondered what experience others on the NG have had with it.


Post by Kevin Darc » Wed, 23 Mar 2005 09:24:51

Round robin _technically_ meets the failure requirement, but some browsers
take so ridiculously long to do address failover that in practical terms
there is no failover. The browser user gives up before the failover
actually occurs.
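
To put a rough shape on that delay, here is a minimal Python sketch of a client that tries each A record in turn. The function name `connect_with_failover`, the injectable `connect` hook, and the 3-second timeout are illustrative assumptions; it does not describe how any particular browser behaves.

```python
import socket
import time


def connect_with_failover(addresses, port=80, timeout=3.0, connect=None):
    """Try each address in order; return (address, seconds_elapsed).

    `connect` is a hypothetical hook so the logic can be exercised without
    a network; by default it opens a real TCP connection.
    """
    if connect is None:
        def connect(addr, port, timeout):
            # Real TCP connect; raises OSError on refusal or timeout.
            socket.create_connection((addr, port), timeout=timeout).close()

    start = time.monotonic()
    for addr in addresses:
        try:
            connect(addr, port, timeout)
            return addr, time.monotonic() - start
        except OSError:
            continue  # dead address: fall through to the next A record
    raise OSError("no address answered")
```

With one dead site listed first, the user waits up to one full timeout before the live site is even tried, which is the wait that makes round-robin failover impractical when the client's timeout is long.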

In the dual-master model, load distribution will occur as a rough function
of how quickly the respective *nameservers* respond (or whether they respond
at all, hence the implicit failover capability in case of total site
failure). But this probably has little or no bearing on how quickly the
*webservers* or other application-level components respond, so you may
find that even under normal situations, your traffic is heavily skewed
to one site or the other.
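
A toy model of that skew, assuming each caching resolver simply queries whichever authoritative nameserver has the lowest measured round-trip time. Real resolvers use smoothed estimates and occasionally re-probe slower servers, so this simplification overstates the effect. The resolver names, site names, and RTT figures are made up.

```python
from collections import Counter


def site_share(resolver_rtts):
    """Return the fraction of resolvers that end up at each site.

    `resolver_rtts` maps a resolver name to {site: rtt_ms}; a nameserver
    that does not respond at all is represented by None.
    """
    picks = Counter()
    for rtts in resolver_rtts.values():
        alive = {site: rtt for site, rtt in rtts.items() if rtt is not None}
        if not alive:
            continue  # resolver cannot reach either nameserver
        # The fastest nameserver hands out its own site's A record.
        picks[min(alive, key=alive.get)] += 1
    total = sum(picks.values())
    return {site: count / total for site, count in picks.items()} if total else {}


# Hypothetical RTTs in milliseconds from four ISP resolvers.
rtts = {
    "isp-a": {"site1": 20, "site2": 35},
    "isp-b": {"site1": 25, "site2": 60},
    "isp-c": {"site1": 30, "site2": 28},
    "isp-d": {"site1": 15, "site2": None},  # site2 nameserver unreachable
}
print(site_share(rtts))
```

In this made-up data three of the four resolvers land on site1, and nothing in the calculation looks at how busy either web server is.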

Also, you can have a situation where the nameserver at one of the sites
is up and running fine, there is network connectivity to the site, but
the webserver or some other component(s) at the site is down. This
dual-master model can be refined to have an automatic process which
monitors the infrastructure and changes the relevant A record --
possibly using the Dynamic Update protocol -- if one site or another
becomes non-functional. Of course, at that point one is starting to
re-invent commercial load-balancing technology...
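
A minimal sketch of such a monitor, assuming a plain HTTP health check and the `nsupdate` utility that ships with BIND 9 to send the Dynamic Update. The zone `example.com`, the record name, the addresses, the 60-second TTL, and the function names are all hypothetical. A real deployment would also want TSIG authentication and some damping so that a single failed probe does not flip the record.

```python
import urllib.request


def site_is_up(url, timeout=5.0):
    """Return True if an HTTP GET to `url` completes without an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        return False


def choose_address(local_addr, remote_addr, local_up, remote_up):
    """Pick the A record this site's nameserver should publish."""
    if local_up:
        return local_addr
    if remote_up:
        return remote_addr  # local web server down: send users to the other site
    return local_addr  # both down: leave the record pointing at the local site


def nsupdate_script(server, zone, name, ttl, address):
    """Build the command text to feed to `nsupdate` on standard input."""
    return "\n".join([
        f"server {server}",
        f"zone {zone}",
        f"update delete {name} A",
        f"update add {name} {ttl} A {address}",
        "send",
        "",
    ])


# Example: site one's monitor finds its own web server down and site two up.
addr = choose_address("192.0.2.10", "198.51.100.10", local_up=False, remote_up=True)
print(nsupdate_script("127.0.0.1", "example.com", "www.example.com.", 60, addr))
```

In practice the two `*_up` flags would come from `site_is_up()` calls run on a timer, and the generated text would be passed to `nsupdate`, for instance with `subprocess.run(["nsupdate"], input=script, text=True)`.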

Never used lbnamed. We use commercially-available load-balancing
devices. However, even with those we end up having to reduce our TTLs to
anti-social levels in order to get the load-balancing and/or failover
granularity we require. A-record-based load-balancing/failover is always
going to be quite imperfect. SRV-record-based load-balancing/failover
shows more promise, but client-software (e.g. browser)
developers/providers are taking a long time to adopt it.
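
For comparison, here is a rough Python sketch of how an SRV-aware client could pick a target: lowest priority first, then a weighted random choice within that priority. It is a simplification of the selection procedure in RFC 2782, and the record values, the `SrvRecord` tuple, and the `is_reachable` callback are hypothetical.

```python
import random
from collections import namedtuple

SrvRecord = namedtuple("SrvRecord", "priority weight port target")


def pick_target(records, is_reachable, rng=random):
    """Choose one SRV target: lowest priority first, weighted within a priority."""
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority and is_reachable(r)]
        if not group:
            continue  # every target at this priority is down: try the next one
        weights = [r.weight for r in group]
        if sum(weights) == 0:
            return rng.choice(group)  # all weights zero: pick uniformly
        return rng.choices(group, weights=weights, k=1)[0]
    return None  # nothing reachable at any priority


# Hypothetical records: a 60/40 split across two sites plus a last-resort target.
records = [
    SrvRecord(10, 60, 80, "www1.example.com."),
    SrvRecord(10, 40, 80, "www2.example.com."),
    SrvRecord(20, 0, 80, "backup.example.com."),
]
```

The point of the design is that both the split (weights) and the fallback order (priorities) are published in the DNS data itself, so the client can fail over without waiting for a low-TTL A record to expire.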

- Kevin