Recursion ceases for 5-10 minutes at random intervals throughout the day

Recursion ceases for 5-10 minutes at random intervals throughout the day

Post by Bill Sprin » Fri, 15 Feb 2008 07:32:41


Hello all,
I was wondering if anyone has run into a recent problem that has me
scratching my head. We just replaced 6 existing non-authoritative Bind
servers with new hardware and a Bind upgrade. They are all in an OSPF
load-balance solution using Zebra (existing). All of the servers suffer
from a lack of resolution at random times of the day for approximately 5
to 10 minutes. The times are seemingly random and do not necessarily
match another servers failure. For example, one has had 18 dropouts in
the last 24hrs, while another hasn't had any - it may be the opposite
tomorrow. The symptoms are a total loss of resolution with an almost
equal amount of failures via bind stats. All users receive, "Server
Failure".
Each server handles anywhere between 500-1500 qps throughout the
day, under normal load. Problem occurs at all loads.
I've tried port, "monitoring", tcpdumping the traffic, and sifting
through the requests and nothing seems out of the ordinary. Numerous
tweaks of the OS have not helped (state table within limits and then
disabled, firewall deactivated/activated, eth stats good). When the
problems happens I can get onto the machine and it is ok (network
upstream good, routing table hasn't inherited anything new, server calm)
When I turn logging up to a level that can help, named can't keep up.
We are now have a troubleshooting process in the works that
involves different hardware and 9.4.2, environment re-architecture, as
well as, <shiver>, other caching dns software.
Is there a known problem, that I haven't been able to find, that
could be causing this? As I understand the, "Server Failure", message
is a general message, could someone help to point me to the next thing
to try? Any help would be appreciated!

- Bill


Pertinent details:
6 servers - all with 2 dual-core AMD Opterons - 4GB Ram
Centos 5 - Linux dnshost1 2.6.18-53.1.6.el5 #1 SMP Wed Jan 23 11:28:47
EST 2008 x86_64 x86_64 x86_64 GNU/Linux

ISC Bind - 9.4.1-P1
./configure --prefix=/usr/local/bind-9.4.1-P1 --enable-threads
--with-openssl=/usr/local/ssl/
(Built without issue or complaint)
GNU Zebra - 0.95a
./configure --disable-ipv6 --enable-nssa
(also built without issue or complaint)

options {
directory "/usr/local/named";
allow-transfer {"allowed";};
allow-recursion {"allowed";};
version "(We'll leave this out...)";
query-source 10.0.0.4 port 32700;
recursive-clients 10000;
listen-on { 10.0.0.1; 10.0.0.2; 10.0.0.3; 192.168.0.1; };
}

(real world IPs substituted to protect the innocent)
10.0.0.4 - bound to eth for upstream lookup
10.0.0.1, 10.0.0.2, 10.0.0.3 are real world IPs bound to loopback for
ospfd.
192.168.0.1 is management
"allowed" is an ACL built to include customer netblocks.
 
 
 

Recursion ceases for 5-10 minutes at random intervals throughout the day

Post by JINMEI Tat » Sat, 16 Feb 2008 13:00:39

At Wed, 13 Feb 2008 17:32:41 -0500,



I cannot think of a reason, but please let me ask something first.

- according to your description, the queries were not dropped, but
were simply responded with server failure, right?
- how much of memory does named use when this occurs?
- how busy (in terms of CPU utilization) is named when this occurs?
- does this change if you disable threads?

Thanks,

---
JINMEI, Tatuya
Internet Systems Consortium, Inc.

 
 
 

Recursion ceases for 5-10 minutes at random intervals throughout the day

Post by Bill Sprin » Sun, 17 Feb 2008 04:48:24

Thanks for your reply

Correct, the requests themselves were answered but just with, "Server
Failure", messages. (always seemed to respond quickly) When it has
happened to me, I was unable to get anything but the error message,
although the graphs indicate ~100qps getting success (perhaps cache?)

(Graph: http://www.yqcomputer.com/ )

The server itself has been relatively flat when it comes to memory
usage. It sits at about 750M. I can set up a process memory graph if
needed.

The CPU does jump up to 25% CPU load from 10%, during the last spike I
checked.

Unfortunately, I haven't tried Bind without thread support. We have had
good luck with threads in testing and prod (especially with 2xdual
Opterons), so I haven't tried it.

Thanks again!
- Bill





--
Bill Springall
Systems Engineer/UNIX Administrator
Cincinnati Bell/ZoomTown.com/Fuse.net
Email: XXXX@XXXXX.COM
Desk: 513.565.9787
______________________________________________________________

Learn avidly.
Question repeatedly what you have learned.
Analyze it carefully.
Then put what you have learned into practice intelligently.
- Confucius
_______________________________________________________________
 
 
 

Recursion ceases for 5-10 minutes at random intervals throughout the day

Post by JINMEI Tat » Fri, 22 Feb 2008 08:57:58

At Fri, 15 Feb 2008 14:48:24 -0500,


Okay (the graph helps).


Yes, a memory graph would also help.


Okay, some additional questions:
- One common reason for SERVFAIL caused internally is memory
allocation failure. are you sure that named does not hit any
(possibly implicit) limitation of memory usage? For example, (at
least some older versions of) FreeBSD has a relatively small upper
limit of datasize. When this occurs, you should normally see log
messages like this:
error: could not mark server as lame: out of memory
(and you don't have to raise the log level to see them because these
are generally categorized as a pretty high-level error).

- Memory related troubles of BIND9 caching server are often due to
overhead of cache cleaning. Can you identify whether cleaning is
performed while you see the problem? To see this, you may want to
apply the patch attached to this message and add (something like)
the following to the logging statement of named.conf:

channel dblog {
file "db.log" versions 5 size 10M;
severity debug 1;
print-severity yes;
print-time yes;
};
category database { dblog; };

Then you'll see something like this in the "db.log" file (under the
appropriate directory):

20-Feb-2008 15:54:46.145 info: begin cache cleaning, mem inuse 33347457
20-Feb-2008 15:54:46.607 info: end cache cleaning, mem inuse 33380881

The attached patch raises the required log level for the cleaning
related log messages in order to keep the entire log output
reasonably quiet.

Frankly, however, I don't think the cleaning overhead is the main
reason for the SERVFAIL since the overhead normally doesn't result
in the error; it would rather cause query drop.

- It would also be helpful if you can periodically keep track of the
number of recursive clients by executing 'rndc status', and
summarize the result in a graph. Failure of recursion due to
recursive-clients quota would cause SERVFAIL errors, although I
doubt this is the case for you as you seem to specify a pretty high
value for this variable.

If none of the above provides any useful hint, I'd like to identify
detailed cause of SERVFAIL by applying a simple patch (if your
operational environment allows that).

Thanks,

---
JINMEI, Tatuya
Internet Systems Consortium, Inc.
 
 
 

Recursion ceases for 5-10 minutes at random intervals throughout the day

Post by JINMEI Tat » Fri, 22 Feb 2008 09:01:56

At Sun, 17 Feb 2008 00:48:00 +0100 (CET),



The cache maintenance (cleaning) overhead might be related to the
issue, but I personally doubt it's the case for this problem. As I
replied in a separate response, it normally causes query drops, rather
than SERVFAIL errors.

BTW, the "more efficient memory allocator" is enabled by default as of
BIND 9.4 (which the OP uses), so it's not helping, unfortunately.

---
JINMEI, Tatuya
Internet Systems Consortium, Inc.
 
 
 

Recursion ceases for 5-10 minutes at random intervals throughout the day

Post by Bill Sprin » Mon, 25 Feb 2008 03:56:09

> Okay (the graph helps).
>
>> The server itself has been relatively flat when it comes to memory
>> usage. It sits at about 750M. I can set up a process memory graph if
>> needed.
>
> Yes, a memory graph would also help.
> Okay, some additional questions:
> - One common reason for SERVFAIL caused internally is memory
> allocation failure. are you sure that named does not hit any
> (possibly implicit) limitation of memory usage? For example, (at
> least some older versions of) FreeBSD has a relatively small upper
> limit of datasize. When this occurs, you should normally see log
> messages like this:
> error: could not mark server as lame: out of memory
> (and you don't have to raise the log level to see them because these
> are generally categorized as a pretty high-level error).

I checked out our logging server and I haven't seen any references to,
"memory" on any of the machines.
I have added a graph per machine to monitor the memory usage of bind
over time. It has almost a day of soak time. I have put it up with a
few other graphs at:

http://home.fuse.net/springall/bind-022108-022208.html

> - Memory related troubles of BIND9 caching server are often due to
> overhead of cache cleaning. Can you identify whether cleaning is
> performed while you see the problem? To see this, you may want to
> apply the patch attached to this message and add (something like)
> the following to the logging statement of named.conf:
> <snip>

It could be, and I wonder if it is choking on something it is cleaning.
I'm not sure how to determine when they are cleaning their caches.
The server I posted has a default cache cleaning time but a max
cache-ttl and ncache-ttl set to 60 seconds. 3 have this configuration,
1 has cleaning-interval of 30min and no (n)ttl settings, and one has no
cache limiting/cleaning settings (defaults).
The problem seems to move around from one group of
primary/secondary servers to another - with different frequency - very
strange. After staring at these graphs for weeks, something makes me
believe it is a specific record or packet, or non-standard upstream
response/query, that is making it hiccup.
I do want to apply the patch you sent. I will work it into an
upcoming night maintenance on these servers to see what we can find.

> - It would also be helpful if you can periodically keep track of the
> number of recursive clients by executing 'rndc status', and
> summarize the result in a graph. Failure of recursion due to

I have added a client connection graph for all hosts (recur and tcp
clients) and have added it to the web page above. So far they are
hovering between anywhere from 130 to ~500 across all 6.

> If none of the above provides any useful hint, I'd like to identify
> detailed cause of SERVFAIL by applying a simple patch (if your
> operational environment allows that).

That would be great. Let me know of the graphs provide any idea and, if
not, I would be more than willing to introduce this patch in to find the
exact cause. In the mean time, I will work on getting the patch you
sent into a running machine during an upcoming maintenance window.

Thanks for you help!

- Bill



JINMEI Tatuya / ???? wrote:

 
 
 

Recursion ceases for 5-10 minutes at random intervals throughout the day

Post by JINMEI Tat » Fri, 29 Feb 2008 08:17:32

At Sat, 23 Feb 2008 13:56:09 -0500,


According to the graph, memory footprint is pretty stable, so if the
problem happened during this period (did it?), it's probably not a
memory related issue.


Hmm, this doesn't look abnormal either.


At the moment, I still cannot think of a specific possible reason.
Please try the attached patch, which will detail the server failure
cases. named.stats will look like this:

+++ Statistics Dump +++ (1204154136)
success 189635
referral 0
nxrrset 48338
nxdomain 109146
recursion 84185
failure 0
duplicate 0
dropped 520
failure1 0
failure2 0
failure3 0
[...]
failure16 0
failure17 0
failure18 0
failure19 5698
failure20 0
[...]
failure32 0
--- Statistics Dump --- (1204154136)

and a new graph containing all the failureXX may hopefully identify
the central cause of the trouble.

Thanks,

---
JINMEI, Tatuya
Internet Systems Consortium, Inc.