TCP connection hang problem (resumes upon new TCP request)

Post by changxu » Mon, 19 Dec 2005 06:27:53

I'm running a simulation with one client machine and four server
machines (all in the same LAN and running Fedora Core 2 with kernel
2.6.5-1.358smp). The client sends about 1.2 million requests (432 bytes
each) through a TCP connection to a server, and the server reads them.

In my first simulation, the client randomly distributes each request to
one of the four servers, and it works fine. However, in my 2nd
simulation, where the client sends all the requests to a central
distributor (running on one of the servers) and the central guy then
distributes the requests to the four servers, the TCP connection between
the client and the central distributor seems to hang after some time
(from a few minutes to half an hour). The client stops writing requests
to the socket and the central guy stops reading from the socket.
But if I launch any other TCP connection request (e.g., telnet
xx.xx.xx.xx 80) to the central distributor machine from another desktop
machine, the program resumes from where it hung (the client starts to
write to the socket and the central distributor starts to read the socket
again), although it hangs again after a while unless I make
another TCP connection to that machine.

Could anyone provide a clue/hint to solve this problem? Thanks. BTW, I
do observe about 12 TCP connections in the TIME_WAIT state on the
central distributor server. They come from another thread of
the server process, which periodically opens a new socket, sends a
performance report through that socket to a remote machine, and then
closes the socket immediately. I guess this is not the cause of
the above problem, but I'm not quite sure.

Post by Enrique Pe » Mon, 19 Dec 2005 08:42:07

Could you arrange for a computer running ethereal or tcpdump, so
one could learn something about the state of the tcp protocol
at the points where it hangs? You know, window sizes, tcp options,
presence of PUSH bit, if the last request is acknowledged, are
there retransmissions, etc.

Also, save a copy of the kernel counters, preferably at points in
time when you know which packets in the capture are included in
the count. Is it possible to signal the programs involved, and have
them report the count of requests sent or received up to that point?
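
For the counter snapshots, something along these lines might do. It is
only a sketch, assuming a Linux machine where /proc/net/snmp is
readable; the function names (parse_snmp, diff_counters,
read_tcp_counters) are made up for illustration and are not part of
any tool mentioned in this thread.

```python
def parse_snmp(text, proto="Tcp"):
    """Return {counter_name: value} for one protocol in /proc/net/snmp text.

    Each protocol appears as two lines with the same prefix: a line of
    counter names followed by a line of values.
    """
    prefix = proto + ":"
    rows = [line.split()[1:] for line in text.splitlines()
            if line.startswith(prefix)]
    if len(rows) < 2:
        raise ValueError("no %s section found" % proto)
    names, values = rows[0], rows[1]
    # Some fields (e.g. MaxConn) can be negative, so use plain int().
    return dict(zip(names, (int(v) for v in values)))


def diff_counters(before, after):
    """Return only the counters that changed between two snapshots."""
    return {name: after[name] - before[name]
            for name in after
            if name in before and after[name] != before[name]}


def read_tcp_counters(path="/proc/net/snmp"):
    """Take one snapshot of the kernel's TCP counters (Linux only)."""
    with open(path) as f:
        return parse_snmp(f.read())
```

Take one snapshot with read_tcp_counters() while the transfer is still
running and another once it hangs, then pass both to diff_counters().
A RetransSegs value that keeps growing while InSegs stands still would
be one thing worth looking for.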

If an unrelated connection request from another workstation
seems to unlock the situation, is there any chance that an
iptables module is involved? Does UDP have similar effects?
ICMP or ping? I see you have port 80 in the example telnet
command; may I guess that port 80 is involved in the simulation
too? When the system hangs, what are the processes involved
doing? Waiting in "select()" or "poll()"? Have you tried to
attach to them with strace? man strace, option "-p".

I really have no idea, but until somebody shows up with one,
it seems natural to do whatever possible to characterize the
situation and find bounds to the problem. Is the hang in
the kernel, in iptables, or in the application? If the process
sleeps on select, are the right file descriptors present in
the fdsets? (strace.)

Twelve connections in TIME_WAIT do not sound overwhelming.
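
To keep an eye on how many sockets sit in each state on the
distributor, a small Python sketch like the one below could be run
periodically. It assumes Linux and the usual /proc/net/tcp layout
(state as a hex code in the fourth column); count_states is a
hypothetical helper name, and the table covers IPv4 sockets only.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_states(path="/proc/net/tcp"):
    """Read the live socket table and count states (Linux only)."""
    with open(path) as f:
        return count_states(f.read())
```

Calling read_states() before and during a hang would show whether the
TIME_WAIT count stays near twelve or grows over time.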