Hitting weird limit in Tcl 8.4.19

Hitting weird limit in Tcl 8.4.19

Post by Adrian H » Thu, 26 Aug 2010 18:13:59


ne of my client's Tcl-based servers recently experienced strange
hangups, and I was asked to help them diagnose the issue. I wrote the
scripts below to see what happens to a simple echo server when thousands
of echo clients tried to connect to it.

I ran the server script on a Linux box, carefully running "ulimit -n
2048" beforehand to bump up the max number of open files to 2048.

I then ran the client script on a second Linux box with "-instances
1000". The server handled all 1000 connections without any issues,
merrily bouncing messages back and forth.

I then ran the client script on a third Linux box, also with "-instances
1000". To my surprise, the server seemed to lock up after the first 21
connections from the second client. In particular, the "MARK" log line
that the server prints every 10 seconds stops appearing, as if the
server's Tcl event loop simply stopped working.

Questions:

[1] What exactly is happening here? It seems as if I hit a limit related
to 1024 open file descriptors, but as I've already bumped the process FD
limit to 2048 *and* the Tcl event loop stopped working, there's evidently
more than meets the eye.

[2] If this is not a bug on my part, is it a known issue? If so, is it
fixed in a later Tcl version? (It's unlikely that my client will be
willing to upgrade their existing Tcl base, so this is just for my own
reference.)

Thanks much for any light shed on this situation.

Best Regards,
Adrian


======================= test-tcp-cli.tcl ===========================
#!/usr/bin/env bash
# \
exec /export/home/invantest/tools/tcl/8.4.19.20091124/bin/tclsh "$0" "$@"

array set ::config {
-host localhost
-port 5746
-instances 10
}

if { $argc > 0 } {
array set ::config $argv
}

array set ::fp {}
array set ::stats {}

proc Log {msg} {
puts stderr "[clock format [clock seconds] -format {%Y%m%d.%H%M%S}]
$msg"
}

proc ResetStats {i} {
set ::stats(push,${i}) 0
set ::stats(read,${i}) 0
}

proc PrintStats {i} {
Log "REPORT: $i: push $::stats(push,${i}), read $::stats(read,${i})"
ResetStats $i
}

proc ReportStats {i} {
PrintStats $i
after 10000 ReportStats $i
}

proc PushData {i} {
if {[catch {puts $::fp($i) "Test string for $i"; flush $::fp($i)} err]}
{
Log "SENDERR: $i: $err"
} else {
incr ::stats(push,$i)
}
}

proc ReadData {i} {
if {[eof $::fp($i)]} {
Log "DISCONN: $i"
catch {close $::fp($i)}
PrintStats $i
after cancel ReportStats $i
OpenSocket $i
} else {
incr ::stats(read,$i)
gets $::fp($i)
after 1 [list PushData $i]
}
}

proc OpenSocket {i} {
if {[catch {set ::fp($i) [socket $::config(-host) $::config(-port)]}
err]} {
Log "CONNERR: $i: $err"
} else {
Log "CONNECT: $i"
ResetStats $i
after 10000 ReportStats $i
fileevent $::fp($i) read [list ReadData $i]
puts $::fp($i) "CLIENT $i"
after 1 [list PushData $i]
}
}

set prefix [info hostname]

for {set i 0} {$i < $::config(-instances)} {incr i} {
OpenSocket ${prefix}-${i}
}

vwait ::forever
====================================================================

======================= test-tcp-srv.tcl ===========================
#!/usr/bin/env bash
# \
exec /export/home/invantest/tools/tcl/8.4.19.20091124/bin/tclsh "$0" "$@"

array set ::config {
-disconnect_secs 60
-port 5746
}

if { $argc > 0 } {
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Gerald W. » Thu, 26 Aug 2010 20:15:01

Which OS?



>...

--
+------------------------------------------------------------------------+
| Gerald W. Lester, President, KNG Consulting LLC |
| Email: XXXX@XXXXX.COM |
+------------------------------------------------------------------------+

 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Alexandre » Fri, 27 Aug 2010 06:56:45


That's the well-known problem of the fd_setsize limitation,
effectively putting an upper bound of 1024 on select-based monitoring
(Tcl belongs to this class). So basically, even though you can open
tens of thousands of fds, only 0..1023 will be amenable to fileevent :
(

Various ideas have been proposed to circumvent that, mainly with
epoll(), but nothing concrete has been committed yet, sorry. In the
meantime you can do what I do: use epoll() in a separate process and
delegate IO to it.

-Alex
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Adrian H » Fri, 27 Aug 2010 15:06:04


Linux all around.

Best Regards,
Adrian
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Adrian H » Fri, 27 Aug 2010 15:24:52


Thanks Alex! I'd forgotten about the fd_setsize limit myself, having
used poll() in favor of select() for a long time now.

That said, why would this limit b0rk the *Tcl event loop*? At the very
least, I would've expected [fileevent] to throw an error when going over
this limit, rather than a catastrophic failure of all event servicing.


I'll certainly keep that in mind. Thanks again!

Best Regards,
Adrian
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Donal K. F » Fri, 27 Aug 2010 17:42:41

On 25 Aug, 22:56, Alexandre Ferrieux < XXXX@XXXXX.COM >


I started to write an alternate notifier that uses those more recent
APIs but it's rather difficult coding (well, it is if you're only
working on it very part time) so it got pushed on to my back-burner.
Probably permanently. (It doesn't help that I'm using OSX Leopard at
the moment, where many of the non-select() APIs are actually broken
due to limitations in kqueue() w.r.t. terminals. Snow Leopard fixes
that, but I've not upgraded.)

Donal.
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Alexandre » Fri, 27 Aug 2010 21:20:19


Yes, I would've expected it too :)
Unfortunately, one minute of source diving shows that one internal
function on the call path through the various abstractions layers ...
is not equipped to report failure :(

void
Tcl_CreateFileHandler(
int fd, /* Handle of stream to watch. */
int mask, /* OR'ed combination of TCL_READABLE,
* TCL_WRITABLE, and TCL_EXCEPTION: indicates
* conditions under which proc should be
* called. */
Tcl_FileProc *proc, /* Function to call for each selected
* event. */
ClientData clientData) /* Arbitrary data to pass to proc. */
;

(and it's stubbed, to make things worse...)

One thing that comes to mind though, is to Tcl_Panic, but I somehow
shied away from proposing the patch since it may break some long-
running application (with an abort()) that formerly "survived" to fds
above 1023 being silent...

-Alex
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Alexandre » Fri, 27 Aug 2010 21:26:05

On Aug 26, 10:42m, "Donal K. Fellows"



Same here :)
I started coding an hybrid notifier that would start its life in
ol'good select() mode due to its economy of fd's (yes, epoll()
consumes one fd, which means +33% for the simplest processes), and
switch to epoll() once an fd above a given threshold is given to
fileevent (it might even be a good idea to set this threshold much
lower than 1024 for performance reasons, since the select() method is
ridden with linear scans of the masks).
But I never finished (backburner syndrom), since the "external epoll"
was a *much* simpler solution to my real-life needs :/ Selfish, eh ?

-Alex
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by DTM » Sun, 29 Aug 2010 12:18:11


Well, to those of use who write client/server applications for Unix/
Linux this is all very expected.

Things seem to hangs up after you have opened 1024 sockets on the
server. This happens to be the default maximum number of simultaneous,
open sockets on a Linux box. The clients are just waiting for a socket
connection that can't be made since the server ran out of sockets.

The operating system can be reconfigured to allow up to 65536
simultaneous, open sockets. However, most programmers just close the
sockets when they are no longer needed and 1024 ends up being enough
sockets.

Dennis LaBelle
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Adrian H » Tue, 31 Aug 2010 16:05:14


Actually, the clients are working just fine (i.e. properly failing to
connect). It's the *server* that locked up, specifically its Tcl event
loop. Also, as I mentioned, I'd already bumped the server process's open
FD limit to 2048, and used two clients so that neither one exceeded 1000
open connections.

I'm convinced that this is actually a Tcl bug, but given Alex's
observations elsewhere in this thread, I hold little hope of this being
fixed soon, if at all.

My major concern at this point is whether this event loop issue is
restricted to registering too many file event sources, or can also happen
under some other event-related conditions, as all my client's Tcl
programs (and my own, for that matter) are completely event-driven. Any
in the know care to comment?

Best Regards,
Adrian
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Alexandre » Tue, 31 Aug 2010 16:30:16


Warning: it is not "registering too many fileevents", it can hit you
even if you open 1021 channels (assuming stdin/out/err are already
there), do absolutely nothing with them, and then open an 1022nd one
and add a single fileevent to it !

Summary: no fd above fd_setsize-1 (==1023) can currently be
fileevent'ed on, period.

Note that it is strictly a select() limitation, and its script-level
impact is mostly restricted to fileevents, though you can imagine
contorted situations where an extension uses an internal fd (eg the X
socket in Tk), but comes to life at a point where all fd slots
[0..1023] are in use, so this fd will be >=1024... Contorted indeed:
extensions mostly do their stuff at init time, well before activity
gets a chance to consume the fd space. It can still be a concern if
the extension creates fds later in its life (eg opening a device on
demand). YMMV.

-Alex
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Adrian H » Tue, 31 Aug 2010 17:13:29


Oddly enough, I'd just modified my test client to connect but not send
anything, then ran the same test again while strace'ing the server. Sure
enough, here's what strace logged:

5258 15:35:52.064705 gettimeofday({1283153752, 64746}, NULL) = 0
5258 15:35:52.064803 select(1024, [3 4 5 ... 1021 1022 1023], [], [],
{6, 71149}) = 1 (in [3], left {6, 71149})
5258 15:35:52.066910 gettimeofday({1283153752, 66967}, NULL) = 0
5258 15:35:52.067069 accept(3, {sa_family=AF_INET, sin_port=htons
(35598), sin_addr=inet_addr("192.168.1.8")}, [16]) = 1024
5258 15:35:52.067281 fcntl64(1024, F_SETFD, FD_CLOEXEC) = 0
5258 15:35:52.067624 gettimeofday({1283153752, 67672}, NULL) = 0
5258 15:35:52.067887 write(2, "20100830.153552 CONNECT: 192.168"..., 55)
= 55
5258 15:35:52.068081 write(2, "\n", 1) = 1
5258 15:35:52.068199 gettimeofday({1283153752, 68240}, NULL) = 0
5258 15:35:52.068297 select(1025, [3 4 5 ... 1021 1022 1023 1024], [0],
[1024], {6, 67655}) = 2 (in [3 1024], left {6, 67655})
5258 15:35:52.070400 gettimeofday({1283153752, 70456}, NULL) = 0
5258 15:35:52.070732 recv(1024,

Looks like the select() call only manipulated the first 1024 fd bits,
hence the call to recv() on an FD which blocks with nothing to read...

So, it's clearly important to ensure that, until/unless Tcl switches to
poll() and its ilk, forcing a 1024-open-FD limit is the right thing to do.

I now have the unhappy task of asking the client to either re-engineer
their protocol to limit the number of simultaneous connections, or write
a poll()-based proxy. 8-)

Best Regards,
Adrian
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Alexandre » Tue, 31 Aug 2010 17:33:33


Glad to see you believe me now ;-)
(I agree, firsthand experience is unbeatable)


By the way, do the established connection contexts have to share a
central process ?
If they don't (ie they are independent), then why not just make your
server process an inetd/xinetd delegate ? Or even do all this in HTTP
flavour and make it a CGI, served by Apache (for example) ?
Granted, this means a fork/exec per client, but avoids the dangerous
fd concentration...

-Alex
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Adrian H » Tue, 31 Aug 2010 18:07:14


I never disbelieved you, but I was concerned that it was not the only way
to b0rk the event loop. My strace session was to get some visibility
about what exactly was happening, and unless I misread the output, it
actually wasn't the event loop that "died", but a blocking read that was
incorrectly triggered due to "undefined behavior" in select().


Yes for performance reasons, but I can't say more without violating their
NDA. 8-)

Best Regards,
Adrian
 
 
 

Hitting weird limit in Tcl 8.4.19

Post by Alexandre » Tue, 31 Aug 2010 18:24:15


Then what about mod_tcl in Apache ? *Much* smaller startup overhead.
Still not fast enough ?

-Alex