Hunting down a fd closing bug in Samba
In Samba I had a failing test suite. I have nss_wrapper compiled with debug messages turned on, so it showed me the following line:
NWRAP_ERROR(23052) - nwrap_he_parse_line: 3 Invalid line[TDB]: 'DB'
The file should parse a hosts file like /etc/hosts, but the debug line showed that it tried to parse a TDB (Trivial Database) file, Samba database backend. I’ve started to investigate it and wondered what was going on. This morning I called Michael Adam and we looked into the issue together. It was obvious that something closed the file descriptor for the hosts file of nss_wrapper and the samba binary opend a different file getting the same fd number assigned. The big question was, what the heck closes the fd. As socket_wrapper was loaded and it wraps the open() and close() call we started to add debug to the socket_wrapper code.
So first we added debug statements to the open() and close() calls to see when the fd was opened and closed. After that we wanted to see a stacktrace at the close() call to see what is the code path were it happens. Here is the code how to do this:
commit 6c632a4419b6712f975db390145419b008442865 Author: Andreas Schneider <> AuthorDate: Thu Mar 26 11:07:38 2015 +0100 Commit: Andreas Schneider <> CommitDate: Thu Mar 26 11:07:59 2015 +0100 DEBUG stacktrace --- src/socket_wrapper.c | 37 +++++++++++++++++++++++++++++++++---- 1 file changed, 33 insertions(+), 4 deletions(-) diff --git a/src/socket_wrapper.c b/src/socket_wrapper.c index 1188c4e..cb73cf2 100644 --- a/src/socket_wrapper.c +++ b/src/socket_wrapper.c @@ -80,6 +80,8 @@ #include <rpc/rpc.h> #endif +#include <execinfo.h> + enum swrap_dbglvl_e { SWRAP_LOG_ERROR = 0, SWRAP_LOG_WARN, @@ -303,8 +305,8 @@ static void swrap_log(enum swrap_dbglvl_e dbglvl, switch (dbglvl) { case SWRAP_LOG_ERROR: fprintf(stderr, - "SWRAP_ERROR(%d) - %s: %s\n", - (int)getpid(), func, buffer); + "SWRAP_ERROR(ppid=%d,pid=%d) - %s: %s\n", + (int)getppid(), (int)getpid(), func, buffer); break; case SWRAP_LOG_WARN: fprintf(stderr, @@ -565,10 +567,35 @@ static int libc_bind(int sockfd, return swrap.fns.libc_bind(sockfd, addr, addrlen); } +#define BACKTRACE_STACK_SIZE 64 static int libc_close(int fd) { swrap_load_lib_function(SWRAP_LIBC, close); + if (fd == 21) { + void *backtrace_stack[BACKTRACE_STACK_SIZE]; + size_t backtrace_size; + char **backtrace_strings; + + SWRAP_LOG(SWRAP_LOG_ERROR, "fd=%d", fd); + + backtrace_size = backtrace(backtrace_stack,BACKTRACE_STACK_SIZE); + backtrace_strings = backtrace_symbols(backtrace_stack, backtrace_size); + + SWRAP_LOG(SWRAP_LOG_ERROR, + "BACKTRACE %lu stackframes", + (unsigned long)backtrace_size); + + if (backtrace_strings) { + size_t i; + + for (i = 0; i < backtrace_size; i++) { + SWRAP_LOG(SWRAP_LOG_ERROR, + " #%lu %s", i, backtrace_strings[i]); + } + } + } + return swrap.fns.libc_close(fd); } @@ -704,6 +731,8 @@ static int libc_vopen(const char *pathname, int flags, va_list ap) fd = swrap.fns.libc_open(pathname, flags, (mode_t)mode); + SWRAP_LOG(SWRAP_LOG_ERROR, "path=%s, fd=%d", pathname, fd); + return fd; }
We found out that the code responsible for this created a pipe() to communitcate with the child and then forked. The child called close() on the second pipe file descriptor. So when another fork happend in the child, the close() on the pipe file descriptor was called again and we closed a fd of the process to a tdb, connection or something like that. So initializing the pipe fd array with -1 and only calling close() if we have a file description which is not -1, fixed the problem.
If you need a better stacktrace you should use libunwind. However socket_wrapper can be a nice little helper to find bugs with file descriptors 😉
BUG: Samba standard process model closes random files when forking more than once