From: Rob Schmuloff <rrschoolie@xxxxxxxxx>
Date: Sat, 26 Jun 2004 20:02:33 -0700 (PDT)

Hello Matt,

First, apologies for the poor formatting of this email...

I have some interesting information regarding the postfix problem. It
seems to affect a few of the postfix daemons: smtpd, local, and cleanup.
The debugging output seems to indicate a race condition between two
processes calling flock and requesting an exclusive lock on the same
file. When the machine wedges, the network stack still seems to be
working (i.e. TCP connects and pings), but the console is locked up and
nothing seems to be running in userland.

I can reproduce this problem by receiving email from the freebsd-current
mailing list (I'm not sure why, other than the high volume of mail and
the fact that their server is running postfix too).

The debugging output from the lockf code (hand copied):

pid 5831 (cleanup) lf_destroy_range: -2401050962867404578..-2401050962867404578
pid 5831 (cleanup) lf_create_range: 0..9223372036854775807
pid 5832 (cleanup) lf_destroy_range: 0..9223372036854775807
pid 5832 (cleanup) lockf 0xd7c22bd4 0..9223372036854775807 type exclusive owned by -1
    blocked locks 9223372036854775807 type exclusive waiting on 0xc0861ec0
pid 5832 (cleanup) lf_destroy_range: -2401050962867404578..-2401050962867404578
pid 5832 (cleanup) lf_create_range: 0..9223372036854775807
pid 5831 (cleanup) lf_destroy_range: 0..9223372036854775807
pid 5831 (cleanup) lockf 0xd7c22bd4 0..9223372036854775807 type exclusive owned by -1
    blocked locks 9223372036854775807 type exclusive waiting on 0xc0861ec0

I wrote a quick prog that seems to replicate the problem if run with
stdout/stderr redirected to /dev/null (or just comment out the printf's):

/*---------------------------------------------------*/
#include <stdio.h>      /* printf */
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/types.h>

int main()
{
    int fd, status, i;
    /* ops[] and opstr[] must stay in the same order. */
    int ops[4] = { LOCK_SH, LOCK_EX, LOCK_SH|LOCK_NB, LOCK_EX|LOCK_NB };
    char *opstr[4] = { "LOCK_SH", "LOCK_EX",
                       "LOCK_SH|LOCK_NB", "LOCK_EX|LOCK_NB" };
    pid_t pid;

    /* Two forks give four processes contending for the same file. */
    fork();
    fork();
    pid = getpid();
    /* i = pid % 2; */
    i = 1;              /* always request a blocking exclusive lock */
    printf("pid %d i = %d\n", (int)pid, i);

    for (;;) {
        /* O_CREAT requires a mode argument. */
        fd = open("testfile", O_RDWR|O_CREAT, 0644);
        /* Retry until flock() returns 0. */
        while ((status = flock(fd, ops[i])) != 0)
            usleep(rand()/20000);
        printf("PID %d got %s lock -- %d\n", (int)pid, opstr[i], status);
        usleep(rand()/20000);
        status = flock(fd, LOCK_UN);
        printf("PID %d released lock -- %d\n", (int)pid, status);
        usleep(rand()/20000);
        close(fd);
    }
}

--- Matthew Dillon <dillon@xxxxxxxxxxxxxxxxxxxx> wrote:
> Rob, I have committed a change to kern/kern_lockf.c that puts a very
> short wait in the lockf retry loop in an attempt to prevent the system
> from locking up when it gets into this livelock.
>
> I don't think this will actually fix the problem (or at least I hope
> it doesn't). I am instead hoping that it will be possible to ktrace
> the processes involved and/or otherwise track the problem down when
> it occurs without the whole machine going down.
>
> -Matt
>
>
> :Hello,
> :
> : I've experienced periodic hangs on my system
> :since mid-May. Lately, my terminal is unresponsive as
> :soon as I start postfix and receive incoming mail.
> :When I drop into DDB, the stack trace shows:
> :
> :scgetc()
> :sckbdevent()
> :atkbd_intr()
> :atkbd_isa_intr()
> :intr_mux()
> :ithread_handler()
> :(Perhaps this is the ctl-alt-esc trace)
> :
> :This is with a kernel/world built June 23rd. I
> :*think* the problem is somewhere in the lockf code
> :because I glanced at 'ps' from DDB. Also, I'm using
> :procmail for local delivery, so that's also a
> :possibility. Sorry for the sparse information. I'm
> :trying to get more data for you..
> :
> :Thanks,
> :
> :Rob
>

Attachment:
tt.c
Description: tt.c
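
A bounded variant of the test program may be easier to run under ktrace,
since on a healthy kernel it terminates on its own. The following is only
a sketch under stated assumptions and is not the attached tt.c: the helper
names lock_rounds and contend, the path and count parameters, and the
1 ms hold time are all hypothetical choices made for illustration. It
keeps the same pattern as the program above (several processes repeatedly
taking a blocking LOCK_EX on one file) but stops after a fixed number of
rounds and reports how many children failed.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not in tt.c): take and release a blocking
 * exclusive flock() on `path` `rounds` times.
 * Returns 0 on success, -1 on the first error. */
int lock_rounds(const char *path, int rounds)
{
    for (int n = 0; n < rounds; n++) {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        /* Blocking request; retry only if a signal interrupted it. */
        while (flock(fd, LOCK_EX) != 0) {
            if (errno != EINTR) {
                close(fd);
                return -1;
            }
        }
        usleep(rand() % 1000);          /* hold the lock for under 1 ms */
        if (flock(fd, LOCK_UN) != 0) {
            close(fd);
            return -1;
        }
        close(fd);
    }
    return 0;
}

/* Hypothetical helper: fork `nprocs` children that all contend for the
 * lock on `path`, then wait for them.
 * Returns the number of children that could not be started, failed,
 * or did not exit normally (0 means every child finished cleanly). */
int contend(const char *path, int nprocs, int rounds)
{
    int started = 0, failed = 0;

    for (int p = 0; p < nprocs; p++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* count unstarted children below */
        if (pid == 0) {
            srand((unsigned)getpid());  /* different sleep pattern per child */
            _exit(lock_rounds(path, rounds) == 0 ? 0 : 1);
        }
        started++;
    }
    for (int p = 0; p < started; p++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed++;
    }
    return failed + (nprocs - started);
}
```

Calling contend("testfile", 4, 1000) from a main() roughly mirrors the two
fork() calls in the program above, which yield four processes. On a kernel
affected by the livelock the call may simply never return, so this is a
way to bound the healthy case, not a guaranteed reproducer.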