DragonFly kernel List (threaded) for 2008-04

Re: altq spinlock and if_start dispatch


From: "Sepherosa Ziehau" <sepherosa@xxxxxxxxx>
Date: Sun, 13 Apr 2008 14:39:04 +0800

On Sun, Apr 13, 2008 at 1:23 AM, Matthew Dillon
<dillon@apollo.backplane.com> wrote:
> :Hi all,
> :
>  :In an experiment I conducted ~one week ago, I found out that ifnet
>  :serializer contention (if_output and NIC's txeof) had a negative effect
>  :on network forwarding performance, so I created the following patch:
>  :http://leaf.dragonflybsd.org/~sephe/ifq_spinlock.diff3
>  :...
>  :
>  :I considered dispatching outgoing mbuf to NIC's interrupt/polling CPU
>  :to do the enqueue and if_start thus to avoid the spinlock, but upper
>  :layer, like TCP, processes ifq_handoff error (e.g. ENOBUFS);
>  :dispatching outgoing mbuf will break original semantics.  However, the
>  :only source of error in ifq_handoff() is from ifq_enqueue(), i.e. only
>  :ifq_enqueue() must be called directly on the output path.  Spinlock is
>  :chosen to protect ifnet.if_snd, so ifq_enqueue() and ifnet.if_start()
>  :could be separated. '1)' has one implication that
>  :ifq_poll->driver_encap->ifq_dequeue does not work, but I think driver
>  :could be easily converted to do ifq_dequeue+driver_encap without
>  :packet loss.  '1)' is the precondition to make '2)' and '3)' work.
>  :'2)' and '3)' together could avoid ifnet serializer contention.  '4)'
>  :is added, since my another experiment shows that serious ipiq overflow
>  :could have very bad impact on overall performance.
>  :
>  :I have gathered following data before and after the patch:
>  :http://leaf.dragonflybsd.org/~sephe/data.txt
>  :http://leaf.dragonflybsd.org/~sephe/data_patched.txt
>  :
>  :boxa --+----> em0 -Routing box- em1 ----> msk0
>  :boxb --+
>  :
>  :boxa (2 x PIII 1G) and boxb (AthlonXP 2100+) each has one 32bit PCI
>  :...
>  :
>  :Fast forwarding is improved a lot in BGL case, probably because the
>  :time consumed on input path is greatly reduced by if_start
>  :dispatching.
>  :This patch does introduce a regression in the MP safe case when
>  :em0/em1 is on a different CPU than the packet's target CPU on the
>  :Routing box, which may be caused by ipi bouncing between the two CPUs
>  :(I didn't find the source of the problem yet).
>  :Fast forwarding perf drops a lot in MP safe case, if em0 and em1 are
>  :on different CPUs; reducing serializer contention does help (~40Kpps
>  :improvement).  Something still needs to be figured out.
>  :
>  :So please review the patch.  It is not finished yet, but the major
>  :part has been done and I want to call for review before I stray
>  :too far away.
>  :
>  :Best Regards,
>  :sephe
>  :
>  :--
>  :Live Free or Die
>
>     Ok, I think one of the biggest issues is the MP lock.  I'm assuming
>     you were running with kern.intr_mpsafe set to 1 for these tests?

Yeah, I ran with kern.intr_mpsafe set to 1 on that box:

enigma:~# ps ax | grep ithread
   -1  ??  BLM    0:00.00  (ithread 10)
   -1  ??  BLM    0:00.00  (ithread 15)
   -1  ??  BLM    0:00.00  (ithread 14)
   -1  ??  BLM    0:00.00  (ithread 5)
   -1  ??  BLM    0:00.01  (ithread 11)
   -1  ??  BL     0:00.00  (ithread 1)
   -1  ??  BL     0:10.45  (ithread 7)
   -1  ??  BLM    0:00.00  (ithread 4)
   -1  ??  BLM    0:00.00  (ithread 64)
   -1  ??  BLM    0:00.00  (ithread 9)
   -1  ??  BLM    0:00.00  (ithread 67)
   -1  ??  BL     0:00.01  (ithread 69)
   -1  ??  BLM    0:00.00  (ithread 66)
   -1  ??  BLM    0:00.00  (ithread 68)
   -1  ??  BLM    0:00.00  (ithread 0)
   -1  ??  BL     0:00.00  (ithread emerg)

Note that ithread 7 runs under the BGL, which can affect the test results:
enigma:~# dmesg | grep "irq 7"
ppc0 port 0x3bc-0x3bf irq 7 on acpi0
enigma:~# vmstat -i | grep ppc
ppc0                      1836642       3636


When the MP safe tests were conducted:
enigma:~# ps ax | grep netisr
   -1  ??  BLM    0:22.31  (netisr_cpu 0)
   -1  ??  BLM    0:27.75  (netisr_cpu 1)
  795  p0  RL+    0:00.00 grep netisr
enigma:~# ps ax | grep udp_thread
   -1  ??  BLM    0:00.00  (udp_thread 1)
   -1  ??  BLM    0:00.00  (udp_thread 0)
  797  p0  RL+    0:00.00 grep udp_thread

Since all generated packets were UDP packets (which go through
udp_thread) and the two em interfaces used polling(4) (which runs in
the netisr threads), I think the forwarding path was MP safe.

>
>     The way to determine whether the MP lock is creating a problem or
>     not is to use the KTR logging facility and log all lock contention
>     (and maybe even add additional KTR facilities to track exactly what
>     the packet path is doing).
>
>     None of the tests are going to give us real results unless the MP
>     lock is completely out of the test path.  That should be the #1
>     issue before we start splitting the serializer locks.   I suspect
>     some of your IPI bouncing issues may be due to the MP lock.

The following data made me think serializer contention might be the
cause of the fast forwarding performance drop in the MP safe case:
hw.em1.serializer_sleep: 635969
hw.em1.serializer_tryfail: 2402027
hw.em1.serializer_enter: 7084821
hw.em1.serializer_try: 5077636

The test ran for 30 seconds with 10-byte UDP datagrams; em0 was on cpu0
and em1 on cpu1.  em0 only did input work, em1 only did output work.
Since fast forwarding does the output work on the input path, em1's
serializer contention (if_output[CPU0] <--> txeof[CPU1]) was
unavoidable (before patching) in this configuration.
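
(Reading these counters, and assuming serializer_try/serializer_tryfail
count trylock attempts and failures while serializer_enter/
serializer_sleep count blocking acquisitions and the ones that actually
slept, roughly 2402027/5077636 ~= 47% of the trylocks on em1's
serializer failed and 635969/7084821 ~= 9% of the blocking acquisitions
had to sleep over the 30 second run.)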

These data were gathered with SERIALIZER_PROFILE in the kernel configuration
and with:
kern.intr_mpsafe="1"
netisr.mpsafe_thread="1"
net.inet.udp.mpsafe_thread="1"

In the BGL case there was no serializer contention, i.e.
hw.em1.serializer_sleep and hw.em1.serializer_tryfail were both 0,
presumably because the BGL already keeps the input path and the txeof
path from running concurrently.

With the patch (i.e. with the serializer contention eliminated), things
improve a little (not as much as I imagined :P), so something other
than serializer contention must be playing a more important role.
Yeah, KTR can help to find it out :)
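
For the record, a KTR probe pair on the forwarding/dispatch path could
look roughly like the sketch below; it is modelled on the ipiq probes in
the kernel, and the master/event names, the KTR_IFSTART define and the
exact KTR_INFO() arguments are only illustrative here (they have to be
checked against sys/ktr.h), not something already in the tree:

/*
 * Sketch only: a pair of KTR events around the if_start dispatch.
 * Names and the exact KTR_INFO() arguments are illustrative and must
 * be checked against sys/ktr.h.
 */
#include <sys/ktr.h>

#if !defined(KTR_IFSTART)
#define KTR_IFSTART	KTR_ALL		/* compile in unless overridden */
#endif

KTR_INFO_MASTER(ifstart);
KTR_INFO(KTR_IFSTART, ifstart, sched, 0,
	 "ifp=%p cpu=%d", sizeof(void *) + sizeof(int));
KTR_INFO(KTR_IFSTART, ifstart, run, 1,
	 "ifp=%p cpu=%d", sizeof(void *) + sizeof(int));

#define logifstart(name, ifp)	\
	KTR_LOG(ifstart_ ## name, ifp, mycpuid)

/*
 * Then, on the dispatch path:
 *	logifstart(sched, ifp);	 when handing if_start to another cpu
 *	logifstart(run, ifp);	 right before calling ifp->if_start()
 * and the resulting KTR log shows where the time actually goes.
 */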

>
>  :The ideas behind this patch are:
>
> :1) altq is protected by its own spinlock instead of ifnet's serializer.
>
>     I like the idea.
>
>  :2) ifnet's serializer is pushed down into each ifnet.if_output
>  :implementation, i.e. if_output is called without ifnet's serializer
>  :being held
>
>     I agree that having just one serializer for a network device driver
>     is not enough.  Ultimately I think we do want two or three
>     serializers:  One for the interrupt (RX packet handling and TX
>     packet cleanup after xmit has completed), one for transmit packet
>     queueing, and possibly one for the control path (ifconfig etc).
>     This last one doesn't really need to exist, the kernel could
>     instead acquire both of the other two serializers when doing control
>     operations.
>
>     So, yes, the TX packet queue to interface mechanic would definitely
>     have its own serializer.
>
>     But I think the only way to do this cleanly is to have the callers
>     into the API acquire the appropriate serializers instead of forcing
>     each individual network driver to acquire them.  Remember how easy it

Nah, if_output is the generic one (ether_output) for all of the real
hardware network drivers.  In ether_output(), none of the code before
ifnet.if_start() needs ifnet's serializer, so if we dispatch if_start,
the whole of ether_output no longer needs the ifnet serializer.  14
pseudo devices use their own if_output and do need to hold the
serializer.  After investigating all if_output callers and their
implementations, I think pushing the holding of the ifnet serializer
down into the if_output implementations (note: not into each driver :)
is as easy as having the callers hold the ifnet serializer.  By pushing
it down, we could also completely eliminate the cost of holding the
ifnet serializer wherever the if_output implementation does not need it
(e.g. ether_output after patching, and 6 other pseudo devices).
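
To make that concrete, here is a minimal sketch of the pushdown, under
the assumption that the patched ifq_enqueue() takes the new if_snd
spinlock internally; ether_output_handoff(), if_start_dispatch() and
pseudo_output_locked() are illustrative names, not code from the patch:

/*
 * Sketch only: shapes of an if_output implementation that no longer
 * needs the ifnet serializer vs. one that still does.
 */
#include <sys/param.h>
#include <sys/socket.h>
#include <sys/mbuf.h>
#include <sys/serialize.h>
#include <net/if.h>
#include <net/if_var.h>
#include <net/route.h>

struct altq_pktattr;

static void if_start_dispatch(struct ifnet *);	/* sketched further down */
static int  pseudo_output_locked(struct ifnet *, struct mbuf *,
		struct sockaddr *, struct rtentry *);	/* hypothetical */

/* Tail of ether_output(): no ifnet serializer is taken at all. */
static int
ether_output_handoff(struct ifnet *ifp, struct mbuf *m,
		     struct altq_pktattr *pktattr)
{
	int error;

	/* ifq_enqueue() is assumed to grab the new if_snd spinlock itself. */
	error = ifq_enqueue(&ifp->if_snd, m, pktattr);
	if (error)
		return (error);		/* e.g. ENOBUFS, still seen by TCP */
	if_start_dispatch(ifp);		/* run if_start on the NIC's cpu */
	return (0);
}

/*
 * A pseudo device's if_output still wants the ifnet serializer, but
 * now acquires it itself instead of relying on the caller.
 */
static int
pseudo_if_output(struct ifnet *ifp, struct mbuf *m,
		 struct sockaddr *dst, struct rtentry *rt)
{
	int error;

	lwkt_serialize_enter(ifp->if_serializer);
	error = pseudo_output_locked(ifp, m, dst, rt);
	lwkt_serialize_exit(ifp->if_serializer);
	return (error);
}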

>     was for me to implement the original serializer plan?  I want things
>     to continue to be that easy.
>
>  :3) ifnet.if_start is dispatched to the CPU where NIC's interrupt is
>  :handled or where polling(4) is going to happen
>
>     This shouldn't be necessary.  This sounds like an issue that the MP
>     lock might have been affecting.  Acquiring the TX packet interface
>     enqueue serializer ought to be enough and then simply directly call
>     the entry point (and also using your if_start trick below).

In addition to avoiding if_start/if_output serializer contention
between CPUs, the if_start dispatching could also avoid serializer
contention between if_start/if_output and driver's txeof+if_start
processing.
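
Roughly, the dispatch I have in mind looks like the sketch below.  It is
simplified (it leaves out the ipi-avoidance of '4)'), and the
if_start_cpuid field plus the helper names are assumptions for
illustration, not the patch itself:

/*
 * Sketch only: run if_start on the cpu that owns the NIC's
 * interrupt/polling, so it never contends with txeof for the
 * ifnet serializer.
 */
#include <sys/param.h>
#include <sys/globaldata.h>
#include <sys/serialize.h>
#include <sys/thread.h>
#include <net/if.h>
#include <net/if_var.h>

static void
if_start_remote(void *xifp)
{
	struct ifnet *ifp = xifp;

	lwkt_serialize_enter(ifp->if_serializer);
	if ((ifp->if_flags & IFF_OACTIVE) == 0 && !ifq_is_empty(&ifp->if_snd))
		ifp->if_start(ifp);
	lwkt_serialize_exit(ifp->if_serializer);
}

static void
if_start_dispatch(struct ifnet *ifp)
{
	int cpu = ifp->if_start_cpuid;	/* assumed field: NIC's intr/poll cpu */

	if (cpu == mycpuid) {
		if_start_remote(ifp);
	} else {
		/* Queue if_start_remote() to run on the NIC's cpu. */
		lwkt_send_ipiq(globaldata_find(cpu), if_start_remote, ifp);
	}
}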

Best Regards,
sephe

>
>  :4) ifnet.if_start's ipi sending is avoided as much as possible, using
>  :the same mechanism as avoiding real ipi sending in ipiq implementation
>
>     I like this part of the patch.  Definitely a good idea.
>
>                                         -Matt
>                                         Matthew Dillon
>                                         <dillon@backplane.com>
>



-- 
Live Free or Die


