
Recent MMU, Cached I/O, and Scheduler work now in master


From: Matthew Dillon <dillon@xxxxxxxxxxxxxxxxxxxx>
Date: Tue, 18 Sep 2012 17:07:32 -0700 (PDT)

    One of our wonderful GSOC projects this summer resulted in Mihai
    Carabas adding a cpu topology awareness framework to DragonFly and
    doing work on the scheduler to make it more topology aware.

    This started several of us on a performance benchmarking binge, with
    Francois Tigeot in particular running postgres/pgbench tests on a
    12x2 (24-thread) Xeon box, and me running tests on a smaller 4x2
    (8-thread) Xeon box and on our larger 48-core Opteron box.

    In the last month the master branch has gone through some radical
    changes.  All the work is in, but some of it is still experimental
    and requires a sysctl to turn on.

    * PMAP MMU optimizations for 64-bit systems.  We noticed that when
      postgres servers are used with very large shared memory areas,
      either with SYSV SHM or mmap(), each postgres server process (which
      is forked rather than threaded) has to fault in tens of thousands
      of pages.

      When you multiply that by a potentially large number of postgres
      server processes it turns into millions of faults.  In addition,
      each process maintains its own complete copy of the page table.

      This optimization works for SYSV SHM as well as any large shared or
      read-only mmap of anonymous or file-backed data.  The optimization
      causes the actual page table pages themselves to be cached in the
      backing VM object (thus not subject to destruction when processes
      using the mappings fork() or exit), and the individual MMU maps for
      each process actually share the page tables by mapping shared page
      table pages.  This removes nearly ALL page faults from a warmed-up
      postgres server, even if there are hundreds of postgres server
      processes forked and even when it does fresh fork()s.  In addition,
      most of the page tables for these processes are now shared (even
      though they were forked and not threaded), thus making far better
      use of cpu memory caches.

	sysctl machdep.pmap_mmu_optimize=1

	(still experimental)
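
      As an illustration only (this is not code from the commit), the
      sketch below shows the kind of workload the optimization targets:
      a parent maps a large shared anonymous region, warms it up, and
      forks a number of workers that each touch every page.  The region
      size and worker count are arbitrary values chosen for the example.

	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	#define REGION_SIZE	(1UL << 30)	/* 1GB, arbitrary */
	#define NWORKERS	8		/* arbitrary */

	int
	main(void)
	{
		char *base;
		int i;

		/* large shared anonymous mapping, standing in for the
		   postgres shared buffers */
		base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
			    MAP_SHARED | MAP_ANON, -1, 0);
		if (base == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		memset(base, 0, REGION_SIZE);	/* warm up the region */

		for (i = 0; i < NWORKERS; ++i) {
			if (fork() == 0) {
				volatile char c = 0;
				size_t off;

				/* touch every page of the shared mapping;
				   with machdep.pmap_mmu_optimize=1 the page
				   table pages backing it can be shared
				   across the forks */
				for (off = 0; off < REGION_SIZE; off += 4096)
					c += base[off];
				_exit(0);
			}
		}
		while (wait(NULL) > 0)
			;
		return (0);
	}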

    * Read shortcut through the VM system (integrated w/HAMMER for now).
      This doubles the performance of read() system calls from the cache
      which would otherwise cause the buffer cache to cycle (when the VM
      page cache is big enough to cache the data set but the buffer cache
      is not).  In this situation the cycling of the buffer cache causes
      a large number of SMP MMU invalidations due to the constant adjusting
      of VM page mappings in kernel memory.

      With this shortcut, cached file data read with read() is copied out
      using the DMAP instead of the buffer cache, not only improving read()
      performance but also significantly improving all activities on
      multi-core systems due to the reduced kernel page smashing.

	sysctl vm.read_shortcut_enable=1

	(still experimental)
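
      Purely as an illustration again (not code from the commit), the
      sketch below shows the read() pattern that benefits: repeatedly
      reading a file that fits in the VM page cache but is larger than
      the buffer cache.  The file path, buffer size and pass count are
      arbitrary.

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int
	main(int argc, char **argv)
	{
		char buf[65536];
		ssize_t n;
		int fd, pass;

		if (argc != 2) {
			fprintf(stderr, "usage: %s file\n", argv[0]);
			exit(1);
		}
		/* repeatedly read a file that stays cached in the VM page
		   cache; with vm.read_shortcut_enable=1 the copyout comes
		   through the DMAP instead of cycling buffer cache mappings */
		for (pass = 0; pass < 10; ++pass) {
			fd = open(argv[1], O_RDONLY);
			if (fd < 0) {
				perror("open");
				exit(1);
			}
			while ((n = read(fd, buf, sizeof(buf))) > 0)
				;
			close(fd);
		}
		return (0);
	}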

    * Scheduler rewrite.  Mihai Carabas made large strides in scheduler
      performance on larger servers with his cpu topology awareness framework
      and his work on our user thread scheduler.  However, there were still
      significant limitations in the scheduler due to its original design.

      The original scheduler was essentially single-threaded, using a global
      spinlock to protect a single global scheduling run queue.  This led
      to a number of SMP-related bottlenecks in the scheduler as well as
      complicating the algorithms.

      I have now completed a rewrite of the scheduler that incorporates
      Mihai's cpu topology infrastructure and rewrites the algorithms to
      utilize the new scheduler framework.

      The new scheduler utilizes per-cpu queues and fine-grained per-cpu
      spinlocks.  There are no global spinlocks, removing that bottleneck.

      The new scheduler rewrites the cpu topology algorithms to implement
      a top-down (whole-machine -> socket -> core -> hyperthread) scheduling
      implementation, performing three major algorithmic actions:

      (1) It generates a load factor at all levels and load-balances the
	  assignment of processes to cpus in a topology-aware framework.
	  This means that if you have 4 processes running in a 4x2 (8-thread)
	  environment, they will be scheduled to cores and not to competing
	  hyperthreads.  If you have two cpu sockets and two processes, one
	  will be scheduled to each socket to make best use of their caches.

      (2) It will try to avoid migrating processes when possible, and when
	  not possible it will try to keep them nearby from a
	  topological standpoint.

      (3) It will detect process block/wakeup events which, for example,
	  tie two processes together, and will try to move such process
	  pairs closer to each other using that information.

	  For example, if you have many postgres clients and servers on a
	  large server, enough to load down all cores, the client and
	  server pairs will be localized to the same socket, thus making
	  use of chip caches to facilitate communications between the two
	  processes.

		(the scheduler changes are now the default in master)
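
      To make the top-down idea concrete, here is a deliberately simplified
      sketch.  It is NOT the actual DragonFly scheduler code; the structures
      and tie-breaking rule are invented for illustration and leave out the
      per-cpu queues, spinlocks and block/wakeup pairing described above.
      It simply walks the topology from the whole machine down, picking the
      least-loaded child at each level and preferring the branch the process
      last ran on when loads are equal.

	#include <stdio.h>

	struct topo_node {
		struct topo_node *children;	/* sockets -> cores -> threads */
		int		  nchildren;	/* 0 at the leaf (cpu) level */
		int		  load;		/* aggregate load factor */
		int		  cpuid;	/* valid only at the leaf level */
	};

	/* does this branch of the topology contain the given cpu? */
	static int
	contains_cpu(const struct topo_node *node, int cpu)
	{
		int i;

		if (node->nchildren == 0)
			return (node->cpuid == cpu);
		for (i = 0; i < node->nchildren; ++i)
			if (contains_cpu(&node->children[i], cpu))
				return (1);
		return (0);
	}

	/*
	 * Walk from the root (whole machine) toward a leaf, choosing the
	 * least-loaded child at each level; on a tie prefer the branch the
	 * process last ran on (prev_cpu) to avoid needless migration.
	 */
	static int
	pick_cpu(struct topo_node *node, int prev_cpu)
	{
		while (node->nchildren != 0) {
			struct topo_node *best = &node->children[0];
			int i;

			for (i = 1; i < node->nchildren; ++i) {
				struct topo_node *child = &node->children[i];

				if (child->load < best->load ||
				    (child->load == best->load &&
				     contains_cpu(child, prev_cpu)))
					best = child;
			}
			node = best;
		}
		return (node->cpuid);
	}

	int
	main(void)
	{
		/* toy 2-socket x 2-thread machine, cpu ids 0-3 */
		struct topo_node cpus[4] = {
			{ NULL, 0, 1, 0 }, { NULL, 0, 3, 1 },
			{ NULL, 0, 0, 2 }, { NULL, 0, 2, 3 },
		};
		struct topo_node sockets[2] = {
			{ &cpus[0], 2, 4, -1 },
			{ &cpus[2], 2, 2, -1 },
		};
		struct topo_node machine = { sockets, 2, 6, -1 };

		printf("schedule on cpu %d\n", pick_cpu(&machine, 1));
		return (0);
	}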

    * Finally, default values for many things on 64-bit machines have been
      adjusted upward significantly to make proper use of available
      resources.  Numerous caps that had been inherited from the 32-bit
      code are now gone or greatly raised, particularly with regard to
      SYSV shared memory and the buffer cache.
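
      As a small, hedged example of checking such limits from a program:
      the sysctl names below (kern.ipc.shmmax and kern.ipc.shmall) are the
      usual BSD SYSV SHM knobs and are given from memory rather than from
      the commit; sysctl -a shows the authoritative list on a given system.

	#include <sys/types.h>
	#include <sys/sysctl.h>
	#include <stdio.h>

	int
	main(void)
	{
		unsigned long shmmax = 0, shmall = 0;
		size_t len;

		len = sizeof(shmmax);
		if (sysctlbyname("kern.ipc.shmmax", &shmmax, &len, NULL, 0) == 0)
			printf("kern.ipc.shmmax = %lu\n", shmmax);

		len = sizeof(shmall);
		if (sysctlbyname("kern.ipc.shmall", &shmall, &len, NULL, 0) == 0)
			printf("kern.ipc.shmall = %lu\n", shmall);

		return (0);
	}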

    The result is an IMMENSE improvement in postgres benchmarks as well as
    across-the-board improvements in performance under load.  We pretty
    much outstrip the other BSDs now, and we come fairly close to (though
    do not quite beat) the higher-end Linux benchmarks.

    In addition, the new scheduler algorithms affect many other system
    activities, such as source code builds (which make heavy use of pipes),
    web servers, and even interactive vs batch processing.

    Francois will post updated graphs today or tomorrow showing the immense
    progress we've made.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


