DragonFly kernel List (threaded) for 2010-02
kernel work week of 3-Feb-2010 HEADS UP - TESTING & WARNINGS
WARNING WARNING WARNING
This concerns the work that has been, and continues to be, committed to
the development branch with regard to the vm.swapcache sysctls.
These features are HIGHLY EXPERIMENTAL. If you turn on the swapcache
by setting vm.swapcache.{read,meta,data}_enable to 1 (read_enable being
the most dangerous since that actually turns on the intercept), you
risk losing EVERY SINGLE FILESYSTEM mounted RW to corruption.
I want to be very clear here. The swap cache overrides vn_strategy()
reads from the filesystem and gets the data from the swap cache instead.
It will do this for regular file data AND (if enabled) for meta-data.
Needless to say, if it gets it wrong and the filesystem modifies and
writes back some of that meta-data, the filesystem will blow up and the
media will be corrupted. It has been tested for, well, less than a day
now. Anyone using these options needs to be careful for the next few
weeks.
If you do not enable any of these sysctls you should be safe.
People who wish to test the swap cache should do so on machines they
are willing to lose the ENTIRE machine's storage to corruption. The
swap cache operates system-wide. Any direct-storage filesystem (e.g.
UFS or HAMMER) is vulnerable. NFS is safer, but data corruption is
still possible.
    vm.swapcache.read_enable    Very dangerous, enables intercept.
                                Moderately dangerous if meta_enable
                                is turned off.
    vm.swapcache.meta_enable    Very dangerous.
    vm.swapcache.data_enable    Moderately dangerous.
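If you do decide to test, enabling the cache is just a matter of
flipping those sysctls. A minimal sketch, again only on a disposable
test box whose storage you are willing to lose:

    # DANGER: only on a disposable test machine.
    sysctl vm.swapcache.data_enable=1
    sysctl vm.swapcache.meta_enable=1
    # read_enable turns on the actual intercept; nothing is served from
    # the swap cache until it is set.
    sysctl vm.swapcache.read_enable=1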
If you want to play with this and do not have a dedicated machine with
a dedicated HD and SSD to test on, VMs and vkernels are recommended.
But, of course, these features are designed for SSD swap and you will
not be able to do any comparative tests unless you have SSD swap and
a normal HD (or NFS) for your main filesystem(s). And, as it currently
stands, a 15K rpm HD hasn't been worked in yet and writes are currently
very fragmented. So it is SSD swap or nothing, basically.
You are not going to see any improvement unless you actually have an SSD.
WARNING WARNING WARNING
---
Ok, that said, there is still a ton of work I have to do. I am not
doing any write-clustering yet and I am not doing any proactive disposal
of stale swap cache data to make room for new data yet. Vnode recycling
can cause swap cache data to be thrown away early, as well. I expect
to add improvements in the next week and a half or so. So keep that in
mind.
Also note that the write rate is limited... the initial write rate is
further limited by vm.swapcache.maxlaunder, so do not expect the data
you are reading from your HD at 60MB+/sec to all get cached to SSD
swap on the first pass. The current algorithms are very primitive.
With all these caveats, the basic functionality is now operational.
The commit message details how the sysctls work:
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/c504e38ecd4536447026bf29e61a391cd2340ec3
You have fine control over write bursting and the long-term average
write bandwidth via sysctl. The defaults are fairly generous.
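As a sketch of what that control looks like in practice (the values
are the ones used later in the sample test, not recommendations):

    # List the swapcache knobs and their current values:
    sysctl vm.swapcache
    # Raise the write burst limits and the per-pass launder count, as
    # in the sample test below:
    sysctl vm.swapcache.curburst=10000000000
    sysctl vm.swapcache.maxburst=10000000000
    sysctl vm.swapcache.maxlaunder=1024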
Currently since there is no proactive recycling of ultra stale swap
(short of vnode recycling), the only testing that can really be done
is with data sets smaller than 2/3 swap space and larger than main
memory.
Default maximum swap on i386 is 32G.
Default maximum swap on x86_64 is 512G.
The default maximum swap on i386 can be changed with the kern.maxswzone
loader tunable. This is a KVM allocation of approximately one megabyte
per gigabyte of swap, so e.g. kern.maxswzone=64m would allow you to
configure up to ~64G of swap. The problem on i386 is the limited KVM.
On x86_64 you can configure up to 512G of swap by default.
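Since it is a loader tunable, it has to be set at boot, e.g. in
/boot/loader.conf. A sketch for allowing ~64G of swap on i386:

    # /boot/loader.conf (i386 only): raise the swap metadata KVM limit
    # so that up to ~64G of swap can be configured.
    kern.maxswzone="64m"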
---
Sample test:
* md5 6.6G test file on machine w/ 3G of ram, in a loop. This is on
my test box, AHCI driver, Intel 40G SSD (SATA INTEL SSDSA2M040 2CV1).
16G of swap configured.
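The loop itself is nothing fancy; roughly the following (a csh sketch,
since the timings below look like csh's time builtin, and the path is
the test file from the listing that follows):

    # Repeatedly checksum the 6.6G test file and time each pass.
    while (1)
        time md5 /usr/obj/test4
    end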
-rw-r--r-- 1 root wheel 6605504512 Feb 4 15:42 /usr/obj/test4
MD5 (test4) = aed3d9e3e1fe34620f40e4f9cb0dbcda
15.344u 5.272s 2:19.28 14.7% 83+93k 8+0io 4pf+0w
15.194u 5.788s 2:05.37 16.7% 79+88k 6+0io 2pf+0w
(1G initial swap burst exhausted)
(write rate now limited to 1MB/s)
15.459u 5.861s 2:04.82 17.0% 76+85k 6+0io 2pf+0w
15.318u 6.194s 2:03.70 17.3% 82+92k 6+0io 6pf+0w
15.286u 5.960s 2:01.09 17.5% 95+106k 4+0io 2pf+0w
15.321u 6.179s 1:59.48 17.9% 80+90k 4+0io 4pf+0w
15.391u 5.687s 1:58.71 17.7% 81+91k 6+0io 4pf+0w
(set curburst and maxburst to 10G (10000000000))
(write rate limited by vm.swapcache.maxlaunder, set to 1024)
(write rate to SSD is approximately 4-8MB/sec)
15.181u 6.437s 1:53.42 19.0% 82+92k 6+0io 2pf+0w
15.276u 5.891s 1:42.72 20.5% 82+92k 6+0io 2pf+0w
15.581u 5.774s 1:31.11 23.4% 81+91k 4+0io 0pf+0w
15.643u 6.062s 1:27.76 24.7% 81+90k 4+0io 0pf+0w
(SSD now doing about 50-100MB/sec, mostly reading)
(HD now doing about 15-30MB/sec reading)
(5G now cached in the SSD)
14.910u 5.477s 1:15.48 27.0% 86+97k 6+0io 6pf+0w
15.182u 5.633s 1:21.64 25.4% 82+92k 4+0io 0pf+0w (glitch)
14.762u 5.712s 1:12.13 28.3% 87+97k 6+0io 2pf+0w
14.932u 5.804s 1:16.70 27.0% 84+94k 4+0io 0pf+0w
(HD activity now sporadic, but has bursts of 20-30MB occasionally)
15.183u 5.625s 1:09.28 30.0% 85+95k 6+0io 4pf+0w
15.245u 5.648s 1:12.79 28.6% 83+93k 4+0io 0pf+0w
15.332u 5.852s 1:08.02 31.1% 80+90k 4+0io 0pf+0w
15.505u 5.712s 0:59.95 35.3% 85+96k 6+0io 4pf+0w
(HD activity mostly 0, but still has some activity)
15.521u 5.485s 0:59.20 35.4% 81+91k 4+0io 2pf+0w
15.381u 5.334s 0:54.01 38.3% 84+94k 6+0io 2pf+0w
16.022u 5.455s 0:50.13 42.8% 78+88k 4+0io 0pf+0w
15.702u 5.345s 0:50.16 41.9% 77+86k 6+0io 2pf+0w
(HD activity now mostly 0, SSD reading 120-140MB/sec, no SSD writing)
15.850u 5.243s 0:50.12 42.0% 82+92k 6+0io 2pf+0w
15.397u 5.337s 0:50.21 41.2% 82+92k 4+0io 0pf+0w
That appears to be steady state. The SSD is doing around
3000 tps, 90-100% busy, 120-140MB/sec reading continuously.
Average data rate 6.6G over 50 seconds = 132MB/sec.
test28:/archive# pstat -s
Device 1K-blocks Used Avail Capacity Type
/dev/da1s1b 16777088 6546900 10230188 39% Interleaved
So that essentially proves that it's doing something real.
Next up I am going to work on some write clustering and ultra-stale
data recycling.
-Matt