
Re: [Tux3] Comparison to Hammer fs design


From: Matthew Dillon <dillon@xxxxxxxxxxxxxxxxxxxx>
Date: Sun, 27 Jul 2008 14:31:50 -0700 (PDT)

:When a forward commit block is actually written it contains a sequence
:number and a hash of its transaction in order to know whether the
:...
:Note: I am well aware that a debate will ensue about whether there is
:any such thing as "acceptable risk" in relying on a hash to know if a
:commit has completed.  This occurred in the case of Graydon Hoare's
:Monotone version control system and continues to this day, but the fact
:is, the cool modern version control systems such as Git and Mercurial
:now rely very successfully on such hashes.  Nonetheless, the debate
:will keep going, possibly as FUD from parties who just plain want to
:use some other filesystem for their own reasons.  To quell that
:definitively I need a mount option that avoids all such commit risk,
:perhaps by providing modest sized journal areas salted throughout the
:volume whose sole purpose is to record log commit blocks, which then
:are not forward.  Only slightly less efficient than forward logging
:and better than journalling, which has to seek far away to the journal
:and has to provide journal space for the biggest possible journal
:transaction as opposed to the most commit blocks needed for the largest
:possible VFS transaction (probably one).

    Well, you've described both sides of the debate quite well.  I for
    one am in the 0-risk camp.  ReiserFS got into trouble depending on
    a collision space (albeit a small one).  I did extensive testing of
    64 bit collision spaces when I wrote Diablo (a USENET news system
    contemporary with INN, over 10 years ago) and even with a 64 bit space
    collisions could occur quite often simply due to the volume of article
    traffic.

    I think the cost can be reduced to the point where there's no need
    to allow any risk at all.  After all, we're only talking about meta-data
    updates here for the most part.  Even under heavy loads it can take
    30 seconds for enough dirty meta-data to build up to require a flush.
    One can flush the mess, then seek/update the volume header.

    Or one could do something like use a fixed LOG area, allowing it to be
    preformatted... that would remove 100% of the risk and not require
    a volume header update (drafting note: edited up from further below).

    I know volume headers seem old-school.  It was a quick and dirty
    solution in HAMMER that I will be changing.
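
    For reference, a minimal sketch of the hash-checked commit block scheme
    described in the quoted text above.  The layout and names are hypothetical
    (not Tux3's actual format); the point is only that replay trusts a forward
    commit block when both its sequence number and the hash of its transaction
    body check out:

        #include <stdint.h>
        #include <stddef.h>

        struct commit_block {
            uint64_t sequence;      /* monotonically increasing commit number */
            uint64_t body_blocks;   /* number of blocks in the transaction body */
            uint64_t body_hash;     /* hash of the body as it was written */
        };

        /* placeholder 64-bit FNV-1a; a real design would pick a stronger hash */
        static uint64_t hash64(const void *data, size_t len)
        {
            const uint8_t *p = data;
            uint64_t h = 0xcbf29ce484222325ULL;

            while (len--) {
                h ^= *p++;
                h *= 0x100000001b3ULL;
            }
            return h;
        }

        /* non-zero if the transaction body was completely written before a crash */
        static int commit_is_valid(const struct commit_block *cb,
                                   const void *body, size_t body_len,
                                   uint64_t expected_sequence)
        {
            if (cb->sequence != expected_sequence)
                return 0;   /* stale or never-written commit block */
            return hash64(body, body_len) == cb->body_hash;
        }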

:Actually, the btree node images are kept fully up to date in the page
:cache which is the only way the high level filesystem code accesses
:them.  They do not reflect exactly what is on the disk, but they do
:reflect exactly what would be on disk if all the logs were fully
:rolled up ("replay").

    Ok, this is simpler.  Hmm.  That brings up a question with regard
    to data allocations and the log.  Are you going to use a fixed log
    space and allocate the data store separately, or is the log a
    continuously running pointer across the disk?

    My understanding is it is a continuously running space.

:A forward log that carries the edits to some dirty cache block pins
:that dirty block in memory and must be rolled up into a physical log
:before the cache block can be flushed to disk.  Fortunately, such a
:rollup requires only a predictable amount of memory: space to load
:enough of the free tree to allocate space for the rollup log, enough
:...

    Ok, here I spent about 30 minutes constructing a followup but then
    you answered some of the points later on, so what I am going to do
    is roll it up into a followup on one of your later points :-)

:One traditional nasty case that becomes really nice with logical
:forward logging is truncate of a gigantic file.  We just need to
:commit a logical update like ['resize', inum, 0] then the inode data
:truncate can proceed as convenient.  Another is orphan inode handling
:where an open file has been completely unlinked, in which case we
:log the logical change ['free', inum] then proceed with the actual
:delete when the file is closed or when the log is replayed after a
:surprise reboot.
:...
:Logical log replay is not idempotent, so special care has to be taken
:on replay to ensure that specified changes have not already been

    Wait, it isn't?  I thought it was.  I think it has to be, because
    the related physical B-Tree modifications can be unbounded, and
    because physical B-Tree modifications are occurring in parallel the
    related physical operations cause the logical operations to become
    bound together, meaning the logical ops *cannot* be independently
    backed out.

    Here's an example:

    rm a/b/c	<---- 1TB file
    rmdir a/b
    rmdir a

    There are three problems here.

    First, the logical log must be idempotent, or at least mostly
    idempotent, because the physical portions of the three separate
    logical transactions will complete at different times.

    Second, the logical log entry for "rm a/b/c" cannot be destroyed
    (due to log cycling based on available free space) until after the
    related physical operation has completed, which could occur seconds
    to minutes later.

    The second issue could become a big problem for you because it means
    the log space required will become unbounded or nearly unbounded.
    If the remove requires tens to hundreds of megabytes of physical log
    entries, all appending to the log, and the physical blocks backing
    the log cannot be recovered because that 'rm a/b/c' needs to remain
    in the log until the physical operation is complete, then I think you
    have a problem.  You would probably need to special-case it.

    A third issue affecting the amount of free space required to complete
    an operation is going to be the worst-case physical log space required
    for a single low-level B-Tree transaction, which will be roughly
    proportional to the depth of the B-Tree, multiplied by the number of
    modifying B-Tree operations being done in parallel (e.g. from many
    individual processes).
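
    To make the interaction concrete, here is an illustrative sketch (not
    Tux3's actual format) of a logical record that carries the sequence number
    of the oldest record still needing replay.  The 'rm a/b/c' record keeps
    the log tail (the caboose) pinned until its physical B-Tree deletions have
    been committed, which is exactly where the unbounded log space comes from:

        #include <stdint.h>

        enum logical_op { LOG_UNLINK, LOG_RMDIR, LOG_FREE_INODE };

        struct logical_record {
            uint64_t sequence;      /* this record's position in the log */
            uint64_t oldest_live;   /* caboose: oldest record still needed */
            enum logical_op op;
            uint64_t dir_inum;      /* parent directory, where applicable */
            uint64_t target_inum;   /* inode being unlinked/removed/freed */
        };

        /*
         * The caboose may only advance past a record once the physical work
         * it implies (e.g. deleting every block of a terabyte-sized file)
         * has been committed, so a huge unlink holds the tail in place while
         * megabytes of physical log append ahead of it.
         */
        static uint64_t advance_caboose(uint64_t caboose,
                                        const struct logical_record *rec,
                                        int physical_work_committed)
        {
            if (rec->sequence == caboose && physical_work_committed)
                return caboose + 1;
            return caboose;
        }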

:...
:ugly or unreliable when there are lot of stacked changes.  Instead I
:introduce the rule that a logical change can only be applied to a known
:good version of the target object, which promise is fullfilled via the
:physical logging layer.
:...
:then apply the logical changes to it there.  Where interdependencies
:exist between updates, for example the free tree should be updated to
:reflect a block freed by merging two btree nodes, the entire collection
:of logical and physical changes has to be replayed in topologically
:sorted order, the details of which I have not thought much about other
:than to notice it is always possible.

    I came up with a really simple solution to the freemap-reuse issue
    which you could apply to allow such replays to be idempotent.  You
    simply do not allow a freed block to be reallocated until after all
    the information related to the logical operation that freed it has
    been solidly committed.  Thus you can replay the low level frees
    without accidentally re-freeing a block that has been reallocated.

    In HAMMER I delay reuse of freed blocks because UNDO's are only for
    meta-data, not file data, and thus freed data blocks cannot be reused
    until after the related freeing operation has been completely
    committed so crash recovery can do the rollback.  You could use them
    to make otherwise non-idempotent operations idempotent.
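
    A minimal sketch of that delayed-reuse rule, with hypothetical names:
    freed blocks are parked on a list tagged with the transaction id that
    freed them and only handed back to the allocator once everything up to
    that id is durably committed, so replaying the free a second time can
    never clobber a reallocated block:

        #include <stdint.h>
        #include <stddef.h>

        struct delayed_free {
            struct delayed_free *next;
            uint64_t block;          /* block number that was freed */
            uint64_t freeing_tid;    /* transaction that performed the free */
        };

        /* move entries whose freeing transaction is on stable media back
         * into the freemap; later entries stay parked */
        static struct delayed_free *
        release_delayed_frees(struct delayed_free *list,
                              uint64_t last_committed_tid,
                              void (*return_to_freemap)(uint64_t block))
        {
            struct delayed_free **prevp = &list;
            struct delayed_free *df;

            while ((df = *prevp) != NULL) {
                if (df->freeing_tid <= last_committed_tid) {
                    *prevp = df->next;
                    return_to_freemap(df->block);
                    /* the caller owns df's memory; freeing is omitted here */
                } else {
                    prevp = &df->next;
                }
            }
            return list;
        }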

:When replay is completed, we have a number of dirty cache blocks which
:are identical to the unflushed cache blocks at the time of a crash,
:and we have not yet flushed any of those to disk.  (I suppose this gets
:interesting and probably does need some paranoid flushing logic in
:replay to handle the bizarre case where a user replays on a smaller
:memory configuration than they crashed on.)  The thing is, replay
:returns the filesystem to the logical state it was in when the crash
:happened.  This is a detail that journalling filesystem authors tend
:to overlook: actually flushing out the result of the replay is
:pointless and only obscures the essential logic.  Think about a crash
:during replay, what good has the flush done?

    Well, you have to flush at some point anyway because you only have
    so much kernel memory to play with.... You do not have enough kernel
    memory to cache the dirty buffers for all the physical activity related
    to a particular logical operation (such as truncating a 1TB file).

:I do not see why this example cannot be logically logged in pieces:
:
:   ['new', inum_a, mode etc] ['link', inum_parent, inum_a, "a"]
:   ['new', inum_b, mode etc] ['link' inum_a, inum_b, "b"]
:   ['new', inum_c, mode etc] ['link' inum_a_b, inum_c, "c"]
:   ['new', inum_x, mode etc] ['link' inum_a_b, inum_x, "x"]
:   ['new', inum_d, mode etc] ['link' inum_a_b_c, inum_d, "d"]
:
:Logical updates on one line are in the same logical commit.  Logical
:allocations of blocks to record the possibly split btree leaves and new
:allocations omitted for clarity.  The omitted logical updates are
:bounded by the depth of the btrees.  To keep things simple, the logical
:log format should be such that it is impossible to overflow one commit
:block with the updates required to represent a single vfs level
:transaction.

    Hmm.  Maybe.  This helps you identify dependencies that must be
    replayed, but it might be easier just to make the entire logical
    log idempotent and replay everything.  I think the only gotcha
    there is handling link counts.  That would still be far less complex
    than the alternative.

:I suspect there may be some terminology skew re the term "transaction".
:Tux3 understands this as "VFS transaction", which does not include
:trying to make an entire write (2) call atomic for example, but only
:such things as allocate+write for a single page cache page or
:allocate+link for a sys_link call.  Fsync(2) is not a transaction but
:a barrier that Tux3 is free to realize via multiple VFS transactions,
:which the Linux VFS now takes care of pretty well now after many years
:of patching up the logic.
    
    I kinda mix them up.  I think of them as transactions at different
    levels of granularity.

:...
:the resulting blocks are logged logically, but linked into the parent
:btree block using a logical update stored in the commit block of the
:physical log transaction.  How cool is that?

    That is definitely cool.

:transactions are not just throwaway things, they are the actual new
:data.  Only the commit block is discarded, which I suppose will leave
:a lot of one block holes around the volume, but then I do not have to
:require that the commit block be immediately adjacent to the body of
:the transaction, which will allow me to get good value out of such
:holes.  On modern rotating media, strictly linear transfers are not
:that much more efficient than several discontiguous transfers that all
:land relatively close to each other.

    True enough.  Those holes will create significant fragmentation
    once you cycle through available space on the media, though.

:>     So your crash recovery code will have to handle
:>     both meta-data undo and completed and partially completed transactions.
:
:Tux3 does not use undo logging, only redo, so a transaction is complete
:as soon as there is enough information on durable media to replay the
:redo records.

    Ok, that works.  Be careful with your physical logging; it can get
    expensive.  I think it may be more expensive than you are currently
    estimating.

    One of the reasons why HAMMER uses an UNDO log is that it is a lot
    less expensive than a REDO log.  I'll show you why:

    Physical operations related to a single flush group in HAMMER:

	modify element in B-Tree node A	(create UNDO w/ original contents of
					 B-Tree node A)
	modify element in B-Tree node A	(no new UNDO needs to be created)
	modify element in B-Tree node A	(no new UNDO needs to be created)
	modify element in B-Tree node A	(no new UNDO needs to be created)
	modify element in B-Tree node A	(no new UNDO needs to be created)

    (NOTE: I do realize that the REDO log can be compressed just as the
    UNDO one can, by recording actual data shifts from B-Tree insertions
    and deletions.  It gets really complex when you do that, though, and
    would not be idempotent.)

    See the advantage?  Many B-Tree operations working on the same node,
    or revisiting that node later on, within the same flush group, do not
    have to lay down more than one UNDO record for that B-Tree node.

    The physical ops are modifying already modified space so I don't
    have to record each op in the UNDO log.  Upon crash recovery the whole
    mess is undone by that first UNDO record.

    I just run the UNDOs backwards, which allows me to cache the in-flush
    undo areas with a fixed set of small throw-away tracking structures
    in system memory.  If a structure gets thrown away it just means an
    extraneous UNDO record will be appended to the UNDO FIFO.
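
    A rough sketch of that throw-away tracking (illustrative only, not
    HAMMER's actual structures): remember which node offsets already have an
    UNDO record in the current flush group and skip generating another.
    Losing an entry is harmless; it just costs one redundant UNDO record:

        #include <stdint.h>
        #include <string.h>

        #define UNDO_TRACK_SLOTS 1024

        struct undo_tracker {
            uint64_t offsets[UNDO_TRACK_SLOTS];     /* 0 = empty slot */
        };

        static void undo_tracker_reset(struct undo_tracker *t)
        {
            memset(t, 0, sizeof(*t));               /* start of a flush group */
        }

        /* returns 1 if an UNDO record must be laid down for this node */
        static int undo_needed(struct undo_tracker *t, uint64_t node_offset)
        {
            uint64_t slot = (node_offset >> 6) % UNDO_TRACK_SLOTS;

            if (t->offsets[slot] == node_offset)
                return 0;       /* already covered in this flush group */
            t->offsets[slot] = node_offset; /* may evict; only costs extra UNDO */
            return 1;
        }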

:Yes.  Each new commit records the sequence number of the oldest commit
:that should be replayed.  So the train lays down track in front of
:itself and pulls it up again when the caboose passes.  For now, there
:is just one linear sequence of commits, though I could see elaborating
:that to support efficient clustering.

    Yah.  I see.  Note the issue I brought up earlier about the
    "rm a/b/c": locking the caboose of your log until all related physical
    operations have been completed could create a problem.

:One messy detail: each forward log transaction is written into free
:space wherever physically convenient, but we need to be sure that that
:free space is not allocated for data until log rollup has proceeded
:past that transaction.  One way to do this is to make a special check
:against the list of log transactions in flight at the point where
:extent allocation thinks it has discovered a suitable free block, which
:is the way ddsnap currently implements the idea.  I am not sure whether
:I am going to stick with that method for Tux3 or just update the disk
:image of the free tree to include the log transaction blocks and
:somehow avoid logging those particular free tree changes to disk.  Hmm,
:a choice of two ugly but workable methods, but thankfully neither
:affects the disk image.

    You might be able to adapt the delayed-reuse concept I used in HAMMER
    to simplify the problem, possibly to the point of making those 
    operations replayable.

    Another thing I did in HAMMER was reserve space on the frontend for 
    data blocks without modifying the freemap at all.  This allows the
    frontend to write file data blocks to the physical media without
    having to wait for synchronization with meta-data or the backend or
    anything.

    I keep track of the reserved space with small in-memory structures
    (whose function is to prevent other allocations from using the reserved
    space until it can be committed).  Freemap updates are then committed
    in the same flush that commits the related B-Tree insertions and
    deletions.  Thus the freemap is kept in sync.
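
    In sketch form (hypothetical names, not the actual HAMMER structures), a
    reservation just records a block range in memory so the allocator stays
    away from it until the flush group that consumes it commits the freemap
    update:

        #include <stdint.h>
        #include <stddef.h>

        struct space_reservation {
            struct space_reservation *next;  /* list of active reservations */
            uint64_t base_block;
            uint32_t nblocks;
            uint32_t flush_group;            /* released after this group commits */
        };

        /* the allocator must skip any candidate range overlapping a reservation */
        static int range_is_reserved(const struct space_reservation *list,
                                     uint64_t block, uint32_t nblocks)
        {
            for (; list != NULL; list = list->next) {
                if (block < list->base_block + list->nblocks &&
                    list->base_block < block + nblocks)
                    return 1;
            }
            return 0;
        }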

:...
:>     I also noticed that very large files also wind up with multiple versions
:>     of the inode, such as when writing out a terabyte-sized file.
:
:Right, when writing the file takes longer than the snapshot interval.

    Yah.  It isn't as big a deal when explicit snapshots are taken.  It
    is an issue for HAMMER because the history is more of a continuum.
    Effectively the system syncer gives us a 30-60 second granular snapshot
    without having to lift a finger (which incidentally also makes having
    an 'undo' utility feasible).

:So the only real proliferation is the size/mtime attributes, which gets
:back to what I was thinking about providing quick access for the
:"current" version (whatever that means).

    st_size is the one thing we can't get away from.  For HAMMER,
    mtime updates don't roll new inodes and both mtime and atime are
    locked to the ctime when accessed via a snapshot (so one can use
    a (tar | md5) as a means of validating the integrity of the snapshot).

:Note: one slightly bizarre property of this scheme is that an inode
:can be reused in any version in which its link count is zero, and the
:data blocks referenced by different versions in the same file index
:can be completely unrelated.  I doubt there is a use for that.

    It might complicate mirroring / clustering-style operations, but
    you may not be targeting that sort of thing.  HAMMER is intended to
    become a cluster filesystem so I use 64 bit inode numbers which are
    never reused for the entire life of the filesystem.

:Ext2 dirents are 8 bytes + name + round up to 4 bytes, very tough to
:beat that compactness.  We have learned through bitter experience that
:anything other than an Ext2/UFS style physically stable block of
:dirents makes it difficult to support NFS telldir cookies accurately
:because NFS vs gives us only a 31 bit cookie to work with, and that is
:not enough to store a cursor for, say, a hash order directory
:traversal.  This is the main reason that I have decided to go back to
:basics for the Tux3 directory format, PHTree, and make it physically
:stable.

    The cookies are 64 bits in DragonFly.  I'm not sure why Linux would
    still be using 32 bit cookies; file offsets are 64 bits, so you
    should be able to use 64 bit cookies.

    For NFS in DragonFly I use a 64 bit cookie where 32 bits is a hash key
    and 32 bits is an iterator to deal with hash collisions.  Poof,
    problem solved.
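
    In sketch form, that cookie packing is simply the following (the exact
    bit assignment here is illustrative):

        #include <stdint.h>

        /* high 32 bits: directory hash key; low 32 bits: collision iterator */
        static uint64_t make_dir_cookie(uint32_t hash_key, uint32_t iterator)
        {
            return ((uint64_t)hash_key << 32) | iterator;
        }

        static void split_dir_cookie(uint64_t cookie,
                                     uint32_t *hash_key, uint32_t *iterator)
        {
            *hash_key = (uint32_t)(cookie >> 32);
            *iterator = (uint32_t)cookie;
        }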
    
:In the PHTree directory format lookups are also trivial: the directory
:btree is keyed by a hash of the name, then each dirent block (typically
:one) that has a name with that hash is searched linearly.  Dirent block
:pointer/hash pairs are at the btree leaves.  A one million entry
:directory has about 5,000 dirent blocks referenced by about 1000 btree
:leaf blocks, in turn referenced by three btree index blocks (branching
:factor of 511 and 75% fullness).  These blocks all tend to end up in
:the page cache for the directory file, so searching seldom references
:the file index btree.

    That seems reasonable.  I'll look up the PHTree stuff; it sounds
    interesting.  Personally speaking I don't mind eating 64 bytes per
    directory entry.  I can accommodate file names up to 16 bytes in
    the B-Tree element itself, so 95% of the namespace doesn't even
    require a look-aside data reference.

    The one mistake I made, which I didn't have time to fix in the
    first HAMMER release, is that my directory hash is a standard CRC,
    which means that records in the B-Tree aren't even close to being
    in any sort of lexical order.  I will probably wind up changing
    the hash calculation a bit to get some lexical sorting information
    stuffed into it... should be doable since the hash keys are 64 bits.
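
    One way such a hash could be arranged (purely illustrative, not the hash
    HAMMER eventually adopted): put a few leading characters in the top bits
    for coarse lexical ordering, a conventional hash below that, and leave
    the low bits free for a collision iterator:

        #include <stdint.h>
        #include <stddef.h>

        /* 32-bit FNV-1a as a stand-in for the conventional hash */
        static uint32_t name_hash32(const char *name, size_t len)
        {
            uint32_t h = 2166136261u;

            while (len--) {
                h ^= (uint8_t)*name++;
                h *= 16777619u;
            }
            return h;
        }

        static uint64_t lexical_dir_hash(const char *name, size_t len)
        {
            uint64_t key = 0;
            size_t i;

            /* bits 63..40: first three bytes of the name, zero padded,
             * giving records a coarse lexical ordering in the B-Tree */
            for (i = 0; i < 3; i++)
                key = (key << 8) | (i < len ? (uint8_t)name[i] : 0);

            /* bits 39..8: hash of the whole name; bits 7..0 are left free
             * for an iterator that resolves collisions */
            return (key << 40) | ((uint64_t)name_hash32(name, len) << 8);
        }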

:will pass Hammer in lookup speed for some size of directory because of
:the higher btree fanout.  PHTree starts from way behind courtesy of the
:32 cache lines that have to be hit on average for the linear search,
:amounting to more than half the CPU cost of performing a lookup in a
:million element directory, so the crossover point is somewhere up in
:the millions of entries.

    Well, the B-Tree fanout isn't actually that big a deal.  Remember
    HAMMER reblocks B-Tree nodes along with everything else.  B-Tree
    nodes in a traversal are going to wind up in the same 16K filesystem
    block.

    That leaves just element space overhead, which frankly is not much
    of an issue on a production system because kernels have to cache a lot
    more than just the directory entry.  The in-memory vnode and inode
    structures take up a lot more space than the cached directory entry.
    So even though HAMMER's directory entry is a 64-byte B-Tree element
    and roughly double the overhead of a UFS-style directory entry, the
    overall cache footprint in the system is only ~20% worse.
    
:..
:>
:>     Beyond that the B-Tree is organized by inode number and file offset.
:>     In the case of a directory inode, the 'offset' is the hash key, so
:>     directory entries are organized by hash key (part of the hash key is
:>     an iterator to deal with namespace collisions).
:>
:>     The structure seems to work well for both large and small files, for
:>     ls -lR (stat()ing everything in sight) style traversals as well as
:>     tar-like traversals where the file contents for each file is read.
:
:That is really sweet, provided that you are ok with the big keys.  It
:was smart to go with that regular structure, giving you tons of control
:over everything and get the system up and running early, thus beating the
:competition.  I think you can iterate from there to compress things and
:be even faster.  Though if I were to presume to choose your priorities
:for you, it would be to make the reblocking optional.

    Yes, my thoughts exactly.  Using a fairly large but extremely flexible
    B-Tree element chopped the code complexity down by a factor of 4 at
    least.  If I hadn't done that HAMMER would have taken another year
    to finish.

    I plan on expanding efficiency mainly by making the data references
    fully extent-based, with each B-Tree element referencing an extent.
    In fact, read()s can already handle arbitrary extents but write
    clustering and historical access issues forced me to use only two
    block sizes for writes (16K and 64K).  A HAMMER B-Tree element
    is theoretically capable of accessing up to 2G of linear storage.

:...
:>     that the B-Tree can seek to directly.  In fact, at one point I tried
:>     indexing on delete_tid instead of create_tid, but created one hell of a
:>     mess in that I had to physically *move* B-Tree elements being deleted
:>     instead of simply marking them as deleted.
:
:This is where I do not quite follow you.  File contents are never
:really deleted in subsequent versions, they are just replaced.  To
:truncate a file, just add or update size attribute for the target
:version and recover any data blocks past the truncate point that
:belongs only to the target version.

    Truncation is a deletion.  That is, the B-Tree elements have to
    be marked as having been deleted so if you truncate-extend a file
    you see the new space as being full of zeros.

    Take a 1MB file.  Truncate it to 512K, then truncate-extend the
    file back to 1MB.  Or seek-write to extend the file...
    same difference, really.
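
    Put another way, here is a sketch based on the create_tid/delete_tid
    description in this thread (not the exact HAMMER code): an element is
    visible as of a given transaction id only if it was created at or before
    it and not yet deleted, so truncation must mark the covered elements
    deleted for a later truncate-extend to correctly read back zeros:

        #include <stdint.h>

        struct btree_elm_times {
            uint64_t create_tid;
            uint64_t delete_tid;    /* 0 = not (yet) deleted */
        };

        static int elm_visible(const struct btree_elm_times *e, uint64_t asof)
        {
            if (e->create_tid > asof)
                return 0;                   /* created in the future */
            if (e->delete_tid != 0 && e->delete_tid <= asof)
                return 0;                   /* already deleted as of 'asof' */
            return 1;
        }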

:Dirent deletion is handled by updating the dirent block, possibly
:versioning the entire block.  Should the entire block ever be deleted
:by a directory repacking operation (never actually happens on the vast
:majority of Linux systems) that will be handled as a file truncate.
:
:I see that in Hammer, the name is deleted from the btree instead, so
:fair enough, that is actual deletion of an object.  But it could be
:handled alternatively by adding a "this object is not here" object,
:which might be enough to get rid of delete_tid entirely and just have
:create_tid.

    Hmm.  I think the delete_tid is less complex.  I'm not really
    concerned about saving 8 bytes in the B-Tree element.  I don't 
    want to make it bigger than the 64 bytes it already is, but neither
    am I going to add whole new subsystems just to get rid of 8 bytes.

:I plan to scan the whole btree initially, which is what we do in ddsnap
:and it works fine.  But by looking at the mtime attributes in the inode
:I can see in which versions the file data was changed and thus not have
:to scan most files.  Eventually, building in some accelerator to be able
:to skip most inode leaves as well would be a nice thing to do.  I will
:think about that in the background.

    Yah, that makes sense.  You won't have to scan every last tidbit of
    information.  HAMMER would have had to if I had not added the
    propagating transaction id feature to it.

    You are still going to have a scalability issue, but it sounds like
    you could implement the exact same optimization and completely solve
    the problem.
 
:>     You may want to consider something similar.  I think using the
:>     forward-log to optimize incremental mirroring operations is also fine
:>     as long as you are willing to take the 'hit' of having to scan (though
:...
:
:There is a slight skew in your perception of the function of forward
:logging here, I think.  Forward logs transactions are not long-lived
:disk objects, they just serve to batch together lots of little updates
:into a few full block "physical" updates that can be written to the
:disk in seek-optimized order.  It still works well for this application.

    Yes, I understand now that you've expanded on that topic.

:>     The way it works is that HAMMER's frontend is almost entirely
:>     disconnected from the backend.  All meta-data operations are cached
:>     in-memory.  create, delete, append, truncate, rename, write...
:>     you name it.  Nothing the frontend does modifies any meta-data
:>     or mata-data buffers.
:
:Things are a little different in Linux.  The VFS takes care of most of
:what you describe as your filesystem front end, and in fact, the VFS is
:capable of running as a filesystem entirely on its own, just by
:supplying a few stub methods to bypass the backing store (see ramfs).
:I think this is pretty much what you call your front end, though I
:probably missed some major functionality you provide there that the
:VFS does not directly provide.

    Well, for file data, up to a point, and HAMMER does just use the 
    kernel's buffer cache for file data.

    Not for directory entries.  I'm pretty sure Linux uses a throw-away
    namecache just like we do, so unless it has a mechanism for flush
    sequencing of dirty namecache entries you need in-VFS structural
    support for caching the operations.

    In particular, caching a namespace deletion (rm, rename, etc)
    without touching the meta-data requires implementing a merged lookup
    so you can cancel-out the directory entry that still exists on-media.
    So it isn't entirely trivial.

    Things related to file data are handled by the kernel for the most
    part, but truncations still need considerable support within the
    VFS to properly handle seek-write or truncate file extensions.
    In order to avoid touching any meta-data when handling a truncation,
    a similar canceling field in the in-memory inode needs to be
    maintained until the backend is able to delete the related data
    blocks.  Lookups have to take into account the cached truncation offset
    so as not to read data from the media past the truncation point (e.g.
    if you seek-write-extend or truncate-extend the file immediately
    after truncating the file).
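
    A sketch of that cached-truncation idea (hypothetical names): the
    in-memory inode remembers the pending truncation offset until the backend
    has deleted the on-media data, and the read path clamps against it so an
    extended file never returns stale media data from beyond the truncation
    point:

        #include <stdint.h>
        #include <stddef.h>
        #include <string.h>

        struct inmem_inode {
            uint64_t size;            /* current logical file size */
            uint64_t trunc_offset;    /* pending truncation, ~0ULL = none */
        };

        /* media data is only believed up to the pending truncation point;
         * everything past it must read back as zeros */
        static void read_with_pending_truncation(const struct inmem_inode *ip,
                                                 uint64_t off, void *buf,
                                                 size_t len,
                                                 void (*media_read)(uint64_t,
                                                                    void *,
                                                                    size_t))
        {
            size_t from_media = 0;

            if (off < ip->trunc_offset) {
                uint64_t avail = ip->trunc_offset - off;
                from_media = avail < len ? (size_t)avail : len;
            }
            if (from_media)
                media_read(off, buf, from_media);
            memset((char *)buf + from_media, 0, len - from_media);
        }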

    Most filesystems will dirty meta-data buffers related to the media
    storage as part of executing an operation on the frontend.  I don't
    know of any which have the level of separation that HAMMER has.

:Note: no other filesystem on Linux current works this way.  They all
:pretty much rely on the block IO library which implements the fairly
:goofy one-block-at-a-time transfer regimen.  The idea of setting up bio
:transfers directly from the filesystem is unfortunately something new.
:We should have been doing this for a long time, because the interface
:is quite elegant, and actually, that is just what the block IO library
:is doing many levels down below.  It is time to strip away some levels
:and just do what needs to be done in the way it ought to be done.

    Jeeze, BSD's have been doing that forever.  That's what VOP_BMAP is
    used for.  I'm a little surprised that Linux doesn't do that yet.
    I'll expand on that down further.

:...
:>     the media.  The UNDO *is* flushed to the media, so the flush groups
:>     can build a lot of UNDO up and flush it as they go if necessary.
:
:Hmm, and I swear I did not write the above before reading this paragraph.
:Many identical ideas there.

    Yah, I'm editing my top-side responses as I continue to read down.
    But I'm gleaning a lot of great information from this conversation.
    
:>     for fsync() operations... those would go much faster using a forward
:>     log then the mechanism I use because only the forward-log would have
:>     to be synchronized (not the meta-data changes) to 'commit' the work.
:>     I like that, it would be a definite advantage for database operations.
:
:I think you will like it even more when you realize that updating the
:filesystem header is not required for most transactions.

    A definite advantage for a quick write()/fsync() like a database might
    do.  I might be able to implement a forward-log for certain logical
    operations.

    Due to my use of a fixed UNDO FIFO area I can pre-format it and I can
    put a 64 bit sequence number at the head of each block.  There would be
    no chance of it hitting a conflict.  Hmm.  Yes, I do think I could get
    rid of the volume header update without too much effort.
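
    In sketch form (hypothetical layout): with a fixed, preformatted UNDO FIFO
    whose blocks each begin with a 64-bit sequence number, recovery just scans
    forward and stops at the first break in the sequence, with no volume
    header update needed to find the end of the log:

        #include <stdint.h>

        struct undo_fifo_block_head {
            uint64_t sequence;     /* strictly increasing across the FIFO */
            uint32_t bytes_used;   /* payload bytes valid in this block */
            uint32_t crc;          /* integrity check of the payload */
        };

        /* returns the index of the first block that does NOT continue the
         * sequence, i.e. the logical end of the FIFO after a crash */
        static uint32_t find_fifo_end(const struct undo_fifo_block_head *blocks,
                                      uint32_t nblocks, uint32_t start,
                                      uint64_t start_sequence)
        {
            uint32_t i = start;
            uint64_t expect = start_sequence;

            while (blocks[i].sequence == expect) {
                expect++;
                i = (i + 1) % nblocks;      /* the FIFO wraps around */
                if (i == start)
                    break;                  /* came full circle */
            }
            return i;
        }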

:A versioned extent:
:
:   struct extent { unsigned version:10, count:6, block:24; };
:
:Directory format to index the extent within a leaf:
:
:   struct entry { unsigned loglo:24, offset:8 };
:   struct group { unsigned loghi:24, count:8 };
:
:Leaf layout (abridged):
:
:   struct leaf { unsigned groups; struct group[]; struct entry[]; struct elements[] };
:...
:See, I am really obsessive when it comes to saving bits.  None of these
:compression hacks costs a lot of code or has a big chance of hiding a
:bug, because the extra boundary conditions are strictly local.
     
    Man, those are insanely small structures.  Watch out for the cpu
    and in-kernel-memory tracking overhead.

:  * The linux page cache not only caches data but provides an efficient
:    access path to logical blocks that avoids having to consult the
:    filesystem metadata repeatedly under heavy traffic.

    Actually, BSD buffer caches are logical caches, and always have been.
    Linux actually took a page from our book on that one.  In fact, if
    you do a google search I'll bet you will see my name in some of those
    Linux VM system debates 5-10 years ago :-)

    The typical BSD (Open, Net, Free, DragonFly, etc) buffer cache structure
    is a logically indexed entity which can also contain a cached physical
    translation (which is how the direct data bypass works).
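
    Roughly, the point is that a BSD-style buffer is indexed by (vnode,
    logical block) and can carry a cached physical translation, the VOP_BMAP
    result, so repeated I/O does not re-walk the filesystem metadata.  Field
    names here are illustrative, not the actual struct buf:

        #include <stdint.h>

        #define BLKNO_INVALID   ((int64_t)-1)

        struct cached_buf {
            void    *vp;        /* the vnode this buffer belongs to */
            int64_t  lblkno;    /* logical block number within the file */
            int64_t  blkno;     /* cached physical translation, or invalid */
            void    *data;
        };

        /* translate once via bmap, then reuse the cached physical block */
        static int64_t buf_physical(struct cached_buf *bp,
                                    int64_t (*bmap)(void *vp, int64_t lblkno))
        {
            if (bp->blkno == BLKNO_INVALID)
                bp->blkno = bmap(bp->vp, bp->lblkno);
            return bp->blkno;
        }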

:...
:>     Are you just going to leave them in the log and not push a B-Tree
:>     for them?=20
:
:It did not occur to me to leave them in the logical log, but yes that
:is a nice optimization.  They will only stay there for a while, until a
:rollup is triggered, moving them into the inode table proper.  It gets
:a little tricky to figure out whether the amortization due to the
:logical logging step is worth the extra write.  There is a crossover
:somewhere, maybe around a hundred bytes or so, ideal for symlinks and
:acls.

    I'm just going to throw this out.  In an earlier incarnation of the
    HAMMER design I had a forward log which embedded data, and the idea
    was that I would leave the data in-place in the log.  I was going to
    use a blockmap to turn the log into a sparse-map as a means of
    recovering space from it.

    I eventually threw away that idea but it occurs to me that you might
    be able to use a similar idea to compress small bits of meta-data.

:> :I might eventually add some explicit cursor caching, but various
:> :artists over the years have noticed that it does not make as much
:> :difference as you might think.
:>
:>      For uncacheable data sets the cpu overhead is almost irrelevant.
:
:For rotating media, true.  There is also flash, and Ramback...

    Yes, true enough.  Still only a 20% cache hit though, due to the many
    other in-memory structures involved (namecache, vnode, in-memory inode,
    etc).

:>     I think it will be an issue for people trying to port HAMMER.  I'm trying
:>     to think of ways to deal with it.  Anyone doing an initial port can
:>     just drop all the blocks down to 16K, removing the problem but
:>     increasing the overhead when working with large files.
:
:Do the variable sized page hack :-]

    Noooo!  My head aches.

:Or alternatively look at the XFS buffer cache shim layer, which
:Christoph manhandled into kernel after years of XFS not being accepted
:into mainline because of it.

    The nice advantage of controlling the OS code is I could change
    DragonFly's kernel to handle mixed block sizes in its buffer cache.
    But I still want to find a less invasive way in the HAMMER code
    to make porting to other OS's easier.

:Ext2 and Ext3) so going to extents is not optional.  I plan a limit of
:64 blocks per extent, cutting the metadata overhead down to 16K per
:gigabyte, which is hard to complain about.  Larger extents in the file
:index would not buy much and would mess up the nice, 8 byte
:one-size-fits-all versioned extent design.

    That is reasonable.  I don't plan on going past about a megabyte
    per extent myself, there's just no reason to go much bigger.

:>     Orphan inodes in HAMMER will always be committed to disk with a
:>     0 link count.  The pruner deals with them after a crash.  Orphan
:>     inodes can also be commited due to directory entry dependancies.
:
:That is nice, but since I want to run without a pruner I had better
:stick with the logical logging plan.  The link count of the inode will
:also be committed as zero, or rather, the link attribute will be
:entirely missing at that point, which allows for a cross check.

    It won't work for you if your logical log is meant to be short-lived.
    Programs can hold unlinked inodes open for hours, even days.  Even
    longer.  You would have to rewrite the entry in order to be able to
    move the caboose.  Well, that isn't a bad idea, just more complication.

:>     I do wish we had something like LVM on BSD systems.  You guys are very
:>     lucky in that regard.  LVM is really nice.
:
:But I thought you BSD guys have GEOM.  I think GEOM is really nice, and
:I think that Linux LVM has so many problems that I have started work on
:a replacement, LVM3.

    GEOM is fairly simplistic compared to LVM, but I'll keep your comments
    regarding LVM in mind (never having used it).  GEOM's claim to fame is
    its encryption layer.

    Maybe the ticket for DragonFly is to simply break storage down into
    a reasonable number of pieces, like cutting up a drive into 256 pieces,
    and create a layer to move and glue those pieces together into larger
    logical entities.

:>     BTW it took all day to write this!
:
:It took two days to write this ;-)
:
:I hope the fanout effect does not mean the next post will take four days.
:
:Regards,
:
:Daniel

    I'm editing down as much as I can :-)   This one took 6 hours, time to
    get lunch!

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


