DragonFly kernel List (threaded) for 2006-02
namecache coherency, 2nd turn
Hi,
Here is a revamp of the namecache coherency code (and nullfs).
I tried to follow the ideas we sort of agreed on in the thread "a take at
cache coherency", but real life sometimes wanted it otherwise...
- Diffstats:
kern/vfs_cache.c | 439 +++++++++++++++++++++++++++++++++++++++++++----
sys/namecache.h | 14 +
vfs/nullfs/null.h | 17 +
vfs/nullfs/null_vfsops.c | 193 ++++++++++----------
vfs/nullfs/null_vnops.c | 174 ++++++++++++++++--
5 files changed, 683 insertions(+), 154 deletions(-)
I.e., I'm not bothering with anything I shouldn't.
See the patch itself at
http://leaf.dragonflybsd.org/mailarchive/submit/2006-02/msg00021.html
- Locking: the code implements the ideas Corecode and Matt were
suggesting: groups are lockable via each member (O(1) locking); each
namecache structure carries a fallback copy of the locking-related
data, but a separate pointer determines which node's locking data is
actually used.
I didn't introduce a separate "shadowinfo" structure type: I simply
found no reason to partition the namecache structure, as it's still
true that each namecache node carries exactly one copy of this data.
The only new field (apart from the pointer/list fields related to
shadowing) is "nc_lockby", which points to the namecache node via
which the current node is lockable.
Locking is MPSAFE as far as I can tell, that is, these changes don't
introduce any more dependency on external serialization. As I
realized, Matt's nice "fallback copy" idea made it possible to
migrate the "lockpick" of a given node between two other nodes in a
way that is not visible to lock contenders. (So we don't need
those ugly interlocks, regardless of GIANT.)
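To make the lockby idea concrete, here is a rough sketch (not the
patch's code; "struct rawlock", raw_lock()/raw_unlock() and the snc_*
names are made up for illustration):

    /*
     * Sketch only: every node carries a fallback copy of the locking
     * data, and the lockby pointer selects which node's copy actually
     * serializes the whole shadow group.
     */
    struct rawlock { int placeholder; };     /* assumed lock primitive */
    void raw_lock(struct rawlock *);         /* assumed */
    void raw_unlock(struct rawlock *);       /* assumed */

    struct sketch_ncp {
        struct rawlock     snc_lock;      /* fallback copy in each node */
        struct sketch_ncp *snc_lockby;    /* node whose copy is in force */
    };

    /* Locking any group member is O(1): just follow snc_lockby. */
    static void
    sketch_cache_lock(struct sketch_ncp *ncp)
    {
        raw_lock(&ncp->snc_lockby->snc_lock);
    }

    static void
    sketch_cache_unlock(struct sketch_ncp *ncp)
    {
        raw_unlock(&ncp->snc_lockby->snc_lock);
    }

A self-contained node simply has its lockby pointer pointing at
itself.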
- Group layout: I did *not* follow Matt's circular list idea. A group
is still just a tree, with pointers toward parents (overlaid levels)
and a list of kids (overlays). The basic kind of traversal I needed is
walking the subtree identified by a node (its root), and I didn't
find circular lists helpful in this respect. The group tree is
traversed iteratively with the "moving edge" algorithm [ad hoc name],
i.e., the one which needs only fixed storage (<previous node, actual node>)
and goes over each node twice. (You'll see a bit of bloat: the
traversal routine supports some features I didn't end up using.
This can be trimmed if those extras remain unused.)
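For reference, a self-contained sketch of the "moving edge" walk
(tnode, down() and up() are made-up names, and the extra features
mentioned above are left out):

    #include <stddef.h>

    struct tnode {
        struct tnode *parent;
        struct tnode *first_child;
        struct tnode *next_sibling;
    };

    /*
     * The whole state is the edge <prev, cur>; each node in the
     * subtree rooted at 'root' is touched twice, once on the way
     * down (down()) and once on the way back up (up()).
     */
    static void
    moving_edge_walk(struct tnode *root,
                     void (*down)(struct tnode *),
                     void (*up)(struct tnode *))
    {
        struct tnode *prev = root->parent;   /* the moving edge ...    */
        struct tnode *cur = root;            /* ... is <prev, cur>     */
        struct tnode *stop = root->parent;   /* done once we're back   */

        while (cur != stop) {
            if (prev == cur->parent) {
                /* arrived from above: first visit */
                down(cur);
                if (cur->first_child != NULL) {
                    prev = cur;
                    cur = cur->first_child;
                } else {
                    /* leaf: second visit right away, then climb */
                    up(cur);
                    prev = cur;
                    cur = cur->parent;
                }
            } else if (prev->next_sibling != NULL) {
                /* back from a child that has a sibling: step sideways */
                cur = prev->next_sibling;
                prev = cur->parent;
            } else {
                /* back from the last child: second visit, then climb */
                up(cur);
                prev = cur;
                cur = cur->parent;
            }
        }
    }

The walk terminates when cur steps back onto root's parent (or NULL
if root has none), so siblings of the subtree's root are never
touched.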
- Synchronization via groups: here is the greatest difference from what
we agreed on (or this is where my understanding went astray the furthest).
Basically, the point is that groups are for synchronizing locks, and
that negative events (more concretely, unresolution) are propagated
upwards (and that's all: groups are used for nothing else).
But within that, there are two alternatives:
- Unstable groups: negative events blow up groups.
- Stable groups: the namecache backend code *never* splits up groups
(except at the end of a node's life cycle, during zapping), and each
shadow association is expected to be torn at the layer where it was
made.
In http://leaf.dragonflybsd.org/mailarchive/kernel/2006-01/msg00117.html
Matt says:
% The shadow chain would NEVER be allowed to be stale. It could be
% partially unresolved, but the linkages cannot ever be stale (i.e.
% cannot ever be incorrect). This generally means that an unresolved
% node needs to be disconnected from the chain. The re-resolution of
% the node would reconnect it.
-- from which I conclude that he'd go for unstable groups. (Although it's
true that he doesn't specify exactly where the disconnection should
happen, so the contrary might just as well be true.)
I started out implementing unstable groups, but ended up with
stable groups, claiming that unstable groups are both unnecessary and
a dead end.
- Unnecessary: shadow associations can be torn just fine where they
were created: in the resolver routine. Propagating unresolution will
yield control back to the resolver routine, even if the group topology
is left intact; and then the resolver can start its work by tearing
the link, so that staleness (e.g., in rename) is avoided. (See the
resolver sketch after this list.)
- Dead end:
***
Feel free to skip this if you find the "unnecessary"
argument convincing enough.
***
When blowing up a group, each part will acquire a self-contained
locking state. It's the disconnector's responsibility to set those
correctly, and it can't divine the necessary state without hints.
Therefore I saw no other solution than keeping the local
locking data of each node up to date (even if it is ignored by the
cache_lock mechanism of groups), so that if a node becomes
disconnected and self-contained, it is just-in-time
equipped with the proper locking state.
This implied a very funky locking interface: e.g., a typical
null_nfoo() looked something like
...
ap->a_ncp = ncp->nc_shadowed;
cache_lock(ap->a_ncp);     /* extra "personal" lock on the lower node */
vop_nfoo_ap(ap);
cache_unlock(ap->a_ncp);
-- here the { ncp, ncp->nc_shadowed, ... } group *is* locked at the
beginning, but ncp->nc_shadowed has to get an extra "personal" lock
because the group might come apart during the vop_nfoo_ap() call-down,
and it has to have its personal lock state set as expected by the
lower layer. Etc. (I won't go into details regarding possible solutions
to the problems spawning at this point...)
And there is another reason for falling back on heavier locking: the
unstable model means that we usually can't dereference an nc_shadowed
member without locking, because someone else might destroy it at any
time. With the stable model, the setter of nc_shadowed can have a
more reliable preconception of the contents of this field
(because she will be the unsetter as well).
All in all, the unstable approach ended up in a programming model I
couldn't make consistent. Each fixed panic resulted in seven new
ones, like fighting the hydra :) With the stable approach,
post-split state synchronization is always a local issue which
can be handled trivially. With that, the causes of panics were
little more than typos.
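To make the "torn where it was made" point concrete, here is a sketch
of how a resolver can behave under the stable model (the helper names
null_lookup_lower(), cache_shadow_attach() and cache_shadow_detach()
are made up for illustration; they are not necessarily what the patch
uses):

    #include <errno.h>

    struct namecache;                          /* sys/namecache.h */

    /* illustrative helpers, not the patch's names */
    struct namecache *null_lookup_lower(struct namecache *ncp);
    int  cache_shadow_attach(struct namecache *ncp,
                             struct namecache *lower);
    void cache_shadow_detach(struct namecache *ncp);

    static int
    sketch_null_resolve(struct namecache *ncp)
    {
        struct namecache *lower;

        /*
         * Unresolution was propagated up with the group topology left
         * intact; control ends up here, in the layer that created the
         * shadow association.  Tearing our own link first is what
         * keeps a stale lower node (e.g. one renamed away under us)
         * from being reused.
         */
        cache_shadow_detach(ncp);

        /* re-resolve the matching lower layer entry ... */
        lower = null_lookup_lower(ncp);
        if (lower == NULL)
            return (ENOENT);

        /* ... and re-create the association at this layer */
        return (cache_shadow_attach(ncp, lower));
    }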
- Deadlocks. I thought over Corecode's concerns regarding deadlock
situations, and I think I succeeded in writing a deadlock-safe "shadow
attach" routine.
- Renaming. I rewrote cache_rename() for two reasons:
- I felt it's not deadlock safe: cache_inval(tncp) is called
with fncp locked, and cache_inval does a lot of locking and unlocking.
I didn't see any constraint which would keep us safe from
a deadlock on, e.g., fncp and some child of tncp... maybe it's just my
short-sightedness.
- When children are migrated, their overlays have to get
unresolved. This requires them to be locked (unlike the migration
itself). So the code had to be rewritten in a way that lets each
child get its fifteen minutes of fame (see the sketch after this list).
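Schematically, the per-child pass looks something like this
(first_child(), cache_shadow_inval_upwards() and migrate_child() are
invented names; the real cache_rename() obviously has more to it):

    struct namecache;                          /* sys/namecache.h */

    void cache_lock(struct namecache *ncp);
    void cache_unlock(struct namecache *ncp);

    /* invented helpers for the sketch */
    struct namecache *first_child(struct namecache *dir);
    void cache_shadow_inval_upwards(struct namecache *kid);
    void migrate_child(struct namecache *kid, struct namecache *tdir);

    /*
     * fncp and tncp stay locked in the caller; each child is locked
     * for its own little while so its overlays can be unresolved
     * before it is reparented.
     */
    static void
    sketch_rename_children(struct namecache *fncp, struct namecache *tncp)
    {
        struct namecache *kid;

        while ((kid = first_child(fncp)) != NULL) {
            cache_lock(kid);
            cache_shadow_inval_upwards(kid);  /* overlays get unresolved */
            migrate_child(kid, tncp);         /* now hangs under tncp    */
            cache_unlock(kid);
        }
    }

migrate_child() reparents the kid under tncp, so the loop terminates
once fncp has no children left.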
- Nullfs. Except for null_resolve(), each method is still just a simple
delegation downwards. Multilayer nullfs mounts and renaming seem to work
fine. You can still get a crash by unmounting midlayers.
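For contrast with the "funky" snippet above: with stable groups a
typical pass-through method can stay as simple as this (null_nfoo and
vop_nfoo_ap are the same placeholders as before; the structure layouts
below are only minimal stand-ins for the sketch):

    struct namecache {                /* only the field the sketch needs */
        struct namecache *nc_shadowed;
        /* ... the real structure has much more ... */
    };

    struct vop_nfoo_args {
        struct namecache *a_ncp;
        /* ... */
    };

    int vop_nfoo_ap(struct vop_nfoo_args *ap);

    static int
    sketch_null_nfoo(struct vop_nfoo_args *ap)
    {
        /*
         * Locking a_ncp already locked the whole group, nc_shadowed
         * included, so just point the call at the lower entry and go
         * down.
         */
        ap->a_ncp = ap->a_ncp->nc_shadowed;
        return (vop_nfoo_ap(ap));
    }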
I'm a bit confused about the purpose of kernel@ and submit@: is it OK
to do as I tend to, discussion on kernel@ and code on submit@? Or
should I keep the discussion on submit@ as well?
Csaba