DragonFly users List (threaded) for 2011-07
Re: Easy way to find identify files which share some content/blocks
On 2011-05-02, Justin Sherrill <justin@shiningsilence.com> wrote:
Hi Justin,
> You could dump out the B-tree information. I don't know how clear a
> picture would come from that, and it may require some massaging of
> data anyway since nonduplicated files may have some degree of
> matching, duplicated data anyway, especially when dealing with larger
> image files.
That's a bit beyond my current C programming skills, I guess, and a
little too much effort for this small cleanup project. Anyway, thanks
for the idea.
> If you are sure that the corruption lies at the end of the files, you
> could loop over the files, read the first x bytes of each, then MD5
> that data. Matching MD5 = matching file.
It mostly is at the end. This suggestion (partitioning files into
chunks) is what I had done so far (on Linux), first with a few lines of
shell (adapted from an existing script), then, due to the inherent
inefficiencies of that, in Python.
It takes a handful of lines: output "inode, chunkId, hash" to a file or
an SQL database, then go from there.
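
A minimal Python sketch of such a chunk-hashing pass might look like
this. It is not the original script: the 64 KiB chunk size and the
function name chunk_hashes are assumptions made for illustration.

```python
import hashlib
import os

CHUNK_SIZE = 64 * 1024  # assumed chunk size; not taken from the original post


def chunk_hashes(path, chunk_size=CHUNK_SIZE):
    """Yield (inode, chunk_id, md5_hex) for each fixed-size chunk of a file."""
    inode = os.stat(path).st_ino
    with open(path, "rb") as f:
        chunk_id = 0
        while True:
            block = f.read(chunk_size)
            if not block:  # end of file reached
                break
            yield inode, chunk_id, hashlib.md5(block).hexdigest()
            chunk_id += 1
```

Each yielded tuple can be written as one line of a CSV file or inserted
as one row into an SQL table. The last chunk of a file is usually
shorter than the others, so a truncated copy will differ from the
complete file in its final chunk.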
I had hoped HAMMER, as a deduplicating filesystem, had tools that could
easily give me that information without "hacks" like the above.
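
Without such a tool, the report asked for in the original post can be
approximated from per-chunk hashes. A rough sketch, assuming each file
has already been reduced to an ordered list of chunk hashes; the
function names are hypothetical, and the chunks are user-level
fixed-size chunks, not HAMMER's own deduplicated blocks:

```python
def shared_chunks(hashes_a, hashes_b):
    """Return (count, percent) of positions where both chunk hashes match.

    The comparison is position-aligned, which fits damage at the end of a
    file (truncation or trailing garbage). The percentage is relative to
    the longer of the two files.
    """
    same = sum(1 for a, b in zip(hashes_a, hashes_b) if a == b)
    total = max(len(hashes_a), len(hashes_b))
    percent = 100.0 * same / total if total else 0.0
    return same, percent


def report(name_a, hashes_a, name_b, hashes_b):
    """Format the result in the style requested in the original post."""
    same, percent = shared_chunks(hashes_a, hashes_b)
    return "File %s shares %d (%.1f%%) data blocks with file %s" % (
        name_a, same, percent, name_b)
```

Comparing every pair of files this way is quadratic in the number of
files, so for a large repository it probably makes sense to first group
files by the hash of their first chunk and only compare within a group.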
Regards
Thomas
> On Sun, May 1, 2011 at 2:39 PM, Thomas Keusch
><fwd+usenet-spam2011q2@bsd-solutions-duesseldorf.de> wrote:
>> Hello,
>>
>> now that DragonFly's HAMMER has got deduplication I ask myself if there
>> is a simple way to identify "pairs" or groups of files which share a lot
>> of data, i.e. are mostly identical.
>>
>> I have a rather large repository of downloaded pictures, which contain
>> a lot of dupes in multiple locations. I have no problems finding those
>> given some time and a shell prompt.
>>
>> I'm interested in identifying broken files. Broken in the sense that
>> A is an incomplete version of B (some bytes missing), or B a damaged
>> version of A (some additional bytes at the end).
>>
>> Is there a way to get to something like this:
>>
>> "File A shares 1234 (98.3%) data blocks with file B"
>> "File A shares xxxx (xx.x%) data blocks with file C"
>>
>> Getting a step closer helps too.
>>
>> Thanks for any insights.
>>
>>
>> Regards
>> Thomas
>>