11:31 pm, 9 Sep 08
gat, a git clone in haskell
I've been pretty busy with work lately, so I may as well dump this on the internet before it gets too dusty. Though I think I understand Git decently enough now, I've been curious about the internals. So I sat down to start reading and documenting, for example, the on-disk pack file format. (Git includes some technical docs but they're pretty thin.) But then I started writing some Haskell code for fun and to verify my understanding was correct and before I knew it I had 5% of a Git clone written in Haskell. That is, not a "porcelain" (a frontend on top of their utilities) but a separate implementation that works with the on-disk format.
So, gat. It can read the index, loose objects, and a bit of pack files; it has an implementation of the delta decoder so it can read delta-compressed objects out of the pack format; it understands how to convert
But it's been fun to play with, at least. Maybe I'll clean up those docs and post them too at some point.
A few notes from reading Git code:
So, gat. It can read the index, loose objects, and a bit of pack files; it has an implementation of the delta decoder so it can read delta-compressed objects out of the pack format; it understands how to convert
git-svn^^ into reading refs/remotes/git-svn, recursively parsing out its parent pointers, and decoding that; it can do Git-style color diffs between various trees and commits. And finally and most importantly: it does some mmap()ing but is not at all fast, it is incorrect in the pieces it has implemented, and it does no writing of any data structures whatsoever -- that is to say, nothing useful.But it's been fun to play with, at least. Maybe I'll clean up those docs and post them too at some point.
A few notes from reading Git code:
- Git is a jumble of random nearly-commentless code, full of globals and strange state and not at all clear control flows. On the other hand, it's also much more Unixy than the code I'm used to reading, doing all sorts of tricks like using mmap() instead of read() (because the latter just involves an extra copy, y'know?) and forking. I am simultaneously impressed and terrified of what's likely going on in my kernel.
- To align some disk-based data structure Git uses this code:
(offsetof(struct cache_entry,name) + (len) + 8) & ~7
That's an off-by-one error (gets eight bytes of padding where zero would do) that I'm a little puzzled by -- maybe it's intentional? Maybe it's remained for backwards compatibility? - This goto also makes no sense to me:
ret = ent->data; if (ret && ent->p == p && ent->base_offset == base_offset) goto found_cache_entry; return unpack_entry(p, base_offset, type, base_size); found_cache_entry:(Those are the only two instances offound_cache_entryin the file...) - Git contains at least three incompatible bit-packed integer encodings, found in pack file offsets, object headers, and the delta format.
- It's crazy to me that in 2008 people are still writing code that is manually passing around buffers and lengths and verifying by hand that the space works out.
lesscan take all sorts of useful flags, causing it to e.g. only page if necessary, allow color but not other control codes through, and more! When Git spawns less as a pager, it actually execs less in the parent side of the fork, making itself the child of the less process; dear Unix people, explain to me why you'd do that? (I have my thoughts, but I'm curious about yours.)
nice to see I'm not alone
I spent some time in the past trying to study git's code (I did a similar thing to gat, but in python), and ended up with the same feeling of disgust for that code. I wrote a blog post (http://pcapriotti.wordpress.com/2007/10/08/some-thoughts-on-git/) about it and I was highly criticized as an incompetent C++ hacker that doesn't understand C.By the way, I'm happy to see that some work is going on in this direction. Great job! I hope this will not just stay vaporware.
Paolo
exec-ing less
The reason is that the shell is holding on the pid of the parent process, and you want the shell to think git has exited when less quits. So you replace the parent process with less, and you create the input to it in the child.ha!
The Need for Manual Memory Management
evan wrote, "It's crazy to me that in 2008 people are still writing code that is manually passing around buffers and lengths and verifying by hand that the space works out."Until you can write code that has the identical functionality and performance without manual memory management, the need still exists. That performance gain by manual memory management is a huge factor for git with large repos such as the Linux Kernel.
Re: The Need for Manual Memory Management
Even in C you can usually hide the details of the manual memory management behind an API and an opaque pointer. Using "raw pointer plus length" is asking for trouble.Re: The Need for Manual Memory Management
Yes, this is the comment I was trying to make -- I was surprised to see lots of "char* buffer, int len" arguments to functions rather than some ADT around them.Re: The Need for Manual Memory Management
Pass by value instead of pass by reference is a big win when you're too lazy to make your own locals, maybe?Re: The Need for Manual Memory Management
Given that I don't want to reimplement (nor wrap) strlen, strncpy, snrintf, et cetera, it makes sense to continue to use the calling convention they provide. The lack of automatic destructors is a big obstacle to greater use of ADTs in C, too.That style of goto is prevalent in Linux, and makes more sense in its traditional context: sched.c for example.
cool! come hack on gat with darcs people (25-26 oct)
Evan, this sounds like really interesting work! I've been longing for somebody to start working on a Haskell git implementation for a while. There is also user on #haskell (ejt) who is working on a Haskell implementation of hg. As a darcs guy, I think this is great having more people work on version control software in Haskell. If there is enough viable work on different systems, we could hopefully start to pool resources and just generally learn a lot from each other.Perhaps I could invite you to the upcoming darcs hacking sprint? It's on the weekend of 25-26 October, in Portland, Brighton and Paris (or on IRC).
http://wiki.darcs.net/index.html/Sprints
I think it would be lovely if you could go hack on gat for a weekend in a room of people hacking on darcs :-) Hopefully ejt will be there too.
Re: cool! come hack on gat with darcs people (25-26 oct)
I've long wanted to visit the Galois offices, but I'm not sure I can commit to a trip to Portland just for such an event. :)The goto
The goto is there to skip the return statement directly below it. Without the goto that same code would have had to been written as:ret = ent->data; if (ret && ent->p == p && ent->base_offset == base_offset) { // Everything in the function below the found_cache_entry label // ... // Possibly many, many lines. } else { return unpack_entry(p, base_offset, type, base_size); }Re: The goto
I think he meant, why use goto instead of:ret = ent->data;
if (!ret || ent->p != p || ent->base_offset != base_offset)
return unpack_entry(p, base_offset, type, base_size);
DeMorgan's is your friend.
Often I see much code to make me wretch:
if (a)
{
return true;
}
else
{
return false;
}
or
if (a)
{
return false;
}
else
{
return true;
}
or the ever popular
if (a)
{
}
else
{
...
}
Why use only 1 line when you can use 7 or 10 or more?
Re: The goto
Sorry for not using <pre> tags. It's my first time posting here.Re: The goto
Hmm, how about:ret = ent->data;
if (!(ret && ent->p == p && ent->base_offset == base_offset)) {
return unpack_entry(p, base_offset, type, base_size);
} else {
// Everything in the function below the found_cache_entry label
// ...
// Possibly many, many lines.
}