evan_tech

08:54 am, 22 Jul 08

dns attack of doom

If I've learned anything from the new Kaminsky DNS attack, it's that if you want to keep something a secret while disclosing to a trusted subset of vendors, you do not include publicity-hungry overeager bloggers in the list of people who can keep their mouths shut.
10:22 am, 19 Jul 08

heavy protocols add up

I found this discussion of Android's dropping XMPP interesting. (Disclosure: I have no insider knowledge about any of this.) In particular, this remark about compression in the context of the notoriously fat XMPP protocol:
Bandwidth used means radio transmissions sent, and overhead means more work done by the processor, both of which take battery power and reduce battery life. Meanwhile, compression turned out to not be very helpful. Since it's negotiated during connection startup, it doesn't help with startup overhead. It does help somewhat with steady-state bandwidth, but at the expense of additional CPU cycles. The result is that enabling compression actually reduced battery life in our tests -- it took more power for the CPU to do compression than we saved on radio power.
(I wonder though: perhaps they could've used a simpler form of compression? XML ought to be "easy" to compress. Maybe the spec doesn't allow it?)

You see a similar phenomenon with HTTP on a heavy page. Like CNN.com: Firebug says it took 135 HTTP requests to load the page. Many of them are to a CDN and only have ~400 bytes of request headers but all the ones to cnn.com (including the ads, apparently!) include cookies pushing them up to nearly a kilobyte. The net result is that the latency of the page starts getting affected by the end-user's upstream bandwidth, which is usually terrible. (But now, having typed that out, I wonder: does it really matter? Even those heavier requests fit within a packet anyway...)
04:04 pm, 10 Jun 08

dolt libtool wrapper

dolt: double the build speed of libtool-dependent software. Examples of such software: kdelibs, gtk, libx11, libxml2, dbus.

So many CS problems are significantly improved by caching at the right place...

[[info]hober put this on reddit. I had it first! :P]
02:00 pm, 25 May 08

utf-8 is hard



Evite trying to do MÃ¥rtensson.
If you have used a Debian-based system to generate SSH keys in the past two years, your keys are likely no good. This document has instructions. In brief:

1) Delete your bad keys: .ssh/id_*. Fix all systems where you're trusting those keys (think .ssh/authorized_keys); someone has already published a table of all private keys, so it's just a matter of time before your system is brute-forced.

2) Update your systems. I see an "openssl-blacklist" package show up on both my Debian stable and my Ubuntu whateverletterthey'reon one. You'll get some debconf prompts about it clobbering stuff, including potentially your host keys, which means the next time you connect to the machine you'll get the "host keys have changed" message.

3) To make yourself feel less anxious, try running ssh-vulnkey to print an analysis of keys in standard paths on your system. (Run it as sudo ssh-vulnkey -a to check all users on your system.)
10:05 pm, 13 May 08

type-safe printf

printf is a function with a complicated type. In C we used to just give up and tell the compiler "this function takes some other stuff that you shouldn't worry about" with the amusing "..." builtin. These days compilers have special support for annotating printf-like functions to provide type-checking. The other side of this is that an implementation of printf necessarily has a little tokenizer/parser for run-time processing of the format string, along with the associated performance penalty*.

Yet pretty much all programs that involve format strings ought to have the format strings known statically. Even a mini-language like printf turns out to have enough power to not be able to safely process untrusted input, as the "poke" instruction (named %n) demonstrated by creating a completely new class of security vulnerabilities. And without the compiler to help with type coercions, it's easy to write something invalid, especially when you're playing fast and loose with integer and pointer sizes across platforms.

Perl and Ruby neatly sidestep this problem by using string interpolation: at parse/compile time, the compiler scans the strings for bits to be expanded and just rewrites the "format" string to the equivalent concatenation of literal strings and variable values, which then uses the normal language's support for pasting strings together. (Prove it to yourself: perl -e 'use strict; print "$x";' aborts with a "compilation error".) But sometimes you really just want something like printf, and both those languages fall back on "figure it out at runtime" for that.


Supporting printf at all proves to be pretty difficult in more strict languages which generally require all types to be known. OCaml's compiler does some crazy hacks where sometimes a quoted string is interpreted as a format, a six-parameter type that, for example, needs its own concatenation operator. Haskell at least encodes it in the user-available language with some typeclass magic that gets you to more or less to feature parity with dynamic languages -- failure at runtime if the parameters don't properly line up.

But it turns out there's a nice paper that provides a type-safe encoding of printf that doesn't rely on any fancy language features. The paper is structured like this: (1) "wouldn't it be nice if printf worked like this?" (2) "oh wow check it out, here are the functions!" I've been staring at it for a week and though I can sorta see how it works, it's unclear to me how anyone would come up with it. Here's an overview from a person who lacks sufficient brain to say much smarter (me).


To start with, you don't use a string for the format string. This sorta seems like you've already given up, but you could imagine a macro expanding a format string into the proper expression here, much like how Perl/Ruby's interpolation works. Since this is functional code we're talking about, it ends up being an expression involving some functions. Format string concatenation ends up being encodable as composition, which means you end up with the same operator as Perl (.) for pasting them together.

The basic task then is that you need print (some magic here) to be able to give you a function of varying types, depending on what the magic is, so that print format1 3 "foo" can be type-checked that format expects an int and a string. So the type of printf must be (some magic type involving an a) -> a, where the polymorphic a is a function produced by the magic. And here's where the magic drops, painful in its simplicity:

lit :: String -> (String -> a) -> String -> a
lit text k s = k (s ++ text)

int :: (String -> a) -> String -> Int -> a
int k s val = k (s ++ show val)

printf :: ((String -> String) -> String -> a) -> a
printf fmt = fmt id ""


And that's it. lit is what converts a literal string into a format, while int is the placeholder for an int. So "my int is %d\n" would be expressed as lit "my int is " . int . lit "\n". If you drop this in to GHC you'll see that the type of printf (lit "my int is " . int . lit "\n") really is, as you'd expect, Int -> String -- it's waiting for you to give it an int so it can dump out the formatted string. The result of printf is just a plain function, so you do all the normal sorts of things you'd want, like partially apply it or pass it to map. The formatters are plain functions, too, so you can add your own formatter that, for example, can accept a list (as he does in the paper).


So how's it work?

Look at the two formatters, int and lit. The k parameter to the formatters is a continuation: it's where each formatter should pass the string constructed so far when the formatter is done. Then the s input parameter is the string as constructed so far. You can mentally expand printf (lit "foo" . int) as (lit "foo" (int (id))) "", where that empty string is your starting string and id is the innermost continuation which just gives you back the string that's been constructed.

You can also look at int like this, just with some added parens for clarity: (String -> a) -> (String -> (Int -> a)). It takes the continuation as a parameter, and the function it returns is sorta the same shape as the continuation but with Int -> a in place of a -- that's how it tacks on its need for an int to the greater formatting requirement.

But from there... uh, the types just work out. I don't know. It's pretty much magic.


[Edit: a commenter on reddit linked to this discussion of the general technique, which I haven't read yet but looks promising.]


* Though I'd argue you're in a pretty bad place if printf performance is your bottleneck.
09:20 am, 12 May 08

related posts

Dear Wordpress,

The links added to posts via your "related posts" feature are rarely (perhaps never?) actually "related" to the post you add the links from. This harmful in two ways: one, every time I click one of those links thinking the page author had more information for me to read (like Greg Linden's blog, which often has great related links), I find unuseful content and it frustrates me. Two, and maybe more worrying for you, you're training me to ignore those links; if you ever do improve the quality of the matching system later I'll never discover it.
03:18 pm, 10 May 08

concurrent editing

While I'm on the subject of concurrent editing:

My reading group read Designing a commutative replicated data type a few months ago. The basic idea I've retained from the paper (is has been some months) is that one way to avoid conflicts is to design your data representation such that conflicts are impossible by making all operations commute, demonstrating the theory by presenting a design for a multiuser simultaneous editor. ("Like SubEthaEdit" is what I kept saying to people, but apparently few people I know have heard of it.) By representing positions within the buffer as adddresses related to characters you currently know about, and having a globally-defined resolution strategy for two edits to the same position, you can safely allow edits to come in from clients in any order and maintain consistent state.

(Commuting operaitons sounds like darcs, doesn't it? In fact, this fellow was discussing darcs's patch theory in connection to concurrent editing, though I suspect ultimately it's the wrong model...)

The paper's really pretty clever in a bunch of ways (like: how do you make a globally-consistent addressing scheme in the presence of simultaneous edits?) and a friend and I sat down to implement it as a web app with a bunch of Javascript. I had planned to target the release of App Engine but the lack of Comet (again, that bites me) sorta turned me off to the whole idea. And I have so many other projects I ought to be finishing first... He, however, powered on to get something that seems to mostly work but is a bit on the inefficient side.

PS If you're curious, the model Gobby uses (Gobby is another simultaneous editor) is described here. It's again the "hope we update often enough that there's no conflict" design I mentioned in the previous post.
A few separate posts, all in the same area.

1) Most (all?) the distributed bug tracking software I've glanced at stores bugs in a directory, one file per bug. This seemed like poor design to me. I confirmed by showing Brad the output of ls on one; his full response was "doesn't scale" and turning back to what he was working on.



2) Having thought more about the relationship between code and bug state, I have concluded I was thinking about it the wrong way. Going back and reviewing your comments, I see a bunch of you figured this out before me. Here's the critical piece I was missing. Code has history, which is tracked by the graph of related versions. Bug state both refers to the code history and also has its own history, in that new bugs are opened and old ones are closed. Those two histories related to bugs are not the same: even when examining old code, you generally care about the newest bug state. (This is why most modern bug systems only let you use the newest bug state; making changes to it permanently clobbers the old state. However, note that most do care about showing you the history of modifications to a bug; the interesting view is the most recent copy of the bug's entire history.)

As Aristotle and Lee pointed out on my older post, connecting the code history graph and the bug state could be modeled as annotations pointing at commits. The state of bugs present in a given version is the collection of all bugs states that have been attached to an ancestor of that version. This means discovering a bug in a previous release "infects" (to use his term, which is a good one) all branches derived from that release, and a given branch is only fixed once it merges the code that fixes the bug. (Making that work efficiently is an exercise for the reader; I have some ideas that aren't worth sharing yet.)



3) Part of the reason I got thinking about all of this because I wanted a separate feature: a command-line interface to bug tracking. I hate using web apps both because web sites become inaccessible, get slow, or go down (a problem addressed by making it distributed) and just because I hate clicking around on web forms (a website can't, for example, query my current checkout for which branch I'm claiming the bug is fixed on). You could make a CLI-based interface to Trac -- maybe one exists already -- and it would at least address the second half of that.

At a superficial level, the command-line problem isn't really at all same thing as a distributed system. But there also is a connection at a deep level. I like to say (and here by "say" I mean "think" because nobody ever wants to listen to me jabber on about this stuff) that even centralized projects have distributed branches every time someone edits a file in their own checkout; it's just that the tools we have for those branches are weaker than the tools typically used for "real" branches. On most systems you typically can't record your changes until you've verified they would merge cleanly with the master branch (though monotone/mercurial/fossil fix this implicitly and git does if you're using the proper workflow); on most systems you can't examine what happened upstream in the same way you examine changes that happened locally.

This same problem -- that forks happen on every checkout -- is true with any web-based database; it's just that when you're using a website you tend to commit more frequently so the forks don't get a chance to conflict. And for that reason the tools for managing conflicts are usually pretty weak, as anyone who's encountered a conflict on a bug tracker has probably experienced (my impression is that most just say "click back and type in what you were saying again"). My favorite bad conflict-resolution system is probably Google Docs, which, upon a conflict, pops up a window saying something to the effect of, "This paragraph was edited while you were editing it. Here is your text; please copy and paste it back into the document and figure out how to fix it."

So back to the command line: you need to solve this conflict problem anyway for a distributed system to work, but you also probably need to at least improve it for a command-line-editable centralized system to work, because with a command-line app you don't really get a "back button". You could implement back-button-like functionality, but if you're going to implement new functionality to handle this case, perhaps you could implement a more sane model instead.
10:58 pm, 6 May 08

terminal emulation

When you run script (see man 1 script if you're not familiar with it) it logs every keystroke you type, including backspacing over typos. Val asked today: how can you format that output so that it appears as the result? I found it interesting to think about. You'd imagine there'd be some way to run terminal emulation (?) and then somehow pipe that through cat. It's an interesting mixing of levels; I started running my mouth off about screen and vte but didn't come to a good conclusion.

Along those lines, it made me think of this interesting thread in which "cmd --color" produces escape codes while "cmd --color | cat" produces colored output. (Try to guess the circumstances before you read the other side of that link!)