12:16 pm, 5 Aug 06
hoare's find algorithm, quickcheck
I generally find the study of algorithms at the low, which-sort()-is-the-fastest level to be uninteresting. I have no excuse, really: certain things appeal to certain people and this one just isn't for me. (Also, in the work I do, the real speed differences come from the larger rearrangements that the premature-optimization-haters always claim. Stuff like: which machines carry which data, getting rid of O(n^2), etc.)
So please forgive me for my intro-CS level of discussion, but I was curious about this the other day and enjoyed learning about it.
the algorithm
Here's an algorithm (credited to Hoare, not to be confused with
graydon) for finding the median of an unsorted array. Actually, it more generally gives you the kth smallest element, and getting the median is just k = length / 2.
observations
The simpler algorithm for this is, of course:
Another way of interpreting this is like a depth-first build of a binary tree: the guess is your current node, the partition picks all the nodes that belong on the left and right branches, and the recursion continues building the tree on the branch.
implementation
This (or an equivalent function) is in C++, of course, as the function
Here's a Haskell implementation of it using lists. (Yeah, not quite the same, but it carries the idea.)
verifying
But how do I know it works? There's a neat library called "QuickCheck" that does randomized testing: you tell it what sorts of inputs your function expects and a way to test whether the output is correct, and it randomly generates inputs to verify proper functioning.
QuickCheck is one of those libraries that seems to work by magic: no code preprocessing or introspection, just (as always) fancy types.
For example, here's how I write the requirement for
(Here,
And then you check it with simply:
And QuickCheck can figure out (through type inference, again) that this particular property takes an integer and a list, and automatically generates test cases and runs them.
When I first ran this it found all sorts of places where it'd fail that would be easy to forget with a unit test: for example, when the input list is empty, or when n is not in the bounds of the list. So they provide all sorts of fancy higher-level operations that let you modify the random-input-generators or restrict the inputs, or even have it provide breakdowns of the sorts of inputs it gave ("32% three-element lists.") The manual is hokey-looking but quite impressive.
P.S
Here's an interesting thread about how laziness affects big-O.
So please forgive me for my intro-CS level of discussion, but I was curious about this the other day and enjoyed learning about it.
the algorithm
Here's an algorithm (credited to Hoare, not to be confused with
- Guess that the kth element is the kth smallest.
- Partition the array into smaller and greater halves than that guess.
- If the smaller half has k-1 elements, the guess was correct.
- Otherwise, the kth smallest element is in one of the two halves, so adjust k and recurse into the appropriate one.
observations
The simpler algorithm for this is, of course:
- Sort the array.
- Take the middle element.
Another way of interpreting this is like a depth-first build of a binary tree: the guess is your current node, the partition picks all the nodes that belong on the left and right branches, and the recursion continues building the tree on the branch.
implementation
This (or an equivalent function) is in C++, of course, as the function
nth_element.Here's a Haskell implementation of it using lists. (Yeah, not quite the same, but it carries the idea.)
nth n l | n < ltlen = nth n lt
| n > ltlen = nth (n-ltlen-1) gt
| otherwise = pivot
where (pivot,l') = select n l
(lt, gt) = partition (< pivot) l'
ltlen = length ltpartition is a builtin; select is a function that pulls out the nth element from a list.verifying
But how do I know it works? There's a neat library called "QuickCheck" that does randomized testing: you tell it what sorts of inputs your function expects and a way to test whether the output is correct, and it randomly generates inputs to verify proper functioning.
QuickCheck is one of those libraries that seems to work by magic: no code preprocessing or introspection, just (as always) fancy types.
For example, here's how I write the requirement for
nth. I rely on the builtin sort to give me the correct answer.prop_Nth n l = nth n l == sort l !! nThis is just a normal function definition, but you can read the first line as "the property Nth holds for inputs n and l, when ..."
(Here,
!! is the nth-element-from-a-list operator. I assume it has a scary-looking name because it's O(n).)And then you check it with simply:
quickCheck prop_NthAnd QuickCheck can figure out (through type inference, again) that this particular property takes an integer and a list, and automatically generates test cases and runs them.
When I first ran this it found all sorts of places where it'd fail that would be easy to forget with a unit test: for example, when the input list is empty, or when n is not in the bounds of the list. So they provide all sorts of fancy higher-level operations that let you modify the random-input-generators or restrict the inputs, or even have it provide breakdowns of the sorts of inputs it gave ("32% three-element lists.") The manual is hokey-looking but quite impressive.
P.S
Here's an interesting thread about how laziness affects big-O.
Bounds for the median
That algorithm should run in 4n expected time, IIRC.The usual deterministic algorithm, median of medians, runs in 10n
Schonhage et al.[1] provide a deterministic algorithm that runs in 3n + o(n). I don't think there's a faster known deterministic algorithm. The lower bound for deterministic algorithms is 2n, as proven by Bent et al. [2]
LazySelect[3] is a generalized selection algorithm that runs in 1.5n + o(n) expected time (little-o is asymptotically negligable). This is equal to the overall lower bound.
[1] Schonhage, A., Paterson, M. S., and Pippenger, N. 1975 Finding the Median. Technical Report. UMI Order Number: CS-RR-006., University of Warwick.
[2] S.W. Bent and J.W. John. "Finding the median requires 2n comparisons". In Proceedings of the 17th Annual ACM Symposium on Theory of Computing, Providence, Rhode Island, pages 213--216, 1985.
[3] Motwani, R. and Raghavan, P. 1995 Randomized Algorithms. Cambridge University Press.
Re: Bounds for the median
I feel like noting that these selection algorithms are Las Vegas algorithms, because the term is cool.Re: Bounds for the median
You can peek at LazySelect using Google books. If that link doesn't work, here's the search and just hit page 48. It's more complicated than Hoare's algorithm, but not horrendously so (I think). I'd try to explain the bounds, but Randomized Algorithms is still on my to-read list and I don't already understand this sort of analysis.What interests me about algorithms and data structures, besides "hey, that's clever", are their motivations and applications. Basic in-memory sorting is pretty boring since we already have algorithms that match the lower bound of O(nlogn). However, there's always pseudopolynomial sorts (e.g. radix, counting) and adaptive sorting (e.g. splaysort). I'm also trying to practice writing a bit by adding things to Wikipedia when I have time and feel confident.
Sorry about the multiple comments. I really think I'm done this time :p
QuickCheck sounds like a Haskell-enmagicked version of LectroTest. Cool.
##[ x <- Int, y <- Int ]##is very much Hs list comprehension style, and he manages nicely to Perlify QuickCheck'sclass Arbitrary.Heh, I only just noticed that he credited QuickCheck as the inspiration for LectroTest in the docs.
I learned about LectroTest long before I knew of Haskell. Small world.