11:17 pm, 12 Aug 07
pointer tagging
This paper, Faster laziness using dynamic pointer tagging, came through my queue and though I didn't think much of it the first time through, it turns out to be more interesting than I had thought when I first read it.
First, some background on pointer tagging. In the implementation of a dynamic language you need to be able to figure out what exactly a given value (thinking concretely at the machine level here, like a value in a register) holds. One easy way to do this is to make everything a pointer to a known layout of an object, but that means that even integers are pointers and that's inefficient. A common trick is this: because pointers are word-aligned, the low bits are zero, so you can use the lowest bit(s) as a tag on a word to indicate that the integer is stored "inline": that is, if bit 0 is set, then bits 1-31 are the integer's value, while otherwise it's a pointer to some other sort of object. (If this is unfamiliar to you, read this overview of Ruby's VALUE datatype; I didn't know Ruby did this until I looked up that article but I expected that was their design.) I believe Lisps do something similar as well.
I made an old post¹ on here about O'Caml does this -- it's the same thing, or at least it was in 2003 -- but the cute bit with O'Caml is in their implementation. Since they're compiling to native code, they can actually use the x86 addressing hardware to add integers in a single instruction. That is: when you have two integers A and B that are represented in registers as
Ok, then: Haskell. Haskell² instead has this hideously inefficient-seeming implementation where all variables³ are -- get ready, it's a bit freaky -- pointers to code that is executed whenever the value of the variable is needed. This is a cute trick for lazy evaluation: the first time the value is actually needed, only then does the code for computing its value run; at the end of its run, this code overwrites itself with code that just quickly returns the result of its computation. So all subsequent accesses to that value (which again execute that code) can return the computed value instantly.
But this design is also terrible, because any time you access any of these variables you're doing a couple of jumps and (as this paper's introduction observes) you screw branch prediction and cause pipeline stalls. To me this design is very Haskell-y: sacrificing performance for internal purity and consistency. (For all of that, it still performs well in microbenchmarking contests -- why?)
Haskell's got the same two low bits on these pointers (3 bits, even, on 64-bit architectures) that are unused just like in these other languages. The paper says: let's use those bits to tag pointers to maybe avoid doing these jumps and speed things up!
...and right there was my understanding when I first skimmed through this paper -- which also has nice architectural descriptions about GHC and is pretty easy to follow -- and it seemed straightforward enough to not warrant posting about here.
But here's actually something more interesting going on. (As common when I post here, I'm embarrassing myself in writing this because I clearly didn't understand the paper at in depth on my first-read through.) The thing about Haskell is that it's statically typed: you don't need to use the tag bits to represent information about what type of data you're looking at, because that's known at compile-time -- so not only can you use these bits, you can use them differently and specifically depending on what type of data you're considering!
So for a list, the pointer can encode whether you've got a cons cell or the end of the list. Or for a boolean, the pointer directly encodes whether it's pointing at true or false. (Due to the way Haskell's implemented, these are both specific examples of the underlying implementation of this feature.) When it's time to use the value you don't need to follow the pointer at all.
There's still lazy evaluation to support: if the bottom bits are 00, that encodes that you still need to do the jump to code to compute the value. But once the value has been computed, for data types that have three or fewer cases for their value (including booleans and lists), that value be encoded directly into the pointer. For types that have more than three cases, you can still encode whether they've been evaluated already or not (which allows the implementation to read the value behind the pointer without executing code). And again, you can encode all of this information in just two bits because, as a statically-typed language, you can have different behavior for different types.
The paper claims this change is a few hundred lines to the Haskell compiler and produces a 10-14% speedup over their benchmarks suite.
(PS: What about O'Caml? They're statically typed too, so why do they need the this-is-an-integer tagging in the first place? Totally guessing here, but maybe it's so the garbage collector can know that integer values (indicated with a low bit of 1) are not to be followed as pointers. But that's totally a guess...)
1 I tracked it down to discover all of the links were dead. :(
2 Here by "Haskell" I mean "the GHC implementation of Haskell".
3 Haskell doesn't really have variables, because values can't change. But hopefully you get what I'm saying.
First, some background on pointer tagging. In the implementation of a dynamic language you need to be able to figure out what exactly a given value (thinking concretely at the machine level here, like a value in a register) holds. One easy way to do this is to make everything a pointer to a known layout of an object, but that means that even integers are pointers and that's inefficient. A common trick is this: because pointers are word-aligned, the low bits are zero, so you can use the lowest bit(s) as a tag on a word to indicate that the integer is stored "inline": that is, if bit 0 is set, then bits 1-31 are the integer's value, while otherwise it's a pointer to some other sort of object. (If this is unfamiliar to you, read this overview of Ruby's VALUE datatype; I didn't know Ruby did this until I looked up that article but I expected that was their design.) I believe Lisps do something similar as well.
I made an old post¹ on here about O'Caml does this -- it's the same thing, or at least it was in 2003 -- but the cute bit with O'Caml is in their implementation. Since they're compiling to native code, they can actually use the x86 addressing hardware to add integers in a single instruction. That is: when you have two integers A and B that are represented in registers as
(A << 1 | 1) and (B << 1 | 1), they can compute their representation of the sum, ((A+B)<<1 | 1), without doing all the bit math.Ok, then: Haskell. Haskell² instead has this hideously inefficient-seeming implementation where all variables³ are -- get ready, it's a bit freaky -- pointers to code that is executed whenever the value of the variable is needed. This is a cute trick for lazy evaluation: the first time the value is actually needed, only then does the code for computing its value run; at the end of its run, this code overwrites itself with code that just quickly returns the result of its computation. So all subsequent accesses to that value (which again execute that code) can return the computed value instantly.
But this design is also terrible, because any time you access any of these variables you're doing a couple of jumps and (as this paper's introduction observes) you screw branch prediction and cause pipeline stalls. To me this design is very Haskell-y: sacrificing performance for internal purity and consistency. (For all of that, it still performs well in microbenchmarking contests -- why?)
Haskell's got the same two low bits on these pointers (3 bits, even, on 64-bit architectures) that are unused just like in these other languages. The paper says: let's use those bits to tag pointers to maybe avoid doing these jumps and speed things up!
...and right there was my understanding when I first skimmed through this paper -- which also has nice architectural descriptions about GHC and is pretty easy to follow -- and it seemed straightforward enough to not warrant posting about here.
But here's actually something more interesting going on. (As common when I post here, I'm embarrassing myself in writing this because I clearly didn't understand the paper at in depth on my first-read through.) The thing about Haskell is that it's statically typed: you don't need to use the tag bits to represent information about what type of data you're looking at, because that's known at compile-time -- so not only can you use these bits, you can use them differently and specifically depending on what type of data you're considering!
So for a list, the pointer can encode whether you've got a cons cell or the end of the list. Or for a boolean, the pointer directly encodes whether it's pointing at true or false. (Due to the way Haskell's implemented, these are both specific examples of the underlying implementation of this feature.) When it's time to use the value you don't need to follow the pointer at all.
There's still lazy evaluation to support: if the bottom bits are 00, that encodes that you still need to do the jump to code to compute the value. But once the value has been computed, for data types that have three or fewer cases for their value (including booleans and lists), that value be encoded directly into the pointer. For types that have more than three cases, you can still encode whether they've been evaluated already or not (which allows the implementation to read the value behind the pointer without executing code). And again, you can encode all of this information in just two bits because, as a statically-typed language, you can have different behavior for different types.
The paper claims this change is a few hundred lines to the Haskell compiler and produces a 10-14% speedup over their benchmarks suite.
(PS: What about O'Caml? They're statically typed too, so why do they need the this-is-an-integer tagging in the first place? Totally guessing here, but maybe it's so the garbage collector can know that integer values (indicated with a low bit of 1) are not to be followed as pointers. But that's totally a guess...)
1 I tracked it down to discover all of the links were dead. :(
2 Here by "Haskell" I mean "the GHC implementation of Haskell".
3 Haskell doesn't really have variables, because values can't change. But hopefully you get what I'm saying.
It can also give you a little bit of flexibility in boxing style. Some static languages -- IIRC OCaml is one of them -- have more than one method of boxing, depending on the amount of polymorphic specialization at any given site: some specializations have enough information on hand to collapse some sub-boxing of terms, giving contiguous allocations; other places the same datum needs to be unpacked and reboxed in a looser pointer network because the substructure is opaque at that point and you don't want to duplicate the code. If you want this sort of additional representational flexibility -- which is essentially "underneath" the type system, but can have huge performance implications -- it can help to keep a couple tag bigs aside. You can have similar special cases arise when you're doing experiments with escape analysis, optimistic concurrency, dynamic specialization, etc.
IIRC the GHC STG also has a lot of orthogonal reasons for preferring their thunk representation; it's not just "purity of style". They have done a lot of researchy stuff too -- fiddling with parallelism and implicit distributed computation, various GCs, various debugging mechanisms, various FFIs, various calling conventions -- and I think their representation is (or was) designed to fit many such needs.
in the case of GHC, most probably strictness analysis.
For O'Caml I would guess that the tagging for integers is to support bignums.
The thing that I thought was cleverest about the ghc pointer tagging paper was their realization that imprecise tags would work, with only a small performance penalty for incorrect tags. (The penalty being the same speed as the code before pointer tagging.)
My guess about the tag bit is that it's because the ancestral Caml (1985) predates the advent of tag-free garbage collection (1989-1994).
Smalltalk-80 used tags too. An object reference was 16 bits, with the LSB indicating whether it was an integer or an index into the global array of object pointers. Yes, that meant you could only have 32,768 objects. And integers outside the range ±16383 turned into bignums.
In fact one of my tasks the first summer I spent as an intern at Xerox was to work around a side effect of the integer limit — their text editor view became horribly slow when there was enough text that the view height exceeded 16383 pixels. So I hacked up a subclass that used line numbers instead of pixel y coordinates internally. Nasty.
Entirely right
Moose, you're entirely right. Even the earliest Knight Machines had extra bits for tagging - apparently for both typing and GC; later machines like the 3600 expanded the bits, but mostly used them for other things, so I guess there was only so far you could go. (They did CDR coding and error-correction, IIRC).ocaml stuff
Your guess as to why ocaml tags ints is right.The bytecode interpreter (Written in C) has a pretty clever way to take that bit into account, without having to always do a lot of shifting:
(Val_long() and Long_val() are macros that add/remove that tag bit.)
(The historical context of that trick, and much more fun stuff, is related here by Olin Shivers.)
So it's kind of odd (so to speak) that OCaml sets the tag bit for ints and clears it for blocks instead of vice versa.
And apparently MIT Scheme decided to tag with the top six bits of the word, because 64MB was a much larger amount of memory in 1980 than now, but I don't know the details beyond that.