The x86 instruction set has a somewhat peculiar set of SIMD integer multiply operations, and Intel’s particular implementation of several of these operations in their headline core designs has certain idiosyncrasies that have been there for literally over 25 years at this point. I don’t actually have any inside information, but it’s fun to speculate, and it gives me an excuse to waffle about multiplier designs, so here we go!
MMX
x86 didn’t have explicit SIMD integer operations before MMX, which first showed up in – no big surprise – the Pentium MMX. Said Pentium MMX offered exactly three SIMD integer multiply instructions, and all three of them originally took 3 cycles (fully pipelined).
The first and most basic one is PMULLW, “packed multiply low word”, which interprets its two 64-bit MMX register operands as containing four words (which in x86, if you’re not familiar, means 16-bit integers) each. The corresponding lanes in both operands are multiplied and the low 16 bits of the result written to the corresponding lane of the destination. We don’t need to say whether these integers are interpreted as signed or unsigned because for the low 16 bits, it doesn’t matter. In short, it’s a basic element-wise multiply working on 16-bit ints.
The second available integer multiply is PMULHW, “packed multiply high word”. Again, we multiply 16-bit lanes together, which (in general) yields a 32-bit product, and this time, we get the top 16 bits of the result. This time, we need to make up our mind about whether the integers in question are considered signed or unsigned; in this case, it’s signed. A fun fact about “high multiply” type operations (which exist in a lot of instruction sets) is that there’s no practical way (at least, not that I’m aware of) to compute just those high bits. Getting those high bits right generally means computing the full product (in this case, 32 bits per lane) and then throwing away the bottom half. Therefore, a datapath that can support both types of multiplies will usually end up having a full 16×16->32-bit multiplier, compute all product bits, and then throw half of them away in either case.
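In scalar C, per-lane models of these two look something like the following sketch (function names are mine, these are not the actual intrinsics):

#include <stdint.h>

/* PMULLW, one lane: keep the low 16 bits of the product. Signedness
   doesn't matter for the low half, so unsigned arithmetic works fine. */
static uint16_t pmullw_lane(uint16_t a, uint16_t b) {
    return (uint16_t)((uint32_t)a * b);
}

/* PMULHW, one lane: form the full signed 32-bit product, then keep
   only the top 16 bits. */
static int16_t pmulhw_lane(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b) >> 16);
}

The actual instructions do this for all four 16-bit lanes at once.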
That brings us to the third and last of the original Pentium MMX’s multiply-type instructions, and the most fun one, which is PMADDWD. I think this originally stands for “packed multiply and add words to doublewords”. That makes it sound like it’s a multiply-add type operation, but really it’s more like a two-element dot product: in pseudocode, PMADDWD computes result.i32[i] = a.i16[i*2+0] * b.i16[i*2+0] + a.i16[i*2+1] * b.i16[i*2+1]. That is, it still does those same four signed 16×16-bit multiplies we’ve been doing for the other two instructions, but this time with a “use the whole pig” attitude where the full 32-bit results are most definitely not tossed out. If we can’t return the whole result in a 16-bit operation, well, just pair even and odd adjacent lanes together and sum across them. Because when we’re summing across pairs of adjacent lanes, we get 32 bits to return the result in, which is perfect (we almost don’t need to worry about overflow here because the two constituent products are signed and can’t get too large; the one corner case is all four inputs being -32768, where the sum wraps).
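A scalar model of one 32-bit output lane, under the same caveats as before:

#include <stdint.h>

/* PMADDWD, one 32-bit output lane: two signed 16x16 products, summed.
   Both products lie in [-(2^30 - 2^15), 2^30], so the sum fits in 32
   bits except for the single all-inputs-equal-to--32768 corner case,
   which wraps to -2^31. */
static int32_t pmaddwd_lane(int16_t a0, int16_t a1, int16_t b0, int16_t b1) {
    return (int32_t)a0 * b0 + (int32_t)a1 * b1;
}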
Now, this description sounds like we’re still computing full 32-bit results for each of the 16-bit lanes, and then doing a separate 32-bit addition afterwards to combine the two. That’s a possible implementation, but not necessary; this is not a post about how multiplications are implemented (some other time, maybe!), but the gist of it is that multiplier hardware already breaks down N-bit by N-bit multiplies into many smaller multiplications (the “partial products”) of an N-bit number by a much smaller digit set. The obvious choice would be N-bit by 1-bit products, which leaves just “x*0” and “x*1” products, but in practice other digit sets are much cheaper. The partial products are then summed together in a kind of reduction tree, and again, there are slicker ways to do it than just throwing down a whole bunch of fast adders, but the details don’t matter here. What does matter is that you can have either of the even/odd 16-bit multipliers do their normal thing until very close to the end, and then do the “cross-over” and final 32-bit summation very late (again with plenty of hardware reuse compared with the 16-bit result paths).
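To make “partial products” concrete, here’s a toy version using the naive 1-bit digit set (real designs use Booth recoding and carry-save reduction trees, but the overall shape is the same):

#include <stdint.h>

/* A 16x16->32 multiply as a sum of 16 shifted partial products, each
   of which is just a*0 or a*1. Hardware sums these in a reduction
   tree rather than sequentially like this loop does. */
static uint32_t mul16x16_partial_products(uint16_t a, uint16_t b) {
    uint32_t sum = 0;
    for (int i = 0; i < 16; i++) {
        if ((b >> i) & 1)
            sum += (uint32_t)a << i;  /* partial product: a * bit_i * 2^i */
    }
    return sum;
}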
In short, not only does PMADDWD let us fully use both 32-bit results that we were computing anyway, it also doesn’t touch the first 90% of the datapath at all, and the final 10% can be made to share plenty of logic with the regular path too if desired. Which is why it’s fun.
SSE
The headline item for SSE was SIMD floating point operations (not my subject today), but it also patched a hole in the original MMX design by adding PMULHUW (packed multiply high unsigned word). This one does the multiply unsigned and gives you the high word result. Once again, this is a minor change to the hardware.
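The per-lane model is just the unsigned twin of pmulhw_lane above:

#include <stdint.h>

/* PMULHUW, one lane: full unsigned 32-bit product, keep the top half. */
static uint16_t pmulhuw_lane(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a * b) >> 16);
}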
SSE2
This one added 128-bit integer SIMD where MMX was just 64-bit. It did so, in its initial implementations, by adding 128-bit registers while keeping a 64-bit datapath and issuing instructions over two cycles. Not surprisingly, then, all the SSE2 integer multiply instructions (and in fact the vast majority of SSE/SSE2 instructions in general) can be understood as working on independent 64-bit blocks at a time. (AVX/AVX2 would later do the same thing with 128-bit blocks.)
It does however add the rather awkward-seeming PMULUDQ (packed multiply unsigned doubleword to quadword), which multiplies two pairs of unsigned 32-bit integers (in bits [31:0] and [95:64] of either source) to give two 64-bit results. And it does so with the same latency as our 16-bit multiplies! Is that a much wider multiplier at work?
Turns out, not necessarily! Let’s look at a single 32-bit product a * b, and split a = (a1 * 65536) + a0 and b = (b1 * 65536) + b0. 65536 is of course 2^16 and we’re really just chopping a and b in half. Multiplying that out, we get:
a * b
= (a1 * b1) * 2^32 + (a1 * b0 + a0 * b1) * 2^16 + (a0 * b0)

That’s one 64-bit result assembled from four 16×16-bit products: a1*b1, a1*b0, a0*b1 and a0*b0. And four 16×16-bit multipliers per 64-bit block is exactly what we already have lying around for the word-wise multiplies; all that changes is how far the various partial products get shifted before the reduction tree sums them, which amounts to some extra muxing and a wider final summation, not a wider multiplier.

SSSE3

SSSE3 added two more multiply-type instructions. The simpler one is PMULHRSW, “packed multiply high with round and scale”, which computes the full signed 16×16-bit product, adds a rounding bias, and returns bits [30:15], i.e. (a*b + 0x4000) >> 15. This requires a slight tweak to the reduction tree in the multiplier to sum in one extra term somewhere that we can use for the rounding bias. Grabbing different output bits from the result is not a big deal.
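In scalar C, these two tricks look something like the following sketch (names of my own choosing):

#include <stdint.h>

/* PMULUDQ's trick: one unsigned 32x32->64 product assembled from the
   four 16x16 products the datapath already computes. */
static uint64_t mul32x32_from_16bit_pieces(uint32_t a, uint32_t b) {
    uint64_t a0 = a & 0xFFFF, a1 = a >> 16;
    uint64_t b0 = b & 0xFFFF, b1 = b >> 16;
    return ((a1 * b1) << 32) + ((a1 * b0 + a0 * b1) << 16) + (a0 * b0);
}

/* PMULHRSW, one lane: signed product, rounding bias, keep bits [30:15]. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b + 0x4000) >> 15);
}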
The more complicated one is PMADDUBSW, which is like PMADDWD but on pairs of bytes, not words, and to keep things interesting, it’s an unsigned*signed multiply. I think this might have been inspired by AltiVec (or maybe there’s a common ancestor I’m not aware of?), which had this type of operation in its “vmsum” family of instructions (alongside a PMADDWD equivalent and some other operating modes), but the AltiVec version is nicer because it’s a 4-element dot product on bytes producing a 32-bit result. PMADDUBSW produces a word result, which, it turns out, does not quite work. The problem is that multiplying unsigned by signed bytes means the individual product terms are in range [-128*255, 127*255] = [-32640, 32385]. Our result is supposed to be a signed word, which means its value range is [-32768, 32767]. If the two individual products are both near the negative or the positive end of their range, the sum overflows. PMADDUBSW decides to saturate the result, i.e. clamp it to lie within [-32768, 32767] instead. This is well-defined but frequently quite annoying to work with.
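A scalar model of one 16-bit output lane, saturation included (again a sketch, names mine):

#include <stdint.h>

/* PMADDUBSW, one output lane: a is the unsigned source, b the signed
   one. The sum ranges over [-65280, 64770], so both clamps are
   reachable. */
static int16_t pmaddubsw_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1) {
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    if (sum >  32767) sum =  32767;
    if (sum < -32768) sum = -32768;
    return (int16_t)sum;
}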
In terms of implementation, I’ve not actually worked out the details here. I will point out that one way of designing multipliers is to use a few levels of a recursive decomposition into smaller multipliers much as I just did with PMULUDQ; chopping the constituent 16-bit multipliers up into byte pieces would presumably work, although at this point, we don’t just have some extra muxing near the beginning or end of the datapath, we’re also starting to add a lot of constraints on the rest of the internal implementation (if we’re trying to share as much hardware as possible, that is). We’ve just about pushed this as far as we can go.
SSE4.1
SSE4.1 adds PMULDQ which is PMULUDQ, but signed. The same basic approach as PMULUDQ should work, so I’ll not get into it.
It also added the, at that point, long-awaited PMULLD, a doubleword-by-doubleword low multiply. To date, we have not gotten any high multiplies for doublewords, not in SSE4.1 nor in any subsequent extension, and it seems unlikely at this point.
Curiously, with PMULLD, something’s different: it has half the throughput and twice the latency of all the other multiply-type operations on Intel’s big core designs, and takes two uops whereas all the other multiply-type operations mentioned so far take one.
Once again, I think the divide-and-conquer approach described for PMULUDQ above explains both. Looking at the product terms:
a * b
= (a1 * b1) * 2^32 + (a1 * b0 + a0 * b1) * 2^16 + (a0 * b0)

we only want the low 32 bits of the result this time, so the (a1 * b1) * 2^32 term vanishes entirely; but that still leaves three 16×16-bit products per 32-bit lane. A 64-bit block contains two 32-bit lanes, so that’s six 16×16-bit products needed, while the block only has four 16×16-bit multipliers to work with. Splitting the job into two uops, each of which gets a full set of four multipliers, covers it; that would explain the two uops and the halved throughput, and with the second uop depending on the first, the doubled latency too. At some point it presumably becomes easier to just put proper 32×32 multipliers in there as well instead of adding complications. That seems to be what AMD has done, as well as Intel’s E-cores. But it’s still fun to speculate!
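A quick sketch of the low-half decomposition above (hypothetical helper, mirroring the earlier one):

#include <stdint.h>

/* Low 32 bits of a 32x32 product: only three of the four 16x16
   partial products contribute; (a1*b1) is shifted out completely.
   uint32_t arithmetic wraps mod 2^32, which is exactly what we want. */
static uint32_t mul32x32_lo_from_16bit_pieces(uint32_t a, uint32_t b) {
    uint32_t a0 = a & 0xFFFF, a1 = a >> 16;
    uint32_t b0 = b & 0xFFFF, b1 = b >> 16;
    return ((a1 * b0 + a0 * b1) << 16) + a0 * b0;
}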
What about AVX-512 VPMULLQ, IFMA or VNNI?
The original VNNI is a pretty straightforward generalization of the PMADDWD/PMADDUBSW designs to include a final accumulation too, much like the aforementioned vmsum family of instructions in PowerPC. I have a suspicion this might be due to a combination of a) x86 SIMD datapaths post-AVX2 (which added FMAs) having native support for 3-operand instructions and b) quite possibly some AltiVec-era instruction patents having expired in the meantime. For a), the extra addition is not the hard part (as mentioned above, an extra term is very natural to sneak into any multiplier datapath); the actual cost is all in the extra wiring and bypassing to have a 3-operand datapath. But once it’s a sunk cost because you want float FMAs anyway, might as well use it! For b), I have no idea whether this is the case or not; it’s just funny to me that AltiVec had these integer dot product instructions from the start while x86 took forever to add them (after people spent something like a decade using PMADDUBSW with a follow-up PMADDWD by an all-1s vector literally just to sum the pairs of words in a 32-bit lane together).
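For reference, a scalar model of one dword lane of VNNI’s VPDPBUSD (the non-saturating variant; the separate VPDPBUSDS saturates the final add):

#include <stdint.h>

/* VPDPBUSD, one 32-bit lane: four unsigned*signed byte products
   summed on top of a 32-bit accumulator, no intermediate saturation. */
static int32_t vpdpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}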
IFMA is a completely different story because even though it’s an integer multiply, it’s very clearly designed to use the double-precision multiplier in the floating-point datapath. Completely different multiplier with a different set of fun design constraints!
VPMULLQ, I have nothing to say about. Literally haven’t ever looked into, or tried to figure out, how it’s implemented. Might do so at some point, but not today!
And I think that’s all of them.