> Don't GPUs also have out of order execution and instruction level parallelism?...

> Don't GPUs also have out of order execution and instruction level parallelism?

Not any contemporary mainstream GPU I am aware of. Sure, the way these GPUs are marketed does sound like they have superscalar execution, but if you dig a bit deeper this is either about interleaving execution of many programs (similar to SMT) or a SIMD-within-SIMD. Two examples:

1. Nvidia claims they have simultaneous execution of FP and INT operations. What this actually means is that they can schedule an FP and and INT operation simultaneously, but they have to come from different programs. What this actually actually means is that they only schedule one instruction per clock but it takes two clocks to actually issue, so it kind of looks like issuing two instructions per clock if you squint hard enough. The trick is that their ALUs are 16-wide, but they pretend that they are 32-wide. I hope this makes sense.

2. AMD claims they have superscalar execution, but what they really have is a packed instruction that can do two operations using a limited selection of arguments. Which is why RDNA3 performance improvements even on compute-dense code are much more modest. Since these packed instructions have limitations, the compiler is not always able to emit them.