Instead of holding a reference that will get invalidated by
dma_pushbuffer.pop(), hold it as a copy. This doesn't have any
performance cost since CommandListHeader is 8 bytes long.
As fetching command list headers and and the list of command headers is a fixed 1:1 relation now, they can be implemented within a single call.
This cleans up the Step() logic quite a bit.
Fetching every u32 from memory leads to a big overhead. So let's fetch all of them as a block if possible.
This reduces the Memory::* calls by the dma_pusher by a factor of 10.