> ...if we meet the shape 512x512 (512 is a power of 2, therefore a very "round" number in computer science), then maybe some kernel will be very fast; but then on a 512x511 matrix, the same kernel may need to add some padding first to transform it into a round 512x512 matrix with zeros at the end of each row. Adding those zeros means shifting all rows, which is a very costly operation.
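
To make the "shifting all rows" point concrete, here is a minimal pure-Python sketch. It assumes the matrix is stored as one flat row-major buffer, which the quote does not state; the helper `pad_rows` and all variable names are hypothetical, and this is not how any particular kernel implements padding.

```python
def pad_rows(flat, rows, cols, padded_cols):
    """Copy a row-major flat buffer into a wider one, zero-padding each row.

    Hypothetical helper for illustration only.
    """
    if padded_cols < cols:
        raise ValueError("padded_cols must be >= cols")
    if len(flat) != rows * cols:
        raise ValueError("buffer length does not match rows * cols")

    out = [0.0] * (rows * padded_cols)
    for r in range(rows):
        src = r * cols          # where row r starts in the original buffer
        dst = r * padded_cols   # where row r starts in the padded buffer
        out[dst:dst + cols] = flat[src:src + cols]
    return out


rows, cols, padded_cols = 512, 511, 512
flat = [float(i) for i in range(rows * cols)]
padded = pad_rows(flat, rows, cols, padded_cols)

# Row r moves by r * (padded_cols - cols) elements, so every row
# except row 0 lands at a new offset and has to be copied.
moved_rows = sum(1 for r in range(rows) if r * cols != r * padded_cols)
```

In this layout, one extra zero per row is enough to change the starting offset of every row after the first, so nearly the whole matrix is copied just to append 512 zeros.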