I follow the argument, but the proof of the pudding is in the eating. I don’t know which “battles” the author lost to PyTorch lately, but a good test would be to modify one of the smaller models (maybe nanoGPT) and swap out all of the softmax calls for his quiet softmax.
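For concreteness, here is a minimal sketch of what that swap might look like, assuming the “quiet softmax” is the softmax variant that adds 1 to the denominator (so an attention head can emit near-zero total weight instead of being forced to distribute a full unit of attention). Written in NumPy so it's self-contained; the function names are mine, not from TFA:

```python
import numpy as np

def softmax(x, axis=-1):
    # Standard softmax: outputs always sum to exactly 1.
    z = x - np.max(x, axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def quiet_softmax(x, axis=-1):
    # "Quiet" softmax: e^{x_i} / (1 + sum_j e^{x_j}).
    # When every logit is very negative, the denominator stays near 1,
    # so the head can output (near-)zero total attention and "abstain".
    m = np.maximum(np.max(x, axis=axis, keepdims=True), 0.0)  # stability shift
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum(axis=axis, keepdims=True))

logits = np.array([-8.0, -9.0, -10.0])
print(softmax(logits).sum())        # always 1, no matter how negative the logits
print(quiet_softmax(logits).sum())  # close to 0: the head can opt out
```

In nanoGPT the experiment would amount to replacing the `F.softmax(att, dim=-1)` call in the attention block with a function like `quiet_softmax` and comparing training curves and outlier activations against the baseline.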
I didn’t see anything relevant on alternatives to softmax, since TFA is specifically questioning softmax in a multihead attention context.
Ultimately, neural networks are arbitrary function approximators. It doesn’t necessarily have to be “right” internally to fit the data. But if this new softmax allows transformers to learn more, that’s great.