I follow the argument, but the proof of the pudding is in the eating. I don’t know which “battles” the author lost to PyTorch lately, but a good test would be to modify one of the smaller models (maybe nanoGPT) and swap out all of the softmax calls for his quiet softmax.
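For concreteness, here is a minimal sketch of what that swap might look like, assuming the “quiet softmax” is the softmax variant that adds 1 to the denominator (so an attention head can emit near-zero total weight instead of being forced to distribute a full unit of attention). Written in NumPy so it's self-contained; the function names are mine, not from TFA:

```python
import numpy as np

def softmax(x, axis=-1):
    # Standard softmax: outputs always sum to exactly 1.
    z = x - np.max(x, axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def quiet_softmax(x, axis=-1):
    # "Quiet" softmax: e^{x_i} / (1 + sum_j e^{x_j}).
    # When every logit is very negative, the denominator stays near 1,
    # so the head can output (near-)zero total attention and "abstain".
    m = np.maximum(np.max(x, axis=axis, keepdims=True), 0.0)  # stability shift
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum(axis=axis, keepdims=True))

logits = np.array([-8.0, -9.0, -10.0])
print(softmax(logits).sum())        # always 1, no matter how negative the logits
print(quiet_softmax(logits).sum())  # close to 0: the head can opt out
```

In nanoGPT the experiment would amount to replacing the `F.softmax(att, dim=-1)` call in the attention block with a function like `quiet_softmax` and comparing training curves and outlier activations against the baseline.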
I didn’t see anything relevant on alternatives to softmax, since TFA is specifically questioning softmax in a multihead attention context.
Ultimately, neural networks are arbitrary function approximators. It doesn’t necessarily have to be “right” internally to fit the data. But if this new softmax allows transformers to learn more, that’s great.