4-bit quantization is starting to be supported for training as well on newer NVIDIA hardware. I believe the gpt-oss models were trained natively in MXFP4, whose elements are 4-bit floating-point values in E2M1 format (1 sign bit, 2 exponent bits, 1 mantissa bit).
It doesn't seem terribly common yet, though; I think keeping training numerically stable at that precision is the hard part.
MXFP4 is a block-scaled floating-point format. The E2M1 encoding applies to individual values, but each block of 32 values also carries a shared 8-bit scale (E8M0, a bare power-of-two exponent) that scales the whole block.
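To make the layout concrete, here's a minimal sketch of decoding one MXFP4 block in Python, assuming the OCP Microscaling spec's encoding (E2M1 elements with exponent bias 1, and an E8M0 shared scale with bias 127; special scale values like NaN are ignored here for simplicity):

```python
def e2m1_to_float(nibble: int) -> float:
    """Decode one 4-bit E2M1 value: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (nibble >> 3) & 1 else 1.0
    e = (nibble >> 1) & 0b11
    m = nibble & 1
    if e == 0:
        mag = 0.5 * m                     # subnormal range: 0 or 0.5
    else:
        mag = 2.0 ** (e - 1) * (1 + 0.5 * m)  # exponent bias is 1
    return sign * mag

def decode_mxfp4_block(nibbles: list[int], scale_byte: int) -> list[float]:
    """Decode a 32-value MXFP4 block given its shared E8M0 scale byte."""
    # E8M0 is just an 8-bit exponent with bias 127, i.e. a power-of-two scale.
    scale = 2.0 ** (scale_byte - 127)
    return [e2m1_to_float(n) * scale for n in nibbles]

# Example: scale_byte 127 gives scale 1.0, so elements decode directly.
# The representable E2M1 magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, 6.
decode_mxfp4_block([0b0000, 0b0001, 0b0111, 0b1111], 127)
# -> [0.0, 0.5, 6.0, -6.0]
```

Since the shared scale is a pure power of two, scaling a block never loses precision in the elements themselves; it just shifts the representable range to fit the block's largest magnitude.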
[1] https://www.opencompute.org/blog/amd-arm-intel-meta-microsof...
[2] https://www.opencompute.org/documents/ocp-microscaling-forma...