Replies: 3 comments 1 reply
-
It was implemented like this in #12412 - cc @MollySophia
-
It was implemented according to: https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.normalize.html
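For reference, the linked docs define the operation with a hard clamp on the norm (default $\epsilon = 10^{-12}$):

$$v = \frac{v}{\max(\lVert v \rVert_p,\ \epsilon)}$$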
-
Yeah, I see that base PyTorch uses the hard clamp, but both Llama4 and Qwen3 Next use the soft clamp (adding the epsilon instead of taking the max). Maybe we should add a flag to specify which kind of clamping is needed? I'm not sure this matters much in real-world scenarios; I'm just bringing it up since I did hit a discrepancy with a mock model.
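For concreteness, a minimal sketch of the two variants as discussed above. The function names and the `eps` value are just for illustration, and I'm assuming the soft variant adds the epsilon to the sum of squares before the rsqrt, as the HF Llama4/Qwen3 Next code does:

```python
import torch

def l2norm_soft(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # soft clamp (Llama4 / Qwen3 Next style): add eps to the sum of squares
    return x * torch.rsqrt(x.pow(2).sum(dim=-1, keepdim=True) + eps)

def l2norm_hard(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # hard clamp (GGML behavior as described above): floor the sum of squares at eps
    return x * torch.rsqrt(torch.clamp(x.pow(2).sum(dim=-1, keepdim=True), min=eps))
```

The two only diverge when the squared norm is on the order of eps, which is probably why it only showed up with a tiny mock model.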
-
Does anyone know why the l2norm implementation is different from the typical PyTorch implementations? I ran into the divergence when testing Qwen3Next with some tiny models. The PyTorch-side code does soft clamping (adding the epsilon instead of taking a max), while GGML does hard clamping (with fmax(sum, eps)). Is there a specific rationale for it?
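To make the divergence concrete, a tiny check (the values and eps are picked arbitrarily, not taken from the actual models):

```python
import torch

eps = 1e-6
x = torch.full((4,), 5e-4)  # squared norm = 1e-6, i.e. right around eps

soft = x * torch.rsqrt(x.pow(2).sum() + eps)                   # ~0.354 per element
hard = x * torch.rsqrt(torch.clamp(x.pow(2).sum(), min=eps))   # 0.5 per element
print(soft, hard)
```

When the squared norm is far from eps in either direction, the two agree closely.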