VRAM not freed during long generations (Gemma, max_new_tokens=3000)
When using the official Gemma example code but changing max_new_tokens=200 to 3000, I get a CUDA error:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED during cublasSgemm call.
Additionally, even when the model gives a short response, VRAM remains occupied until all 3000 tokens are processed.
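For reference, this is roughly what I'm running (a minimal sketch; the model ID and prompt are placeholders, and the only change from the official example is max_new_tokens):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder model ID -- same pattern as the official Gemma example
model_id = "google/gemma-2-9b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("Write a short poem about the sea.", return_tensors="pt").to(model.device)

# Only change from the official example: max_new_tokens 200 -> 3000
outputs = model.generate(**inputs, max_new_tokens=3000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```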
Hi @Nessit,
Specifying max_new_tokens=3000 tells the model to prepare memory for generating up to 3000 tokens, regardless of how many are actually generated.
Even if the model replies with only a few tokens, the full memory buffer is still allocated and that memory stays locked until the process is done.
To solve this issue, try increasing max_new_tokens gradually: 200 → 500 → 1000, and monitor usage.
Also, using half-precision or quantized versions of the model can help save memory and improve performance.
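For example, a quantized load could look something like this (a rough sketch using bitsandbytes 4-bit quantization; the model ID is a placeholder, use whichever Gemma variant you have):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2-9b-it"  # placeholder

# 4-bit quantization roughly quarters the weight memory compared to FP16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
# Start with a smaller budget and scale up while monitoring VRAM
outputs = model.generate(**inputs, max_new_tokens=500)
```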
I successfully executed the official Gemma example code in Google Colab (Runtime type: T4 GPU) with max_new_tokens=3000; please refer to this gist file.
Thank you.
Thank you for your answer! I understand it, but I'm still encountering an issue with GPU utilization. When I ask short questions, I receive short responses, but the GPU remains occupied for an extended period after the answer is complete. I can't perform any other operations until this process finishes, which suggests the stop token might not be functioning properly.
For comparison:
With Qwen, using 3000 tokens allows me to ask both long and short questions - the GPU releases immediately after the answer appears.
With Gemma, regardless of question length or answer size, the GPU stays busy for the full duration needed to process 3000 tokens, blocking further operations.
This behavior significantly impacts workflow efficiency. Is there a way to make Gemma release GPU resources immediately after generating the complete answer, like Qwen does?
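For reference, here is roughly the kind of generate call I mean (a sketch with Transformers; the model ID is a placeholder, and passing Gemma's `<end_of_turn>` token as an extra eos_token_id is only my guess at how to make it stop early):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2-9b-it"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is 2 + 2?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Pass both the regular EOS and Gemma's turn-end token so generation can
# stop as soon as the answer is complete instead of running to 3000 tokens.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<end_of_turn>"),
]

outputs = model.generate(inputs, max_new_tokens=3000, eos_token_id=terminators)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```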
It seems the issue is with memory allocation for 3000 tokens; try gradually reducing max_new_tokens, using half-precision (FP16) or quantized models, and manually releasing memory with torch.cuda.empty_cache() after each generation.
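Something along these lines (a rough sketch; the function and its arguments are just placeholders for your own generation loop):

```python
import gc
import torch

def generate_and_free(model, tokenizer, prompt, max_new_tokens=500):
    """Generate a reply, then release cached GPU memory right away."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Drop references to the large tensors, then hand cached blocks back to the allocator
    del inputs, outputs
    gc.collect()
    torch.cuda.empty_cache()

    print(f"Allocated after cleanup: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    return text
```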
I see it's possible to do memory optimizations. I am using a quantized 27B model (GGUF) which works fantastically on a passively cooled 24 GB RTX Quadro in LM Studio, and it's certainly possible to push this by tweaking token allocation; for example, increasing the context window notably increases its memory use with respect to the chat log, which was quite surprising. However, I wonder if it's possible for the model to "forget", possibly with a smart selection of what is important, or maybe to compress information somehow. Since it's a fully vision-enabled model, you can overload it fairly quickly by showing it some high-res visual data. Is there any other mechanism to lose tokens except the PyTorch cache cleanup?
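The only workaround I can think of so far is pruning old turns from the chat history myself before each call, roughly like this (just an illustrative sketch over the usual list-of-messages format, nothing Gemma-specific):

```python
def prune_history(messages, max_turns=6, keep_system=True):
    """Keep only the most recent turns so the context (and KV cache) stays small.

    `messages` is the usual list of {"role": ..., "content": ...} dicts that
    gets passed to tokenizer.apply_chat_template(); this is only an
    illustration of dropping old turns, not a built-in API.
    """
    system = [m for m in messages if m["role"] == "system"] if keep_system else []
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]
```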