Setting inf in the attention matrix
It seems like in lines 426-428 you set everything before the sink tokens to have a +inf attention score, and everything in the question tokens to also have +inf attention. Wouldn't torch.topk in line 439 then simply pick the first k inf values? Maybe I have a misunderstanding, but I thought the sink tokens and question tokens should be set to -inf, because otherwise what's the point of calculating the attention matrix if torch.topk is just going to select the +inf values? Please correct me if I'm wrong.
Hello!
Thanks for the question.
In this implementation we always keep the first sink_tokens and the question tokens; everything in between is instead selected according to its score (importance) with respect to the task/few-shot examples/question. Sink tokens are kept because all succeeding tokens attend to them (see https://arxiv.org/pdf/2309.17453), and question tokens are kept because the model needs them to solve the task. So yes, torch.topk does pick the +inf positions first, and that is intentional: it guarantees the sink and question tokens are always retained, while the remaining slots go to the highest-scoring tokens in between.
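A rough sketch of this selection step, assuming the repo's actual code (the function name and arguments here are hypothetical, and the real implementation operates on a full attention matrix rather than a 1-D score vector):

```python
import torch

def select_tokens_to_keep(scores: torch.Tensor, sink_tokens: int,
                          question_len: int, k: int) -> torch.Tensor:
    """Keep the first `sink_tokens` and last `question_len` positions
    unconditionally; fill the remaining budget by importance score."""
    scores = scores.clone()
    # Force sink and question positions to +inf so torch.topk
    # always selects them, regardless of their actual scores.
    scores[:sink_tokens] = float("inf")
    scores[-question_len:] = float("inf")
    # topk returns the +inf positions plus the best-scoring
    # intermediate tokens, up to k entries in total.
    keep = torch.topk(scores, k).indices
    return torch.sort(keep).values  # restore original token order
```

For example, with 2 sink tokens, 2 question tokens, and k=5, the three remaining slots minus the four forced positions leave one slot for the single highest-scoring token in the middle of the sequence.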
Best
Giulio