Questions on how routed experts are merged

#1
by chuhac - opened

Thank you all for this great work, but I have questions about:

custom merge of R1s and V3s routed experts

After carefully checking the weights, I found that the actual custom merge is that R1T reuses the expert routing gate and the shared experts from the V3-0324 model while taking all routed experts from the R1 model.
I would like to discuss the inductive bias behind this design here, since the R1T model seems to behave very well despite the mismatch between weights coming from different base models.

It appears to be as smart as R1 but much faster, using 40% fewer output tokens.

If anyone is interested in the details of the claims above, I'm glad to share the code I used to examine the expert merge (a rough sketch follows the list below). A few takeaways:

  • Embedding: Reused from the V3 model.
  • Attention: Reused from the V3 model.
  • Dense Blocks
    • The first three dense blocks are reused entirely from the V3 model.
  • MoE Blocks
    • Expert Parameters
      • Shared Experts: Reused directly from the V3 model.
      • Routed Experts: All routed experts are reused from the R1 model.
    • Expert Router: Reused from the V3 model.
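
Roughly, the check looks like this (a simplified sketch, not the exact script I ran: it assumes full, unsharded state dicts that fit in memory, local paths are placeholders, and the tensor key patterns follow DeepSeek-V3's Hugging Face naming, so they may need adjusting for the real sharded checkpoints):

```python
# Simplified per-tensor comparison between R1T and its two parents.
import torch
from safetensors.torch import load_file

r1t = load_file("R1T-Chimera/model.safetensors")      # placeholder local paths
v3  = load_file("DeepSeek-V3-0324/model.safetensors")
r1  = load_file("DeepSeek-R1/model.safetensors")

def source_of(key: str) -> str:
    """Return which parent the given R1T tensor matches exactly (if any)."""
    t = r1t[key]
    same_v3 = key in v3 and torch.equal(t, v3[key])
    same_r1 = key in r1 and torch.equal(t, r1[key])
    if same_v3 and same_r1:
        return "identical in both parents"
    if same_v3:
        return "V3-0324"
    if same_r1:
        return "R1"
    return "neither (mixed or new)"

# Illustrative keys, assuming DeepSeek-V3-style naming:
for key in [
    "model.embed_tokens.weight",                          # embedding
    "model.layers.5.mlp.gate.weight",                     # expert router of an MoE block
    "model.layers.5.mlp.shared_experts.up_proj.weight",   # shared expert
    "model.layers.5.mlp.experts.0.up_proj.weight",        # a routed expert
]:
    print(key, "->", source_of(key))
```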

Well, it's a merge, so of course they are reusing modules from both models. I fail to see your point?
They never claimed to have "created" the ultimate model; they merged the best of two worlds, which is an amazing feat of its own, tbh.

@Daemontatox My point is that the claimed custom merge suggests some careful blending of the routed experts from the R1 and V3 models, with the routing gate somehow also handled in a special way.

In reality, however, the R1T model simply reuses all routed experts from the R1 model while reusing the routing gate from the V3 model. The inductive bias behind that choice is what interests me here.

TNG Technology Consulting GmbH org

Dear Junda,

Thanks for your work, your very cool answer, and your thoughts. Your analysis of the released model and its construction blueprint is exactly right - bravo!

We consider the released model a custom merge, as the MoE layers consist of selected parts of both parent models, as you pointed out. Of course, the zoo of possible combinations is much bigger. For example, we also created versions in which the parameters of the routed experts themselves are linearly mixed between V3-0324 and R1.

Pretty much all mixing settings that we tried resulted in workable models, so the construction process seems robust.

Nonetheless, the weight-mixing dimension has interesting properties. For example, if you mix the routed experts' weights 50%-50% between the two parent models, the reasoning traces visible in pure R1 output are not there. However, if you increase R1's share in the mix to 75%, the visible reasoning traces are there (!), and they have a subjectively nice, slightly more compact style compared to R1 itself.
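
To make that mixing dimension concrete, the idea can be sketched roughly as below (this is illustrative only, not our actual merge code; the key pattern and helper name are assumptions based on DeepSeek-V3's Hugging Face naming, and it presumes unsharded state dicts):

```python
import re
import torch

# Routed experts in DeepSeek-V3-style checkpoints live under keys like
# "model.layers.<i>.mlp.experts.<j>.<proj>.weight" (assumed naming).
ROUTED_EXPERT_KEY = re.compile(r"model\.layers\.\d+\.mlp\.experts\.\d+\.")

def merge_state_dicts(v3_sd: dict, r1_sd: dict, r1_share: float = 0.75) -> dict:
    """Blend only the routed-expert weights; take everything else (embeddings,
    attention, dense blocks, router, shared experts) from V3-0324."""
    merged = {}
    for key, v3_tensor in v3_sd.items():
        if ROUTED_EXPERT_KEY.match(key):
            # Linear interpolation of the expert weights:
            #   r1_share = 1.0 -> pure R1 routed experts (the released R1T layout)
            #   r1_share = 0.5 -> the 50/50 mix discussed above
            mixed = torch.lerp(v3_tensor.float(), r1_sd[key].float(), r1_share)
            merged[key] = mixed.to(v3_tensor.dtype)
        else:
            merged[key] = v3_tensor
    return merged
```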

If you would like to share your code, e.g. here on HF, we'd love to take a look, and others surely as well - thank you :-).

Cheers!
