How does the merge operation in Branch-Train-Merge differ from the one in the ZipIt! paper? It seems that combining the branch-and-train idea for domain experts with the zipping operation would remove the need for explicit routing?
IIRC Branch-Train-Merge relies on all the branches starting from the same model, so that they all stay in roughly the same loss basin and we can just average the params elementwise. So yes, with zipping we might be able to drop the requirement of a shared initial model (though probably not the requirement of a shared architecture).
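To make the "just average the params elementwise" part concrete, here is a rough sketch in plain Python. It is not the code from either paper: `average_params`, the dict-of-lists parameter layout, and the optional `weights` argument are all made up for illustration, and it only makes sense if every expert has the same parameter names and shapes (which is what starting from a shared model gives you).

```python
def average_params(experts, weights=None):
    """Elementwise (optionally weighted) average of several parameter sets.

    `experts` is a list of dicts mapping a parameter name to a flat list of
    floats. This layout is a hypothetical stand-in for a real state dict.
    """
    if not experts:
        raise ValueError("need at least one expert")
    if weights is None:
        # Default assumption: every expert contributes equally.
        weights = [1.0 / len(experts)] * len(experts)
    if len(weights) != len(experts):
        raise ValueError("one weight per expert is required")

    reference = experts[0]
    for expert in experts[1:]:
        # Averaging index-by-index only works if names and shapes line up.
        if expert.keys() != reference.keys():
            raise ValueError("experts must share the same parameter names")
        for name in reference:
            if len(expert[name]) != len(reference[name]):
                raise ValueError(f"shape mismatch for parameter {name!r}")

    merged = {}
    for name in reference:
        # Group the i-th value of this parameter across all experts.
        columns = zip(*(expert[name] for expert in experts))
        merged[name] = [
            sum(w * v for w, v in zip(weights, column)) for column in columns
        ]
    return merged


expert_a = {"w": [1.0, 2.0], "b": [0.0]}
expert_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = average_params([expert_a, expert_b])
```

Note that the average pairs up parameters purely by index, so it implicitly assumes unit i in one expert does the same job as unit i in the other. That is the assumption a shared initialization is supposed to make roughly true, and the one that zipping tries to get around by matching features before merging.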
this is great content!