KV Sharing, MHC, and Compressed Attention
(news.ycombinator.com)
ZAYA1-8B matches DeepSeek-R1 on math with less than 1B active parameters
(news.ycombinator.com)
ZAYA1-8B: An 8B MoE Model with 760M Active Params Matching DeepSeek-R1 on Math
(news.ycombinator.com)