Finally, we provide an example of a complete language model: a deep sequence model backbone (with repeating Mamba blocks) + a language model head.
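As a concrete sketch of that skeleton (not the reference implementation), here is a minimal PyTorch version, assuming the `mamba_ssm` package supplies the `Mamba` block; the class names `ResidualMambaBlock` and `MambaLM`, the layer count, and the default hyperparameters are illustrative choices.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumes `pip install mamba-ssm` (CUDA-only kernels)


class ResidualMambaBlock(nn.Module):
    """Pre-norm residual wrapper around a single Mamba mixer."""

    def __init__(self, d_model: int):
        super().__init__()
        # The official code uses RMSNorm; LayerNorm is a stand-in that ships with PyTorch.
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        return x + self.mixer(self.norm(x))     # residual connection


class MambaLM(nn.Module):
    """Deep sequence-model backbone (repeating Mamba blocks) + language model head."""

    def __init__(self, vocab_size: int, d_model: int = 512, n_layers: int = 8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            [ResidualMambaBlock(d_model) for _ in range(n_layers)]
        )
        self.norm_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embedding.weight  # tie input/output embeddings

    def forward(self, input_ids):               # input_ids: (batch, seq_len)
        x = self.embedding(input_ids)
        for layer in self.layers:               # the repeating backbone
            x = layer(x)
        x = self.norm_f(x)
        return self.lm_head(x)                  # logits: (batch, seq_len, vocab_size)


# Usage (requires a CUDA device, since mamba_ssm's kernels are GPU-only):
# model = MambaLM(vocab_size=256).to("cuda")   # byte-level vocabulary
# logits = model(torch.randint(0, 256, (1, 1024), device="cuda"))
```

The structure mirrors the description above: token embedding, a stack of identical Mamba blocks each wrapped in a pre-norm residual connection, a final normalization, and a linear language model head (tied to the embedding weights here).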
When working on byte-sized tokens, transformers scale badly because every token attends to every other token, so compute and memory grow quadratically with sequence length.
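A back-of-the-envelope sketch of that scaling claim, using assumed round numbers rather than measurements: switching from word-level to byte-level tokens makes sequences several times longer, which multiplies a quadratic-cost layer far more than a linear-cost one.

```python
# Rough illustration only: the 5x length factor is an assumption, not a measurement.
word_level_len = 2_048                 # hypothetical context length in word-level tokens
byte_level_len = 5 * word_level_len    # ~5 bytes per word-level token (assumed)

attention_cost_ratio = (byte_level_len / word_level_len) ** 2  # quadratic in sequence length
linear_cost_ratio = byte_level_len / word_level_len            # linear in sequence length

print(f"self-attention cost grows ~{attention_cost_ratio:.0f}x")        # ~25x
print(f"linear-time (Mamba-style) cost grows ~{linear_cost_ratio:.0f}x") # ~5x
```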