Ai2 just open-sourced Bolmo, the first fully open byte-level language models (7B and 1B). Instead of relying on a tokenizer, they operate directly on raw UTF-8 bytes, which means better handling of typos, rare languages, and messy real-world text. Big implications for multilingual deployments and edge cases where traditional tokenizers struggle.
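
For a rough sense of what "raw UTF-8 bytes" means as model input, here is a minimal Python sketch. It assumes the simplest possible mapping, one integer ID (0-255) per byte; `to_byte_ids` is a hypothetical helper for illustration, not part of Bolmo's code, and Bolmo's actual input pipeline may differ.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer ID (0-255) per byte."""
    return list(text.encode("utf-8"))


# A typo changes only the bytes at the swapped positions; every other
# position in the sequence stays identical.
clean = to_byte_ids("language")
typo = to_byte_ids("langauge")
diff = [i for i, (a, b) in enumerate(zip(clean, typo)) if a != b]
print(len(clean), len(typo), diff)

# Non-ASCII characters become multi-byte sequences, so any script can be
# represented with the same 256 possible values.
print(to_byte_ids("ñ"))
```

A subword tokenizer may split the misspelled word into quite different pieces than the correct one, while at the byte level the two inputs differ only where the letters were swapped.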