Big. In the late '80s I built an NN with phonemes as the equivalent of tokens. I'm happy and pleasantly surprised that Meta's taken this route, as it circumvents so many problems we've seen emerge. (I'm giddy, really.) lol
Source date (UTC): 2025-05-13 16:11:15 UTC
Original post: https://twitter.com/i/web/status/1922323767214162430
Reply addressees: @rohanpaul_ai @AIatMeta
Replying to: https://twitter.com/i/web/status/1921976957991854432
IN REPLY TO:
@rohanpaul_ai
This LLM from Meta skips the tokenizer, reading text like computers do (bytes!) for better performance.
@AIatMeta just released the model weights for their 8B-param Dynamic Byte Latent Transformer (BLT), an alternative to traditional tokenization.
BLT is a tokenizer-free LLM architecture that processes raw bytes directly.
→ Traditional LLMs rely on tokenization, a pre-processing step that groups bytes into a fixed vocabulary. This introduces issues such as domain sensitivity, poor handling of noise, a lack of orthographic knowledge, and multilingual inequity. It is also computationally suboptimal, since compute is allocated uniformly per token regardless of information density.
→ BLT tackles this with a dynamic, learnable method that groups bytes into patches based on context, using the entropy of a small model's next-byte prediction. This allows allocating more compute where it is needed (high-entropy bytes) and less where it is not (predictable sequences such as whitespace).
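A minimal sketch of that entropy-patching rule, assuming a global threshold over per-position entropies from a small byte-level LM. The function name, threshold value, and toy numbers are illustrative, not taken from the released code:

```python
import numpy as np

def entropy_patch_boundaries(entropies: np.ndarray, threshold: float = 1.5) -> list[int]:
    """Return the start indices of patches.

    A new patch begins at byte i whenever the small byte-LM's
    next-byte entropy H(x_i | x_<i) exceeds a global threshold,
    so "surprising" bytes open fresh patches while predictable
    runs (e.g. whitespace) get absorbed into long ones.
    """
    boundaries = [0]  # the first byte always starts a patch
    for i in range(1, len(entropies)):
        if entropies[i] > threshold:
            boundaries.append(i)
    return boundaries

# Toy example: a low-entropy run followed by one surprising byte.
ent = np.array([2.1, 0.3, 0.2, 0.1, 1.9, 0.4, 0.2])
print(entropy_patch_boundaries(ent))  # -> [0, 4]
```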
→ The architecture features three transformer blocks: two small byte-level models (Local Encoder and Local Decoder) and a large Latent Global Transformer. The Local Encoder maps input bytes to patch representations using cross-attention, the Latent Transformer processes these patches autoregressively, and the Local Decoder converts patch outputs back to bytes, again using cross-attention. Hash n-gram embeddings are added to byte embeddings to incorporate contextual information efficiently.
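A structural sketch of that three-block layout in plain PyTorch. The dimensions, the pooling rule (using the byte at each patch start as the cross-attention query), and the omission of causal masks and hash n-gram embeddings are all simplifications for readability, not Meta's released configuration:

```python
import torch
import torch.nn as nn

class BLTSketch(nn.Module):
    """Illustrative three-module layout: small byte-level encoder/decoder
    around a large latent transformer that runs once per patch."""

    def __init__(self, d_byte=256, d_global=1024, n_heads=8):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_byte)  # raw bytes, vocab = 256
        self.local_encoder = nn.TransformerEncoderLayer(d_byte, n_heads, batch_first=True)
        self.enc_cross = nn.MultiheadAttention(d_byte, n_heads, batch_first=True)
        self.to_global = nn.Linear(d_byte, d_global)
        self.global_tf = nn.TransformerEncoderLayer(d_global, n_heads, batch_first=True)
        self.from_global = nn.Linear(d_global, d_byte)
        self.dec_cross = nn.MultiheadAttention(d_byte, n_heads, batch_first=True)
        self.local_decoder = nn.TransformerEncoderLayer(d_byte, n_heads, batch_first=True)
        self.byte_head = nn.Linear(d_byte, 256)

    def forward(self, byte_ids, patch_starts):
        h = self.local_encoder(self.byte_emb(byte_ids))  # small byte-level encoder
        # Local Encoder -> patches: cross-attention pools bytes into one
        # query per patch (here, the byte at each patch start is the query).
        q = h[:, patch_starts, :]
        patches, _ = self.enc_cross(q, h, h)
        # Big Latent Global Transformer runs over patches, not bytes.
        patches = self.global_tf(self.to_global(patches))
        # Local Decoder: byte positions cross-attend back into patch states.
        ctx = self.from_global(patches)
        out, _ = self.dec_cross(h, ctx, ctx)
        return self.byte_head(self.local_decoder(out))  # next-byte logits

# Usage with the toy boundaries from the entropy sketch above:
x = torch.randint(0, 256, (1, 7))
logits = BLTSketch()(x, patch_starts=[0, 4])  # -> shape (1, 7, 256)
```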
→ Performance Benchmark
BLT matches the training FLOP-controlled performance of Llama 3 up to 8B parameters and 4T training bytes. It can achieve up to 50% fewer inference FLOPs than Llama 3 by using larger average patch sizes (e.g., 8 bytes vs. Llama 3’s 4.4 bytes).
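Roughly where that figure comes from: the big latent transformer runs once per patch, so its step count scales with bytes divided by patch size. This back-of-the-envelope uses only the averages quoted above and ignores the small local models' overhead, so the true saving depends on the full FLOP accounting:

```python
# Global-transformer steps scale as (sequence bytes / patch size).
llama3_bytes_per_token = 4.4
blt_bytes_per_patch = 8.0
ratio = llama3_bytes_per_token / blt_bytes_per_patch
print(f"{1 - ratio:.0%} fewer global steps")  # -> 45% fewer global steps
```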
BLT also shows enhanced robustness, outperforming Llama 3 (trained on 1T tokens) by an average of 8 points on noised HellaSwag and by large margins on character-manipulation benchmarks such as CUTE (+25 points). BLT likewise shows better scaling trends in fixed-inference-cost settings, where model size and patch size can be increased simultaneously.
The training and inference code for BLT is released on GitHub.
Original post: https://twitter.com/i/web/status/1921976957991854432