From a signal processing point of view, tokenization is "statistical quantization over given symbols"
- Quantization?
- Yeah, as in analog-to-digital conversion in signal processing 😉
- Or as in digital-to-digital, like casting float to int (see the sketch below).
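A minimal sketch of the quantization analogy (the toy signal and level count are made up for illustration): many continuous values get mapped onto a small, fixed set of integer codes, which is the same flavor of lossy compression tokenization performs over symbols.

```python
import numpy as np

# Toy "analog" signal: continuous float samples.
t = np.linspace(0.0, 1.0, 8)
signal = np.sin(2 * np.pi * t)

# Analog-to-digital flavor: uniform quantization onto 16 integer levels.
levels = 16
codes = np.round((signal + 1.0) / 2.0 * (levels - 1)).astype(np.int64)

# Digital-to-digital flavor: the float -> int cast mentioned above.
as_int = (signal * 127).astype(np.int8)

print(codes)   # a handful of integer codes standing in for continuous values
print(as_int)
```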
- Symbols
- In the case of language, characters are the wrong granularity to adopt as symbols: too coarse for some purposes, too fine for others.
- Going one step back: characters themselves are just a convention; the real symbols are the units of compressed/transmitted knowledge (illustrated below).
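A small illustration of the "characters are a convention" point (the example string is arbitrary): the same text can be read as Unicode code points or as raw bytes, two different symbol alphabets over the same content, and byte-level BPE tokenizers in fact start from the byte view rather than the character view.

```python
text = "naïve tokenizer"

# The same content under two symbol conventions:
as_codepoints = [ord(c) for c in text]   # "characters" (Unicode code points)
as_bytes = list(text.encode("utf-8"))    # raw bytes, a 256-symbol alphabet

print(len(as_codepoints), as_codepoints)  # 15 symbols, drawn from the huge Unicode range
print(len(as_bytes), as_bytes)            # 16 symbols ("ï" takes two bytes)
```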
- Statistical
- Data isn't scarce, so we can't disregard a statistical analysis of the matter (toy sketch below).
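A toy sketch of the "statistical quantization" step, roughly one iteration of byte-pair-encoding-style merging (the corpus and helper names are made up): pair statistics over the data decide which new symbol gets minted.

```python
from collections import Counter

corpus = "low lower lowest slow slowly"
symbols = list(corpus)  # start from the character convention

def most_frequent_pair(seq):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge(seq, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# One "statistical quantization" step: the data picks the new symbol.
pair = most_frequent_pair(symbols)
symbols = merge(symbols, pair)
print(pair, symbols[:12])
```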
- Why am I talking about it?
- Andrej's tokenizer video may ignite a trend
- YouTubers are using the word "tokens" when communicating with their (🤓-only?) audience.
- Other work
- Catchy title from Google Research: Irfan, Jose, David, Ming 🤩
- DeepMind Vision London
- Justin's UMich group on the JPEG transformer