WIP: How to frame tokenization as a cool research project?

My own notes from a Slack conversation.

From a signal processing point of view, tokenization is "statistical quantization over given symbols"

  1. Quantization?
  • Yes, as in analog-to-digital signal processing 😉
  • Or as in digital-to-digital, e.g. casting a float to an int.
  2. Symbols
  • In the case of language, adopting characters as the symbols is either too coarse or too fine.
  • Go one step back: we use characters only by convention. The real symbols are the actual compressed/transmitted knowledge.
  3. Statistical
  • Data isn't scarce. Thus, we can't disregard a statistical analysis of the matter.
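To make the framing concrete, here is a minimal sketch of what "statistical quantization over given symbols" can look like in practice: a toy byte-pair-encoding-style merge loop that repeatedly fuses the statistically most frequent adjacent pair of symbols. The function names are my own, and this is an illustration of the idea, not a production tokenizer.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(text, num_merges):
    """Greedy BPE-style quantization: start from characters (the
    conventional symbols) and merge by pair frequency (the statistics)."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens
```

For example, `bpe("banana bandana", 3)` shortens the character sequence into a few higher-level symbols while still reconstructing the original text exactly, which is precisely the lossless-quantization flavor of the signal-processing analogy.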

Pointers