ChatGPT tokens and Unicode
The Endeavour 2025-03-08
Summary:
I mentioned in the previous post that not every Unicode character corresponds to a token in ChatGPT. Specifically, I’m looking at gpt-3.5-turbo in tiktoken. There are 100,256 possible tokens and 155,063 Unicode characters, so the pigeonhole principle says not every character corresponds to a token. I was curious about the relationship between tokens and […]
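To make the pigeonhole argument concrete, here is a minimal Python sketch. The two counts are the ones quoted above. The helper `multi_token_chars` and the sample characters are hypothetical names and data chosen for illustration, not taken from the post. The `demo_with_tiktoken` function assumes the `tiktoken` package is installed and able to load its encoding data.

```python
from typing import Callable, Iterable, List

N_TOKENS = 100_256  # token count quoted in the post
N_CHARS = 155_063   # Unicode character count quoted in the post

# Pigeonhole bound: even if every token were a single character,
# at least this many characters would have no token of their own.
MIN_CHARS_WITHOUT_TOKEN = N_CHARS - N_TOKENS


def multi_token_chars(
    encode: Callable[[str], List[int]], chars: Iterable[str]
) -> List[str]:
    """Return the characters that `encode` maps to more than one token."""
    return [c for c in chars if len(encode(c)) > 1]


def demo_with_tiktoken() -> None:
    """Check a few characters against the gpt-3.5-turbo encoding.

    Assumes `tiktoken` is installed and can load its encoding data.
    """
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    sample = ["a", "é", "π", "√", "😀"]  # arbitrary sample, not from the post
    print(multi_token_chars(enc.encode, sample))
```

The bound works out to 54,807 characters. It is only a lower bound: many tokens are words or word fragments rather than single characters, so the true number of characters without a token of their own is presumably much larger. Calling `demo_with_tiktoken()` shows which of the sample characters need more than one token.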
The post ChatGPT tokens and Unicode first appeared on John D. Cook.