ChatGPT tokens and Unicode

The Endeavour 2025-03-08

Summary:

I mentioned in the previous post that not every Unicode character corresponds to a token in ChatGPT. Specifically, I’m looking at gpt-3.5-turbo in tiktoken. There are 100,256 possible tokens and 155,063 Unicode characters, so by the pigeonhole principle not every character can correspond to its own token. I was curious about the relationship between tokens and […]
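
The counting argument in the summary can be checked with a few lines of Python. This is only a sketch under stated assumptions: the two counts are copied from the summary, the helper names `min_chars_without_own_token` and `is_single_token` are made up for illustration, and the second helper assumes the third-party tiktoken package is installed and able to load the encoding for gpt-3.5-turbo. It is not necessarily how the original post did its exploration.

```python
# Counts quoted in the summary above.
N_TOKENS = 100_256  # possible tokens for gpt-3.5-turbo, per the post
N_CHARS = 155_063   # Unicode characters, per the post


def min_chars_without_own_token(n_chars: int, n_tokens: int) -> int:
    """Pigeonhole lower bound: characters that cannot get a distinct token."""
    return max(0, n_chars - n_tokens)


def is_single_token(ch: str, model: str = "gpt-3.5-turbo") -> bool:
    """Return True if `ch` encodes to exactly one token.

    Assumption: tiktoken is installed and can load the encoding for `model`.
    """
    import tiktoken  # imported lazily so the arithmetic above runs without it

    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(ch)) == 1


# At least this many characters must share tokens or span several tokens.
print(min_chars_without_own_token(N_CHARS, N_TOKENS))  # prints 54807
```

The bound of 54,807 is only a floor. Many tokens stand for multi-character strings rather than single characters, so the number of characters without a token of their own is likely to be larger. With tiktoken available, a call such as `is_single_token("a")` would test an individual character.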

The post ChatGPT tokens and Unicode first appeared on John D. Cook.

Link:

https://www.johndcook.com/blog/2025/03/08/chatgpt-tokens-and-unicode/

From feeds:

Statistics and Visualization » The Endeavour

Tags:

ai

Authors:

John

Date tagged:

03/08/2025, 14:04

Date published:

03/08/2025, 13:06