We split the text with a regex that mimics the BPE (Byte-Pair Encoding) pre-tokenization used by modern LLMs: contractions, letter runs, digit runs, symbol runs, and whitespace become separate segments. ASCII Latin characters are aggregated and converted at ~4 characters per token (OpenAI's rule of thumb); digits at ~3 per token; accented Latin, Cyrillic, Arabic, and other non-Latin scripts at ~1.5 characters per token; CJK ideographs and Japanese/Korean at ~1 token per character. A single leading space is absorbed into the word that follows (as BPE does); only longer whitespace runs or newlines add a token. The result is a generic estimate representative of any modern frontier model, since their tokenizers diverge by less than ~10% on typical text.
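The heuristic above can be sketched in a few lines of Python. The regex and the Unicode ranges below are illustrative assumptions chosen to match the description, not the exact rules of any real tokenizer:

```python
import math
import re

# BPE-style pre-tokenization sketch: contractions, letter runs, digit runs,
# symbol runs, and whitespace become separate segments. A single leading
# space is attached to the run that follows it, as BPE pre-tokenizers do.
SEGMENT_RE = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # common English contractions
    r"| ?[^\W\d_]+"            # optional leading space + letter run
    r"| ?\d+"                  # optional leading space + digit run
    r"| ?[^\s\w]+"             # optional leading space + symbol run
    r"|\s+"                    # remaining whitespace runs
)

def estimate_tokens(text: str) -> int:
    """Rough token estimate using per-script chars-per-token rates."""
    ascii_chars = digit_chars = other_chars = tokens = 0
    for seg in SEGMENT_RE.findall(text):
        if seg.isspace():
            # Only multi-character whitespace runs or newlines cost a token;
            # a stray single space is free.
            if len(seg) > 1 or "\n" in seg:
                tokens += 1
            continue
        body = seg.lstrip(" ")  # absorbed leading space is not counted
        for c in body:
            cp = ord(c)
            if c.isdigit():
                digit_chars += 1          # digits: ~3 chars per token
            elif cp < 128:
                ascii_chars += 1          # ASCII Latin: ~4 chars per token
            elif (0x4E00 <= cp <= 0x9FFF      # CJK ideographs
                  or 0x3040 <= cp <= 0x30FF   # Japanese kana
                  or 0xAC00 <= cp <= 0xD7AF): # Korean hangul
                tokens += 1               # ~1 token per character
            else:
                other_chars += 1          # other scripts: ~1.5 chars per token
    # Aggregate each pool across the whole text, then convert.
    tokens += math.ceil(ascii_chars / 4)
    tokens += math.ceil(digit_chars / 3)
    tokens += math.ceil(other_chars / 1.5)
    return tokens
```

For example, `estimate_tokens("Hello world")` aggregates 10 ASCII letters into 3 estimated tokens, while each character of a CJK string counts as a full token.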