We split the text with a regex that mimics the BPE (Byte-Pair Encoding) pre-tokenization used by modern LLMs: contractions, letter runs, digit runs, symbol runs, and whitespace become separate segments. ASCII Latin characters are aggregated and converted at ~4 characters per token (OpenAI's rule of thumb); digits at ~3 per token; accented Latin, Cyrillic, Arabic, and other non-Latin scripts at ~1.5 characters per token; CJK ideographs and Japanese/Korean at ~1 token per character. A single leading space is absorbed into the word that follows (as BPE does); only longer whitespace runs or newlines add a token. The result is a generic estimate representative of any modern frontier model, since their tokenizers diverge by less than ~10% on typical text.
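The heuristic above can be sketched in a few lines of Python. The regex and the Unicode ranges below are illustrative assumptions chosen to match the description, not the exact rules of any real tokenizer:

```python
import math
import re

# BPE-style pre-tokenization sketch: contractions, letter runs, digit runs,
# symbol runs, and whitespace become separate segments. A single leading
# space is attached to the run that follows it, as BPE pre-tokenizers do.
SEGMENT_RE = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # common English contractions
    r"| ?[^\W\d_]+"            # optional leading space + letter run
    r"| ?\d+"                  # optional leading space + digit run
    r"| ?[^\s\w]+"             # optional leading space + symbol run
    r"|\s+"                    # remaining whitespace runs
)

def estimate_tokens(text: str) -> int:
    """Rough token estimate using per-script chars-per-token rates."""
    ascii_chars = digit_chars = other_chars = tokens = 0
    for seg in SEGMENT_RE.findall(text):
        if seg.isspace():
            # Only multi-character whitespace runs or newlines cost a token;
            # a stray single space is free.
            if len(seg) > 1 or "\n" in seg:
                tokens += 1
            continue
        body = seg.lstrip(" ")  # absorbed leading space is not counted
        for c in body:
            cp = ord(c)
            if c.isdigit():
                digit_chars += 1          # digits: ~3 chars per token
            elif cp < 128:
                ascii_chars += 1          # ASCII Latin: ~4 chars per token
            elif (0x4E00 <= cp <= 0x9FFF      # CJK ideographs
                  or 0x3040 <= cp <= 0x30FF   # Japanese kana
                  or 0xAC00 <= cp <= 0xD7AF): # Korean hangul
                tokens += 1               # ~1 token per character
            else:
                other_chars += 1          # other scripts: ~1.5 chars per token
    # Aggregate each pool across the whole text, then convert.
    tokens += math.ceil(ascii_chars / 4)
    tokens += math.ceil(digit_chars / 3)
    tokens += math.ceil(other_chars / 1.5)
    return tokens
```

For example, `estimate_tokens("Hello world")` aggregates 10 ASCII letters into 3 estimated tokens, while each character of a CJK string counts as a full token.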