A token is the smallest unit of text an LLM processes. A token is not a word: a word is often split into multiple tokens, and sometimes a token is only part of a word. As a rough rule of thumb, 100 tokens correspond to about 75 English words, so a 1,000-token context holds roughly 750 words. Every model has a context window, the maximum number of tokens it can process at once. GPT-4 Turbo offers a 128k-token window, which is roughly 96,000 words; Llama 2 has 4k tokens. This context window limit matters.
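The words-to-tokens conversion above is just arithmetic on the rule of thumb. A minimal sketch, assuming the approximate 0.75 words-per-token ratio (real counts depend entirely on the tokenizer and the text):

```python
# Rough rule-of-thumb conversion between tokens and English words.
# WORDS_PER_TOKEN = 0.75 is an approximation, not a tokenizer fact.
WORDS_PER_TOKEN = 0.75

def words_to_tokens(word_count: int) -> int:
    """Estimate how many tokens a given word count will occupy."""
    return round(word_count / WORDS_PER_TOKEN)

def tokens_to_words(token_count: int) -> int:
    """Estimate how many words fit in a given token budget."""
    return round(token_count * WORDS_PER_TOKEN)

print(tokens_to_words(1000))     # ~750 words
print(tokens_to_words(128_000))  # ~96,000 words
```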
It determines how much text you can give the model at once. If you're using retrieval-augmented generation to ground responses in documents, the context window caps how much retrieved text you can include. If you're building a long-form reasoning system, the context window is your binding constraint. A small window means long documents must be truncated or chunked before the model can see them.
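The retrieval constraint above often shows up in practice as a budgeting step: deciding which ranked documents fit in the window alongside the prompt and the expected response. A minimal sketch, where `estimate_tokens` is a crude word-count stand-in for a real tokenizer and the `reserved` parameter is a hypothetical allowance for prompt and response:

```python
def estimate_tokens(text: str) -> int:
    # Crude estimate: ~0.75 words per token, so tokens ~= words / 0.75.
    return round(len(text.split()) / 0.75)

def pack_documents(docs: list[str], context_window: int,
                   reserved: int = 1000) -> list[str]:
    """Greedily include documents (in ranked order) until the window,
    minus tokens reserved for the prompt and response, is used up."""
    budget = context_window - reserved
    selected, used = [], 0
    for doc in docs:
        cost = estimate_tokens(doc)
        if used + cost > budget:
            break  # next document would overflow the context window
        selected.append(doc)
        used += cost
    return selected
```

With a 4k window like Llama 2's, only a couple of medium-sized documents fit once room for the prompt and answer is reserved.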
Larger context windows let you feed entire codebases into the model. Token counting itself isn't straightforward: different tokenizers produce different token counts for the same text, and the tokenizer affects everything from cost to context budgeting. A common word like "the" is typically a single token, while an uncommon word might be split into three or four. Prompt engineering sometimes means rephrasing a prompt simply to reduce its token count.
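The tokenizer-dependence point can be made concrete with two toy tokenizers. Neither resembles a production BPE tokenizer; they only demonstrate that the same text yields very different counts under different schemes:

```python
# Toy illustration: the same text, two tokenizers, two token counts.
def whitespace_tokenize(text: str) -> list[str]:
    # One token per whitespace-separated word.
    return text.split()

def char_pair_tokenize(text: str) -> list[str]:
    # Naive sub-word scheme: split every word into 2-character chunks,
    # so long or uncommon words cost several tokens each.
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return tokens

text = "the antidisestablishmentarianism"
print(len(whitespace_tokenize(text)))  # 2 tokens
print(len(char_pair_tokenize(text)))   # 16 tokens
```

A real BPE vocabulary would keep "the" as one token but split the rare word into several learned sub-word pieces, which is why uncommon words are more expensive.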
LLM API calls are billed by the token, so understanding token economics changes how you structure your prompts and systems.
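Per-token billing makes cost a simple function of input and output token counts. A back-of-the-envelope sketch, where the per-token prices are hypothetical placeholders rather than any provider's actual pricing:

```python
# Hypothetical prices, for illustration only -- check your provider's
# real price sheet. Output tokens are often priced higher than input.
PRICE_PER_1K_INPUT = 0.01   # $ per 1,000 input (prompt) tokens
PRICE_PER_1K_OUTPUT = 0.03  # $ per 1,000 output (completion) tokens

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call at the placeholder rates above."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# A 4,000-token prompt with a 500-token reply:
print(f"${call_cost(4000, 500):.3f}")  # $0.055
```

At rates like these, trimming a long system prompt or retrieved context directly shrinks the per-call bill, which is why token-conscious prompt design pays off at scale.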