# TokenizationService

> Handles text tokenization using tiktoken with intelligent fallback mechanisms

## Overview

The `TokenizationService` class manages all tokenization operations in Tokenizador. It integrates with the tiktoken library to provide accurate token IDs and counts, with intelligent fallback mechanisms when tiktoken is unavailable.

This service supports 48 AI models from OpenAI, Anthropic, Google, Meta, and other providers.

## Constructor

Creates a new TokenizationService instance and begins initialization.

```javascript
const service = new TokenizationService();
```

**Properties initialized:**

* `encoder`: null (set after initialization)
* `isInitialized`: false (set to true when ready)
* `initPromise`: Promise for initialization tracking
* `isRealTiktoken`: Indicates if real tiktoken or fallback is being used

## Methods

### initializeTokenizer()

Initializes the tiktoken encoder asynchronously.

```javascript
async initializeTokenizer()
```

**Returns:** Promise that resolves when the tokenizer is initialized (or the fallback is ready)

**Initialization process:**

1. Waits up to 10 seconds for tiktoken library to load
2. Checks multiple locations: global context, window object
3. Initializes `cl100k_base` encoding (GPT-4 compatible)
4. Performs test tokenization to verify functionality
5. Sets `isRealTiktoken` flag based on tiktoken availability

```javascript Example
const service = new TokenizationService();
await service.waitForInitialization();

if (service.isInitialized) {
  console.log('Tokenizer ready!');
  console.log('Using real tiktoken:', service.isRealTiktoken);
}
```

If tiktoken fails to load, the service automatically uses fallback tokenization. Token IDs will be marked as approximate.
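
The initialization process listed for `initializeTokenizer()` can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Tokenizador's actual source: `waitForLibrary`, `initEncoder`, the `globalName` option, and the `createEncoder` callback are hypothetical names introduced here, and the only thing assumed about the encoder is that it exposes `encode(text)`, as used elsewhere on this page.

```javascript
// Hypothetical helper (not part of Tokenizador): poll for a global library
// until it appears or the timeout elapses. Resolves to the library or null.
function waitForLibrary({ globalName = 'tiktoken', timeoutMs = 10000, intervalMs = 100 } = {}) {
  return new Promise((resolve) => {
    const deadline = Date.now() + timeoutMs;

    const check = () => {
      // Look in the global context first, then on `window` if it exists.
      const fromGlobal = globalThis[globalName];
      const fromWindow = typeof window !== 'undefined' ? window[globalName] : undefined;
      const lib = fromGlobal ?? fromWindow;

      if (lib) {
        resolve(lib);
      } else if (Date.now() >= deadline) {
        resolve(null); // Timed out: the caller stays in fallback mode.
      } else {
        setTimeout(check, intervalMs);
      }
    };

    check();
  });
}

// Hypothetical initializer: `createEncoder(lib)` is supplied by the caller and
// must return an object with an `encode(text)` method.
async function initEncoder(createEncoder, options) {
  const state = { encoder: null, isInitialized: false, isRealTiktoken: false };
  const lib = await waitForLibrary(options);

  if (lib) {
    try {
      const encoder = createEncoder(lib);
      // Test tokenization: a working encoder returns a non-empty list of IDs.
      const ids = encoder.encode('test');
      if (ids && ids.length > 0) {
        state.encoder = encoder;
        state.isRealTiktoken = true;
      }
    } catch (err) {
      // Any failure leaves the state in fallback mode.
    }
  }

  state.isInitialized = true; // Ready either way: real encoder or fallback.
  return state;
}
```

Resolving to `null` on timeout, rather than rejecting, keeps the fallback path free of exceptions, which matches the behavior described above where the service is ready whether or not tiktoken loaded.
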
### waitForInitialization()

Waits for the tokenizer to complete initialization.

```javascript
async waitForInitialization()
```

**Returns:** Promise that resolves when initialization is complete

```javascript Example
const service = new TokenizationService();

// Wait before tokenizing
await service.waitForInitialization();

// Now safe to tokenize
const result = await service.tokenizeText('Hello world', 'gpt-4o');
```

### tokenizeText()

Tokenizes text using the appropriate method for the specified model.

```javascript
async tokenizeText(text, modelId)
```

**Parameters:**

* `text`: The text to tokenize
* `modelId`: Model identifier (e.g., "gpt-4o", "claude-3.5-sonnet")

**Returns:** Object containing tokens array and count

**Return value structure:**

* `tokens`: Array of token objects with text, type, ID, and metadata
* `count`: Total number of tokens

**Token object structure:**

* `text`: The actual text of the token
* `type`: Token type: "palabra", "subword", "number", "punctuation", "special", "espacio\_en\_blanco"
* `id`: Unique identifier for the token (e.g., "token\_0")
* `tokenId`: Numeric token ID from tiktoken (or approximation)
* `index`: Zero-based position in the token sequence
* `isApproximate`: True if token ID is approximate (fallback mode)

```javascript Basic Usage
const service = new TokenizationService();
await service.waitForInitialization();

const result = await service.tokenizeText('Hello world!', 'gpt-4o');

console.log('Token count:', result.count);
console.log('Tokens:', result.tokens);

// Output:
// Token count: 3
// Tokens: [
//   { text: 'Hello', type: 'palabra', tokenId: 9906, ... },
//   { text: ' world', type: 'palabra_con_espacio', tokenId: 1917, ... },
//   { text: '!', type: 'punctuation', tokenId: 0, ... }
// ]
```

```javascript Model Comparison
const service = new TokenizationService();
await service.waitForInitialization();

const text = 'The quick brown fox jumps over the lazy dog';

const gptResult = await service.tokenizeText(text, 'gpt-4o');
const claudeResult = await service.tokenizeText(text, 'claude-3.5-sonnet');
const llamaResult = await service.tokenizeText(text, 'llama-3.1-70b');

console.log('GPT-4o tokens:', gptResult.count);
console.log('Claude tokens:', claudeResult.count);
console.log('Llama tokens:', llamaResult.count);
```

For models using `cl100k_base` encoding (GPT-4, Claude, etc.), token IDs are exact. For other models, counts are adjusted using model-specific ratios.

### createTokensFromEncoding()

Creates visual token objects from tiktoken encoding.

```javascript
createTokensFromEncoding(text, encoded, modelId)
```

**Parameters:**

* `text`: Original input text
* `encoded`: Array of token IDs from tiktoken.encode()
* `modelId`: Model identifier

**Returns:** Array of token objects for visualization

**Process:**

1. Iterates through each encoded token ID
2. Decodes individual tokens to get exact text
3. Determines token type based on content
4. Creates token object with metadata
5. Marks tokens as approximate if using fallback

```javascript Example
const service = new TokenizationService();
await service.waitForInitialization();

const text = 'Hello world';
const encoded = service.encoder.encode(text);
const tokens = service.createTokensFromEncoding(text, encoded, 'gpt-4o');

console.log(tokens);
// [
//   {
//     text: 'Hello',
//     type: 'palabra',
//     id: 'token_0',
//     tokenId: 9906,
//     index: 0,
//     isApproximate: false
//   },
//   ...
// ]
```

### fallbackTokenization()

Provides tokenization when tiktoken is unavailable.
```javascript
fallbackTokenization(text, modelId)
```

**Parameters:**

* `text`: Text to tokenize
* `modelId`: Model identifier

**Returns:** Object with tokens array and count

**Fallback strategy:**

* Splits text into words and whitespace segments
* Uses heuristics to approximate token boundaries
* Generates deterministic IDs based on content
* Marks all tokens as `isApproximate: true`

Fallback tokenization provides approximate results. Token IDs will not match actual tiktoken IDs but counts are reasonably accurate.

### splitWordIntoTokens()

Splits a word into smaller tokens simulating tiktoken behavior.

```javascript
splitWordIntoTokens(word, startIndex)
```

**Parameters:**

* `word`: Word to split into tokens
* `startIndex`: Starting token index

**Returns:** Array of token objects

**Algorithm:**

* Words ≤3 characters: single token
* Longer words: split based on \~2.8 characters per token ratio
* First chunk marked as "palabra", subsequent as "subword"
* Dynamic chunk sizing based on remaining characters

```javascript Example
const service = new TokenizationService();

const tokens = service.splitWordIntoTokens('tokenization', 0);
console.log(tokens);
// [
//   { text: 'token', type: 'palabra', ... },
//   { text: 'iza', type: 'subword', ... },
//   { text: 'tion', type: 'subword', ... }
// ]
```

### determineTokenType()

Determines the type of a token based on its content.
```javascript
determineTokenType(text)
```

**Parameters:**

* `text`: Token text to classify

**Returns:** Token type: "number", "punctuation", "special", or "palabra"

**Classification rules:**

`number`: Token contains only digits: `^\d+$`

```javascript
determineTokenType('123') // => 'number'
```

`punctuation`: Token is only punctuation marks: `^[.,!?;:'"()\[\]{}]+$`

```javascript
determineTokenType('.,!') // => 'punctuation'
```

`special`: Token contains only special characters: `^[^\w\s]+$`

```javascript
determineTokenType('@#$') // => 'special'
```

`palabra`: Default for word-like tokens

```javascript
determineTokenType('hello') // => 'palabra'
```

### createDeterministicId()

Creates a deterministic numeric ID for fallback tokens.

```javascript
createDeterministicId(text, index)
```

**Parameters:**

* `text`: Token text
* `index`: Token index

**Returns:** Deterministic ID in range 10000-109999

**Algorithm:**

1. Generates simple hash from character codes
2. Combines with index for uniqueness
3. Normalizes to the range 10000-109999

```javascript Example
const service = new TokenizationService();

const id1 = service.createDeterministicId('hello', 0);
const id2 = service.createDeterministicId('hello', 1);
const id3 = service.createDeterministicId('world', 0);

console.log(id1); // e.g., 45712
console.log(id2); // e.g., 46712 (same text, different index)
console.log(id3); // e.g., 52341 (different text)
```

### getTokenizerName()

Returns a human-readable name for a tokenizer encoding.

```javascript
getTokenizerName(encoding)
```

**Parameters:**

* `encoding`: Encoding identifier (e.g., "cl100k\_base")

**Returns:** Display name for the tokenizer

```javascript Example
const service = new TokenizationService();

console.log(service.getTokenizerName('o200k_base'));
// "Tokenizador GPT-4o"

console.log(service.getTokenizerName('cl100k_base'));
// "Tokenizador GPT-4"

console.log(service.getTokenizerName('p50k_base'));
// "Tokenizador GPT-3"
```

### getAlgorithmName()

Returns a description of the tokenization algorithm for a model.
```javascript
getAlgorithmName(modelId)
```

**Parameters:**

* `modelId`: Model identifier

**Returns:** Algorithm description

```javascript Example
const service = new TokenizationService();

console.log(service.getAlgorithmName('gpt-4o'));
// "o200k_base (GPT Más Reciente)"

console.log(service.getAlgorithmName('claude-3.5-sonnet'));
// "Tokenización Claude (~20% más tokens)"

console.log(service.getAlgorithmName('llama-3.1-70b'));
// "Tokenización Llama (~15% menos tokens)"
```

## Token Types

The service classifies tokens into these categories:

* `palabra`: Standard word token
* `subword`: Part of a longer word
* `palabra_con_espacio`: Word with leading space
* `number`: Numeric token
* `punctuation`: Punctuation marks
* `special`: Special characters
* `espacio_en_blanco`: Whitespace
* Decode failure

## Usage Examples

```javascript Basic Tokenization
const service = new TokenizationService();
await service.waitForInitialization();

const result = await service.tokenizeText(
  'Hello, world!',
  'gpt-4o'
);

console.log(`Tokenized into ${result.count} tokens`);
result.tokens.forEach(token => {
  console.log(`"${token.text}" [${token.type}] ID: ${token.tokenId}`);
});
```

```javascript Check Tiktoken Status
const service = new TokenizationService();
await service.waitForInitialization();

if (service.isRealTiktoken) {
  console.log('✓ Using real tiktoken - IDs are accurate');
} else {
  console.log('⚠ Using fallback - IDs are approximate');
}
```

```javascript Model-Specific Tokenization
const service = new TokenizationService();
await service.waitForInitialization();

const models = ['gpt-4o', 'claude-3.5-sonnet', 'llama-3.1-70b'];
const text = 'This is a test sentence for tokenization.';

for (const model of models) {
  const result = await service.tokenizeText(text, model);
  console.log(`${model}: ${result.count} tokens`);
}
```

## Model Support

The service supports multiple encoding strategies:

**OpenAI:** GPT-4, GPT-4 Turbo, GPT-3.5 Turbo\
**Anthropic:** Claude 3 Opus, Claude 3.5 Sonnet\
**Meta:** Llama 3.1 (with ratio adjustment)

Uses exact tiktoken encoding with model-specific token ratios.

**OpenAI:** GPT-4o, GPT-4o Mini

Uses newer tokenizer with improved efficiency.

**Google:** Gemini (SentencePiece approximation)\
**Mistral:** Mistral models (ratio-based approximation)\
**Cohere:** Command models (ratio-based approximation)

Uses fallback with model-specific token ratios.

## Error Handling

```javascript Handle Initialization Failure
const service = new TokenizationService();
await service.waitForInitialization();

if (!service.isInitialized) {
  console.error('Tokenization service failed to initialize');
  // Service will use fallback mode automatically
}

try {
  const result = await service.tokenizeText('test', 'gpt-4o');
  console.log('Tokenization successful:', result);
} catch (error) {
  console.error('Tokenization error:', error);
}
```

The service gracefully falls back to approximate tokenization if tiktoken fails to load. Your application continues working with slightly reduced accuracy.

## See Also

* Main application orchestrator
* Calculate costs and statistics
* View all 48 supported models
* Understand the system design