
Token Tracking Improvements

Summary

Fixed all identified token tracking issues to ensure accurate token reporting from OpenCode. The system now logs every extraction attempt with detailed failure reasons, making it clear when estimation is used and why.

Issues Fixed

1. Multiple Fallback Layers Obscure Real Usage

Before: Each fallback layer failed silently, making it unclear which method provided the count. After:

  • Added tokenExtractionLog array that tracks every extraction attempt
  • Each method logs success/failure with detailed reasons
  • Final log shows complete extraction history
  • Token source is explicitly tracked: 'stream', 'response-openai', 'response-google', 'response-direct', 'output-json', 'output-text', 'session', or 'estimate-improved'
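
A minimal sketch of how an extraction log might accumulate entries. The logAttempt helper and entry fields are illustrative, not the literal chat/server.js code; the method names follow the failure reasons listed in the Debugging Guide below.

const tokenExtractionLog = [];

// Illustrative helper: record one extraction attempt with its outcome.
function logAttempt(method, success, detail = {}) {
  tokenExtractionLog.push({ method, success, ...detail, at: Date.now() });
}

logAttempt('stream', false, { reason: 'message.opencodeTokensUsed not set during streaming' });
logAttempt('session_query', true, { tokens: 1234, source: 'session' });
console.log(tokenExtractionLog); // complete extraction history, in order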

2. Token Extraction from OpenCode Output is Unreliable

Before: Relied on regex pattern matching without verification. After:

  • Added validation for all extracted token values
  • Logs exactly which pattern matched and what value was found
  • If a pattern matches but validation fails, the failure is logged with its reason
  • Tracks ANSI stripping and format issues in logs
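
As a sketch of this path (the real patterns in chat/server.js may differ), the text parser strips ANSI escapes, tries a pattern, and only accepts the value after it passes validateTokenCount() (described under issue 5):

// Illustrative only: the pattern and helper signature are assumptions.
function extractTokensFromText(output, contentLength) {
  const clean = output.replace(/\x1b\[[0-9;]*m/g, ''); // strip ANSI color codes
  const match = clean.match(/tokens[:\s]+(\d+)/i);
  if (!match) {
    return { success: false, reason: 'No text token patterns matched' };
  }
  const tokens = Number(match[1]);
  const check = validateTokenCount(tokens, contentLength);
  if (!check.valid) {
    return { success: false, reason: check.reason, matchedValue: tokens };
  }
  return { success: true, tokens, source: 'output-text' };
}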

3. Session Info API Not Guaranteed to Work

Before: Multiple command variations tried but failures were generic. After:

  • Enhanced getOpencodeSessionTokenUsage() with comprehensive logging:
    • Logs each command attempt with full command string
    • Logs response details (stdout/stderr sample)
    • Logs JSON parsing attempts and available keys
    • Logs validation of extracted values
    • Returns detailed attempt history
  • Uses emojis for visual clarity: 🔍 (searching), ✓ (success), ✗ (failed), ⚠️ (warning)
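
A sketch of the query loop (the command string mirrors the Session Query Logging example below; the loop structure, the parsed.tokens field name, and the return shape are assumptions):

const { execFile } = require('child_process');
const { promisify } = require('util');
const execFileAsync = promisify(execFile);

// Sketch: try each candidate command until one yields a valid token count.
async function getOpencodeSessionTokenUsage(sessionId) {
  const attempts = [];
  const candidates = [
    ['session', 'info', '--id', sessionId, '--json'],
    // ...other command variations
  ];
  for (const args of candidates) {
    console.log(`  → Trying: opencode ${args.join(' ')}`);
    try {
      const { stdout } = await execFileAsync('opencode', args);
      const parsed = JSON.parse(stdout);
      console.log('  ✓ JSON parse successful', { parsedKeys: Object.keys(parsed) });
      if (Number.isFinite(parsed.tokens) && parsed.tokens > 0) {
        return { tokens: parsed.tokens, attempts };
      }
      attempts.push({ args, reason: 'session query returned 0 tokens' });
    } catch (err) {
      attempts.push({ args, reason: `Command execution failed: ${err.message}` });
    }
  }
  return { tokens: 0, attempts }; // attempt history travels with the result
}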

4. Streaming Capture is Opportunistic

Before: Silently failed if tokens didn't appear during streaming. After:

  • Logs when stream capture is attempted
  • Logs if message.opencodeTokensUsed is not set
  • Validates captured tokens before accepting
  • Tracks whether streaming was available
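
Sketch of the stream-capture check (field and helper names follow the failure reasons documented below; validateTokenCount() is described under issue 5):

// Sketch: accept stream-captured tokens only after validation.
function tryStreamTokens(message, contentLength, log) {
  if (message.opencodeTokensUsed == null) {
    log.push({ method: 'stream', success: false,
               reason: 'message.opencodeTokensUsed not set during streaming' });
    return null;
  }
  const check = validateTokenCount(message.opencodeTokensUsed, contentLength);
  if (!check.valid) {
    log.push({ method: 'stream', success: false, reason: check.reason });
    return null;
  }
  log.push({ method: 'stream', success: true, source: 'stream' });
  return message.opencodeTokensUsed;
}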

5. No Verification or Retry Logic

Before: No validation that extracted tokens are reasonable. After:

  • Added validateTokenCount() function that checks:
    • Token is a positive finite number
    • Token doesn't exceed reasonable maximum (1M)
    • Chars-per-token ratio is within reasonable bounds (0.5-15)
  • All extraction methods validate tokens before accepting
  • Failed validations are logged with a specific reason
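
A minimal sketch of these checks (the real validateTokenCount() in chat/server.js also supports an optional adjustedTokens in its return value, omitted here):

const MAX_REASONABLE_TOKENS = 1_000_000;

function validateTokenCount(tokens, contentLength) {
  if (!Number.isFinite(tokens) || tokens <= 0) {
    return { valid: false, reason: 'tokens is not a positive finite number' };
  }
  if (tokens > MAX_REASONABLE_TOKENS) {
    return { valid: false, reason: `tokens exceeds reasonable maximum of ${MAX_REASONABLE_TOKENS}` };
  }
  if (contentLength > 0) {
    const ratio = contentLength / tokens;
    if (ratio < 0.5 || ratio > 15) {
      return { valid: false,
               reason: `chars per token is suspiciously ${ratio < 0.5 ? 'low' : 'high'} (${ratio.toFixed(2)})` };
    }
  }
  return { valid: true };
}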

6. Estimation is Always Used as Fallback

Before: Impossible to know when real tracking failed vs when estimation was needed. After:

  • Estimation only used after all extraction methods fail
  • Logs comprehensive details when falling back to estimation:
    • Input/output token breakdown
    • Content lengths
    • Calculation formula
    • Complete extraction attempt history
  • Emits a console warning ([TOKEN_TRACKING] ⚠️ ESTIMATION USED:) with full context
  • Token source explicitly marked as 'estimate-improved'
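
Sketch of the fallback itself. The 4-chars-per-token divisor and the way input and output tokens combine are assumptions; the actual formula is included in the runtime log:

// Sketch: only reached after every extraction method has failed.
function estimateTokens(inputText, outputText, messageId, model, extractionLog) {
  const CHARS_PER_TOKEN = 4; // assumed heuristic
  const inputTokens = Math.ceil(inputText.length / CHARS_PER_TOKEN);
  const outputTokens = Math.ceil(outputText.length / CHARS_PER_TOKEN);
  const estimatedTokens = inputTokens + outputTokens;
  console.warn('[TOKEN_TRACKING] ⚠️ ESTIMATION USED:', JSON.stringify({
    messageId, model, estimatedTokens, inputTokens, outputTokens,
    reason: 'All token extraction methods failed',
    extractionLog,
  }, null, 2));
  return { tokens: estimatedTokens, tokenSource: 'estimate-improved' };
}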

7. Provider Usage Recording May Not Match Actual Consumption

Before: Provider limits based on potentially inaccurate estimates. After:

  • Provider usage recording now happens after validation
  • Token source is tracked alongside usage
  • Logs clearly indicate when estimated vs actual tokens are recorded
  • Can audit provider usage accuracy via token source
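
A sketch of recording usage together with its source so accuracy can be audited later (recordProviderUsage and its persistence layer are hypothetical; the log line matches the format shown under Testing Instructions below):

// Hypothetical: attach the token source to every usage record.
function recordProviderUsage(userId, model, tokensUsed, tokenSource) {
  console.log(`[USAGE] processMessage: recording tokens user=${userId} ` +
              `tokens=${tokensUsed} model=${model} source=${tokenSource}`);
  // ...persist { userId, model, tokensUsed, tokenSource, at: Date.now() }
}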

8. Early Termination/Error Cases Default to Estimation

Before: Error paths skipped real token counting. After:

  • Extraction attempts still occur even in error cases
  • Logs show why extraction failed (missing session, no output, etc.)
  • Estimation details include context about error condition
  • Can distinguish "no tokens found" from "extraction failed"
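
Sketch of the error path, reusing the helpers sketched above (names and structure are assumptions): extraction still runs on whatever output and session are available, and every skipped step is logged.

// Sketch: error handling still attempts real extraction before estimating.
async function extractTokensAfterError(err, partialOutput, sessionId, log) {
  log.push({ method: 'cli', success: false,
             reason: `Command execution failed: ${err.message}` });
  if (partialOutput) {
    const fromText = extractTokensFromText(partialOutput, partialOutput.length);
    if (fromText.success) return { tokens: fromText.tokens, source: fromText.source };
    log.push({ method: 'text_parsing', success: false, reason: fromText.reason });
  } else {
    log.push({ method: 'text_parsing', success: false, reason: 'no finalOutput available' });
  }
  if (sessionId) {
    const fromSession = await getOpencodeSessionTokenUsage(sessionId);
    if (fromSession.tokens > 0) return { tokens: fromSession.tokens, source: 'session' };
    log.push({ method: 'session_query', success: false, reason: 'session query returned 0 tokens' });
  } else {
    log.push({ method: 'session_query', success: false, reason: 'no opencodeSessionId' });
  }
  return null; // caller falls back to estimation, with the full log as context
}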

New Logging Structure

Successful Token Extraction

✓ Token extraction: Using token usage captured from stream
  { tokensUsed: 1234, tokenSource: 'stream', messageId: 'msg_abc' }
✅ Token extraction successful
  { tokensUsed: 1234, tokenSource: 'stream', extractionLog: [...] }

Failed Extraction → Estimation

✗ Token extraction: Stream tokens failed validation
  { tokens: 9999999, reason: 'tokens exceeds reasonable maximum of 1000000' }
✗ Token extraction: No token fields found in parsed response
  { availableKeys: ['reply', 'model'], checkedFields: [...] }
✗ Token extraction: Session query returned 0 tokens
⚠️ Token extraction: All methods failed, will fall back to estimation
  { extractionLog: [...full history...], hadStream: false, hadParsed: true }
[TOKEN_TRACKING] ⚠️ ESTIMATION USED: {
  "messageId": "msg_abc",
  "model": "gpt-4",
  "estimatedTokens": 150,
  "inputTokens": 100,
  "outputTokens": 80,
  "reason": "All token extraction methods failed"
}

Session Query Logging

🔍 getOpencodeSessionTokenUsage: Starting session token query
  { sessionId: 'ses_123', candidateCount: 5 }
  → Trying: opencode session info --id ses_123 --json
  ← Response received { hasStdout: true, stdoutLength: 245 }
  ✓ JSON parse successful { parsedKeys: ['session', 'tokens'] }
✅ getOpencodeSessionTokenUsage: Successfully extracted tokens from JSON
  { tokens: 1234, command: 'session info --id ses_123 --json' }

Code Changes

Files Modified

  • chat/server.js - Main implementation file

Functions Enhanced

  1. sendToOpencode() (lines ~5990-6160)

    • Added tokenExtractionLog array
    • Added validation for all extraction methods
    • Enhanced logging with ✓/✗ indicators
    • Returns { ..., tokenSource, tokenExtractionLog }
  2. getOpencodeSessionTokenUsage() (lines ~9651-9750)

    • Complete rewrite with detailed logging
    • Tracks all command attempts
    • Validates extracted values
    • Returns attempt history in logs
  3. sendToOpencodeWithFallback() (lines ~6324-6430)

    • Uses improved token extraction from sendToOpencode()
    • Adds validation and logging at fallback level
    • Emits console warnings for estimation
    • Passes through tokenSource and tokenExtractionLog
  4. processMessage() (lines ~6630-6730)

    • Uses tokenSource and tokenExtractionLog from result
    • Adds comprehensive estimation logging
    • Tracks extraction attempts across entire message flow
    • Emits detailed console warnings
  5. NEW: validateTokenCount() (lines ~4271-4330)

    • Validates token is positive finite number
    • Checks against reasonable maximum (1M)
    • Validates chars-per-token ratio (0.5-15)
    • Returns { valid, reason, adjustedTokens? }

Testing Instructions

1. Monitor Logs for Token Extraction

Start the server and watch logs:

cd chat
npm start

Send a message and look for token extraction logs:

  • Should see 🔍 for search attempts
  • Should see ✓ for successful extraction
  • Should see ✗ for failed attempts
  • Should see ⚠️ if estimation is used

2. Verify Token Source Tracking

Check console output for each message:

[USAGE] processMessage: recording tokens user=user_123 tokens=150 model=gpt-4 source=stream

Token source should be one of:

  • stream - Captured during CLI streaming
  • response-openai - From OpenAI format response
  • response-google - From Google/Gemini format
  • response-direct - From direct token fields
  • output-json - Parsed from JSON in output
  • output-text - Parsed from text patterns
  • session - From session info query
  • estimate-improved - Estimation fallback

3. Test Estimation Scenarios

To verify estimation logging:

  1. Use a model that doesn't report tokens
  2. Disable session info (remove workspace dir)
  3. Check for console warning:
[TOKEN_TRACKING] ⚠️ ESTIMATION USED: {
  "messageId": "...",
  "estimatedTokens": 150,
  "inputTokens": 100,
  "outputTokens": 80,
  "reason": "All token extraction methods failed",
  "extractionLog": [...]
}

4. Verify Validation

Test validation by checking logs for:

  • Very large token counts (> 1M) → should be rejected
  • Suspicious chars-per-token ratios → should be logged
  • Token validation failures → should fall back to next method

5. Session Info Query Testing

Check session query logs:

# Look for detailed session query logs
grep "getOpencodeSessionTokenUsage" chat/logs/*.log

Should see:

  • All command attempts
  • Response samples
  • JSON parsing results
  • Validation results

Debugging Guide

When Tokens Are Estimated

If you see estimation warnings, check the extractionLog field to see why:

{
  "extractionLog": [
    { "method": "stream", "success": false, "reason": "message.opencodeTokensUsed not set during streaming" },
    { "method": "parsed_response", "success": false, "reason": "no parsed response available" },
    { "method": "text_parsing", "success": false, "reason": "no finalOutput available" },
    { "method": "session_query", "success": false, "reason": "no opencodeSessionId" }
  ]
}

This tells you exactly which methods were tried and why each failed.

Common Failure Reasons

Stream capture fails:

  • message.opencodeTokensUsed not set during streaming - OpenCode didn't output tokens in expected format during streaming

Parsed response fails:

  • None of the expected token fields found - Response doesn't have standard token fields
  • parsed is not an object - Response couldn't be parsed as JSON

Text parsing fails:

  • No JSON token patterns matched - Output doesn't contain JSON token data
  • No text token patterns matched - Output doesn't contain text like "tokens: 123"

Session query fails:

  • no opencodeSessionId - Session ID not available
  • session query returned 0 tokens - Command succeeded but found no tokens
  • Command execution failed - OpenCode CLI command failed (see error details)

Validation fails:

  • tokens exceeds reasonable maximum - Suspiciously high token count
  • chars per token is suspiciously low/high - Token count doesn't match content length

Future Improvements

Potential enhancements (not implemented yet):

  1. Store tokenSource in message record for historical analysis
  2. Add metrics dashboard showing token source distribution
  3. Implement retry logic with exponential backoff for session queries
  4. Add post-completion verification step that double-checks token counts
  5. Query OpenCode session logs/files directly if API calls fail
  6. Implement confidence scoring for extracted tokens
  7. Add alerting when estimation rate exceeds threshold

Files Reference

  • Implementation: chat/server.js
  • This Document: TOKEN_TRACKING_IMPROVEMENTS.md
  • Related Docs:
    • TOKEN_TRACKING_FIXES.md - Previous multiplier fixes
    • TOKEN_USAGE_IMPLEMENTATION.md - Overall system architecture
    • TOKEN_USAGE_COMPLETION_SUMMARY.md - Completion summary