tsdoc/readme.plan.md

TSDocs Context Optimization Plan

Problem Statement

For large TypeScript projects, the context generated for AI-based documentation creation becomes too large, potentially exceeding even o4-mini's 200K token limit. This affects the ability to effectively generate:

  • Project documentation (README.md)
  • API descriptions and keywords
  • Commit messages and changelogs

The current implementation simply includes all TypeScript files and key project files, but it lacks intelligent selection, prioritization, and content-reduction mechanisms.

Analysis of Approaches

1. Smart Content Selection

Description: Intelligently select only files that are necessary for the specific task being performed, using heuristic rules.

Advantages:

  • Simple to implement
  • Predictable behavior
  • Can be fine-tuned for different operations

Disadvantages:

  • Requires manual tuning of rules
  • May miss important context in complex projects
  • Static approach lacks adaptability

Implementation Complexity: Medium

2. File Prioritization

Description: Rank files by relevance using git history, file size, import/export analysis, and relationship to the current task.

Advantages:

  • Adaptively includes the most relevant files first
  • Maintains context for frequently changed or central files
  • Can leverage git history for additional signals

Disadvantages:

  • Complexity in determining accurate relevance scores
  • Requires analyzing project structure
  • May require scanning imports/exports for dependency analysis

Implementation Complexity: High

3. Chunking Strategy

Description: Process the project in logical segments, generating intermediate results that are then combined to create the final output.

Advantages:

  • Can handle projects of any size
  • Focused context for each specific part
  • May improve quality by focusing on specific areas deeply

Disadvantages:

  • Complex orchestration of multiple AI calls
  • Challenge in maintaining consistency across chunks
  • May increase time and cost for processing

Implementation Complexity: High

4. Dynamic Context Trimming

Description: Automatically reduce context by removing non-essential code while preserving structure. Techniques include:

  • Removing implementation details but keeping interfaces and type definitions
  • Truncating large functions while keeping signatures
  • Removing comments and whitespace (except JSDoc)
  • Keeping only imports/exports for context files

Advantages:

  • Preserves full project structure
  • Flexible token usage based on importance
  • Good balance between completeness and token efficiency

Disadvantages:

  • Potential to remove important implementation details
  • Risk of missing context needed for specific tasks
  • Complex rules for what to trim vs keep

Implementation Complexity: Medium
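The body-stripping technique above can be sketched as follows. This is a minimal illustration, not the planned implementation: the function name and stub text are placeholders, and the naive regex/brace matching does not handle braces inside strings or comments — a production trimmer would use the TypeScript compiler API instead.

```typescript
// Illustrative sketch: remove function bodies, keep signatures.
// Interfaces, type definitions, and imports are untouched because
// the pattern only matches `function` declarations.
function stripFunctionBodies(source: string): string {
  const headerRe =
    /(^|\n)((?:export\s+)?(?:async\s+)?function\s+\w+\s*\([^)]*\)(?:\s*:\s*[\w<>\[\], .|]+)?\s*)\{/g;
  let result = "";
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = headerRe.exec(source)) !== null) {
    const braceStart = m.index + m[0].length - 1; // position of the '{'
    // Naive brace matching to find the end of the body.
    let depth = 1;
    let j = braceStart + 1;
    while (j < source.length && depth > 0) {
      if (source[j] === "{") depth++;
      else if (source[j] === "}") depth--;
      j++;
    }
    result += source.slice(last, braceStart) + "{ /* trimmed */ }";
    last = j;
    headerRe.lastIndex = j; // resume scanning after the stripped body
  }
  return result + source.slice(last);
}
```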

5. Embeddings-Based Retrieval

Description: Create vector embeddings of project files and retrieve only the most relevant ones for a specific task using semantic similarity.

Advantages:

  • Highly adaptive to different types of requests
  • Leverages semantic understanding of content
  • Can scale to extremely large projects

Disadvantages:

  • Requires setting up and managing embeddings database
  • Added complexity of running vector similarity searches
  • Higher resource requirements for maintaining embeddings

Implementation Complexity: Very High

6. Task-Specific Contexts

Description: Create separate optimized contexts for different tasks (readme, commit messages, etc.) with distinct file selection and processing strategies.

Advantages:

  • Highly optimized for each specific task
  • Efficient token usage for each operation
  • Improved quality through task-focused contexts

Disadvantages:

  • Maintenance of multiple context building strategies
  • More complex configuration
  • Potential duplication in implementation

Implementation Complexity: Medium

7. Recursive Summarization

Description: Summarize larger files first, then combine those summaries with the smaller files (included in full) to form the final context.

Advantages:

  • Can handle arbitrary project sizes
  • Preserves essential information from all files
  • Balanced approach to token usage

Disadvantages:

  • Quality loss from summarization
  • Increased processing time from multiple AI calls
  • Complex orchestration logic

Implementation Complexity: High
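The hybrid full/summary split can be sketched with a simple size threshold. The threshold value and type names here are assumptions for illustration, not part of the existing codebase; the actual summarization call is out of scope.

```typescript
// Files under the threshold go into the context verbatim; larger ones
// are routed to a separate summarization pass.
interface SourceFile {
  path: string;
  content: string;
}

const SUMMARY_THRESHOLD_CHARS = 2000; // assumption; roughly ~500 tokens

function planSummarization(files: SourceFile[]): {
  full: SourceFile[];
  toSummarize: SourceFile[];
} {
  const full: SourceFile[] = [];
  const toSummarize: SourceFile[] = [];
  for (const f of files) {
    (f.content.length <= SUMMARY_THRESHOLD_CHARS ? full : toSummarize).push(f);
  }
  return { full, toSummarize };
}
```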

Implementation Strategy

We propose a phased implementation approach, starting with the most impactful and straightforward approaches, then building toward more complex solutions as needed:

Phase 1: Foundation (1-2 weeks)

  1. Implement Dynamic Context Trimming

    • Create a ContextProcessor class that takes SmartFile objects and applies trimming rules
    • Implement configurable trimming rules (remove implementations, keep signatures)
    • Add a configuration option to control trimming aggressiveness
    • Support preserving JSDoc comments while removing other comments
  2. Enhance Token Monitoring

    • Track token usage per file to identify problematic files
    • Implement token budgeting to stay within limits
    • Add detailed token reporting for optimization
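Token budgeting might look like the following sketch. `BudgetedFile` is an illustrative stand-in for the project's `SmartFile`, and per-file token counts are assumed to be precomputed by the token monitoring described above.

```typescript
// Greedily include files (already ordered by priority) until the
// token budget is exhausted; files that would overflow are skipped.
interface BudgetedFile {
  path: string;
  tokens: number; // estimated token count for the file's contents
}

function fitToBudget(files: BudgetedFile[], maxTokens: number): BudgetedFile[] {
  const selected: BudgetedFile[] = [];
  let used = 0;
  for (const file of files) {
    if (used + file.tokens > maxTokens) continue; // would exceed the budget
    selected.push(file);
    used += file.tokens;
  }
  return selected;
}
```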

Phase 2: Smart Selection (2-3 weeks)

  1. Implement Task-Specific Contexts

    • Create specialized context builders for readme, commit messages, and descriptions
    • Customize file selection rules for each task
    • Add configuration options for task-specific settings
  2. Add Smart Content Selection

    • Implement heuristic rules for file importance
    • Create configuration for inclusion/exclusion patterns
    • Add ability to focus on specific directories or modules
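A minimal version of the inclusion/exclusion rules could look like this. Prefix matching is used only to keep the sketch dependency-free; a real implementation would likely support glob patterns.

```typescript
// Keep a path only if it matches an include prefix and no exclude prefix.
function selectFiles(
  paths: string[],
  include: string[],
  exclude: string[],
): string[] {
  return paths.filter(
    (p) =>
      include.some((prefix) => p.startsWith(prefix)) &&
      !exclude.some((prefix) => p.startsWith(prefix)),
  );
}
```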

Phase 3: Advanced Techniques (3-4 weeks)

  1. Implement File Prioritization

    • Add git history analysis to identify frequently changed files
    • Implement dependency analysis to identify central files
    • Create a scoring system for file relevance
  2. Add Optional Recursive Summarization

    • Implement file summarization for large files
    • Create a hybrid approach that mixes full files and summaries
    • Add configuration to control summarization thresholds
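A relevance score combining the two Phase 3 signals might be sketched as below. The weights and field names are assumptions; actual values would need tuning against real projects.

```typescript
// Combine git churn (commits touching the file) with dependency
// centrality (how many files import it) into a single score.
interface FileSignals {
  path: string;
  commitCount: number; // from git log analysis
  importedBy: number;  // from import/export dependency analysis
}

function relevanceScore(s: FileSignals): number {
  // Log-dampen both signals so one very hot file cannot dominate.
  return 0.6 * Math.log1p(s.commitCount) + 0.4 * Math.log1p(s.importedBy);
}

function rankByRelevance(files: FileSignals[]): FileSignals[] {
  return [...files].sort((a, b) => relevanceScore(b) - relevanceScore(a));
}
```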

Phase 4: Research-Based Approaches (Future Consideration)

  1. Research and Evaluate Embeddings-Based Retrieval

    • Prototype embeddings creation for TypeScript files
    • Evaluate performance and accuracy
    • Implement if benefits justify the complexity
  2. Explore Chunking Strategies

    • Research effective chunking approaches for documentation
    • Prototype and evaluate performance
    • Implement if benefits justify the complexity

Technical Design

Core Components

  1. ContextBuilder - Enhanced version of current ProjectContext

    ```typescript
    interface IContextBuilder {
      buildContext(): Promise<string>;
      getTokenCount(): number;
      setContextMode(mode: 'normal' | 'trimmed' | 'summarized'): void;
      setTokenBudget(maxTokens: number): void;
      setPrioritizationStrategy(strategy: IPrioritizationStrategy): void;
    }
    ```
  2. FileProcessor - Handles per-file processing and trimming

    ```typescript
    interface IFileProcessor {
      processFile(file: SmartFile): Promise<string>;
      setProcessingMode(mode: 'full' | 'trim' | 'summarize'): void;
      getTokenCount(): number;
    }
    ```
  3. PrioritizationStrategy - Ranks files by importance

    ```typescript
    interface IPrioritizationStrategy {
      rankFiles(files: SmartFile[], context: string): Promise<SmartFile[]>;
      setImportanceMetrics(metrics: IImportanceMetrics): void;
    }
    ```
  4. TaskContextFactory - Creates optimized contexts for specific tasks

    ```typescript
    interface ITaskContextFactory {
      createContextForReadme(projectDir: string): Promise<string>;
      createContextForCommit(projectDir: string, diff: string): Promise<string>;
      createContextForDescription(projectDir: string): Promise<string>;
    }
    ```

Configuration Options

The system will support configuration via a new section in npmextra.json:

```json
{
  "tsdoc": {
    "context": {
      "maxTokens": 190000,
      "defaultMode": "dynamic",
      "taskSpecificSettings": {
        "readme": {
          "mode": "full",
          "includePaths": ["src/", "lib/"],
          "excludePaths": ["test/", "examples/"]
        },
        "commit": {
          "mode": "trimmed",
          "focusOnChangedFiles": true
        },
        "description": {
          "mode": "summarized",
          "includePackageInfo": true
        }
      },
      "trimming": {
        "removeImplementations": true,
        "preserveInterfaces": true,
        "preserveTypeDefs": true,
        "preserveJSDoc": true,
        "maxFunctionLines": 5
      }
    }
  }
}
```
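Reading the trimming settings with sensible fallbacks could look like the sketch below. The loader name is hypothetical; only the npmextra.json shape above is taken from this plan.

```typescript
// Merge user-supplied trimming settings over the documented defaults.
interface TrimmingConfig {
  removeImplementations: boolean;
  preserveInterfaces: boolean;
  preserveTypeDefs: boolean;
  preserveJSDoc: boolean;
  maxFunctionLines: number;
}

const trimmingDefaults: TrimmingConfig = {
  removeImplementations: true,
  preserveInterfaces: true,
  preserveTypeDefs: true,
  preserveJSDoc: true,
  maxFunctionLines: 5,
};

function loadTrimmingConfig(raw: unknown): TrimmingConfig {
  // Walk the npmextra.json shape: tsdoc -> context -> trimming.
  const section = (raw as any)?.tsdoc?.context?.trimming ?? {};
  return { ...trimmingDefaults, ...section };
}
```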

Cost-Benefit Analysis

Cost Considerations

  1. Development costs

    • Initial implementation of foundational components (~30-40 hours)
    • Testing and validation across different project sizes (~10-15 hours)
    • Documentation and configuration examples (~5 hours)
  2. Operational costs

    • Potential increased processing time for context preparation
    • Additional API calls for summarization or embeddings approaches
    • Monitoring and maintenance of the system

Benefits

  1. Scalability

  • Support for projects of any size, including those that would otherwise exceed o4-mini's 200K token limit
    • Future-proof design that can adapt to different models and token limits
  2. Quality improvements

    • More focused contexts lead to better AI outputs
    • Task-specific optimization improves relevance
    • Consistent performance regardless of project size
  3. User experience

    • Predictable behavior for all project sizes
    • Transparent token usage reporting
    • Configuration options for different usage patterns

First Deliverable

For immediate improvements, we recommend implementing Dynamic Context Trimming and Task-Specific Contexts first, as these offer the best balance of impact and implementation complexity.

Implementation Plan for Dynamic Context Trimming

  1. Create a basic ContextTrimmer class that processes TypeScript files:

    • Remove function bodies but keep signatures
    • Preserve interface and type definitions
    • Keep imports and exports
    • Preserve JSDoc comments
  2. Integrate with the existing ProjectContext class:

    • Add a trimming mode option
    • Apply trimming during the context building process
    • Track and report token savings
  3. Modify the CLI to support trimming options:

    • Add a --trim flag to enable trimming
    • Add a --trim-level option for controlling aggressiveness
    • Show token usage with and without trimming

This approach could reduce token usage by 40-70% while preserving the essential structure of the codebase, making it suitable for large projects while maintaining high-quality AI outputs.