# TSDocs Context Optimization Plan

## Problem Statement

For large TypeScript projects, the context generated for AI-based documentation creation becomes too large, potentially exceeding even o4-mini's 200K token limit. This affects the ability to effectively generate:

- Project documentation (README.md)
- API descriptions and keywords
- Commit messages and changelogs

The current implementation simply includes all TypeScript files and key project files; it lacks intelligent selection, prioritization, and content-reduction mechanisms.

## Analysis of Approaches

### 1. Smart Content Selection

**Description:** Intelligently select only the files that are necessary for the specific task being performed, using heuristic rules.

**Advantages:**
- Simple to implement
- Predictable behavior
- Can be fine-tuned for different operations

**Disadvantages:**
- Requires manual tuning of rules
- May miss important context in complex projects
- Static approach lacks adaptability

**Implementation Complexity:** Medium

### 2. File Prioritization

**Description:** Rank files by relevance using git history, file size, import/export analysis, and relationship to the current task.

**Advantages:**
- Adaptively includes the most relevant files first
- Maintains context for frequently changed or central files
- Can leverage git history for additional signals

**Disadvantages:**
- Complexity in determining accurate relevance scores
- Requires analyzing project structure
- May require scanning imports/exports for dependency analysis

**Implementation Complexity:** High

### 3. Chunking Strategy

**Description:** Process the project in logical segments, generating intermediate results that are then combined to create the final output.
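The chunking step could be sketched as follows. This is a minimal illustration, not the project's actual implementation: `ProjectFile`, `estimateTokens`, and the 4-characters-per-token ratio are all illustrative assumptions.

```typescript
// Split files into chunks that each fit a per-call token budget; each
// chunk would then be processed independently and the intermediate
// results combined into the final output.
interface ProjectFile {
  path: string;
  content: string;
}

// Rough heuristic: ~4 characters per token (an assumption, not a real tokenizer).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function chunkFiles(files: ProjectFile[], tokenBudget: number): ProjectFile[][] {
  const chunks: ProjectFile[][] = [];
  let current: ProjectFile[] = [];
  let used = 0;
  for (const file of files) {
    const cost = estimateTokens(file.content);
    // Start a new chunk when adding this file would exceed the budget.
    if (current.length > 0 && used + cost > tokenBudget) {
      chunks.push(current);
      current = [];
      used = 0;
    }
    current.push(file);
    used += cost;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

A real implementation would also need to decide how to merge per-chunk outputs, which is where the consistency challenges noted below come in.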
**Advantages:**
- Can handle projects of any size
- Focused context for each specific part
- May improve quality by focusing on specific areas deeply

**Disadvantages:**
- Complex orchestration of multiple AI calls
- Challenge in maintaining consistency across chunks
- May increase time and cost of processing

**Implementation Complexity:** High

### 4. Dynamic Context Trimming

**Description:** Automatically reduce context by removing non-essential code while preserving structure. Techniques include:

- Removing implementation details while keeping interfaces and type definitions
- Truncating large functions while keeping signatures
- Removing comments and whitespace (except JSDoc)
- Keeping only imports/exports for context files

**Advantages:**
- Preserves the full project structure
- Flexible token usage based on importance
- Good balance between completeness and token efficiency

**Disadvantages:**
- Potential to remove important implementation details
- Risk of missing context needed for specific tasks
- Complex rules for what to trim vs. keep

**Implementation Complexity:** Medium

### 5. Embeddings-Based Retrieval

**Description:** Create vector embeddings of project files and retrieve only the most relevant ones for a specific task using semantic similarity.

**Advantages:**
- Highly adaptive to different types of requests
- Leverages semantic understanding of content
- Can scale to extremely large projects

**Disadvantages:**
- Requires setting up and managing an embeddings database
- Added complexity of running vector similarity searches
- Higher resource requirements for maintaining embeddings

**Implementation Complexity:** Very High

### 6. Task-Specific Contexts

**Description:** Create separate optimized contexts for different tasks (readme, commit messages, etc.) with distinct file selection and processing strategies.
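As a rough sketch of what per-task file selection might look like: the task names mirror the ones used in this plan, but the prefix rules below are illustrative assumptions, not the project's actual configuration.

```typescript
// Each task gets its own inclusion/exclusion rules; selection is a
// simple path-prefix filter in this sketch.
type Task = 'readme' | 'commit' | 'description';

interface TaskContextRules {
  includePrefixes: string[];
  excludePrefixes: string[];
}

// Hypothetical rule set for illustration only.
const rulesByTask: Record<Task, TaskContextRules> = {
  readme: { includePrefixes: ['src/', 'lib/'], excludePrefixes: ['test/'] },
  commit: { includePrefixes: ['src/'], excludePrefixes: [] },
  description: { includePrefixes: ['src/'], excludePrefixes: ['test/', 'examples/'] },
};

function selectFilesForTask(task: Task, paths: string[]): string[] {
  const rules = rulesByTask[task];
  return paths.filter(
    (p) =>
      rules.includePrefixes.some((pre) => p.startsWith(pre)) &&
      !rules.excludePrefixes.some((pre) => p.startsWith(pre)),
  );
}
```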
**Advantages:**
- Highly optimized for each specific task
- Efficient token usage for each operation
- Improved quality through task-focused contexts

**Disadvantages:**
- Maintenance of multiple context-building strategies
- More complex configuration
- Potential duplication in implementation

**Implementation Complexity:** Medium

### 7. Recursive Summarization

**Description:** Summarize larger files first, then include these summaries in the final context along with smaller files included in full.

**Advantages:**
- Can handle arbitrary project sizes
- Preserves essential information from all files
- Balanced approach to token usage

**Disadvantages:**
- Quality loss from summarization
- Increased processing time from multiple AI calls
- Complex orchestration logic

**Implementation Complexity:** High

## Implementation Strategy

We propose a phased implementation, starting with the most impactful and straightforward approaches, then building toward more complex solutions as needed.

### Phase 1: Foundation (1-2 weeks)

1. **Implement Dynamic Context Trimming**
   - Create a `ContextProcessor` class that takes SmartFile objects and applies trimming rules
   - Implement configurable trimming rules (remove implementations, keep signatures)
   - Add a configuration option to control trimming aggressiveness
   - Support preserving JSDoc comments while removing other comments

2. **Enhance Token Monitoring**
   - Track token usage per file to identify problematic files
   - Implement token budgeting to stay within limits
   - Add detailed token reporting for optimization

### Phase 2: Smart Selection (2-3 weeks)

3. **Implement Task-Specific Contexts**
   - Create specialized context builders for readme, commit messages, and descriptions
   - Customize file selection rules for each task
   - Add configuration options for task-specific settings
4. **Add Smart Content Selection**
   - Implement heuristic rules for file importance
   - Create configuration for inclusion/exclusion patterns
   - Add the ability to focus on specific directories or modules

### Phase 3: Advanced Techniques (3-4 weeks)

5. **Implement File Prioritization**
   - Add git history analysis to identify frequently changed files
   - Implement dependency analysis to identify central files
   - Create a scoring system for file relevance

6. **Add Optional Recursive Summarization**
   - Implement file summarization for large files
   - Create a hybrid approach that mixes full files and summaries
   - Add configuration to control summarization thresholds

### Phase 4: Research-Based Approaches (Future Consideration)

7. **Research and Evaluate Embeddings-Based Retrieval**
   - Prototype embeddings creation for TypeScript files
   - Evaluate performance and accuracy
   - Implement if the benefits justify the complexity

8. **Explore Chunking Strategies**
   - Research effective chunking approaches for documentation
   - Prototype and evaluate performance
   - Implement if the benefits justify the complexity

## Technical Design

### Core Components

1. **ContextBuilder** - Enhanced version of the current ProjectContext

   ```typescript
   interface IContextBuilder {
     buildContext(): Promise<string>;
     getTokenCount(): number;
     setContextMode(mode: 'normal' | 'trimmed' | 'summarized'): void;
     setTokenBudget(maxTokens: number): void;
     setPrioritizationStrategy(strategy: IPrioritizationStrategy): void;
   }
   ```

2. **FileProcessor** - Handles per-file processing and trimming

   ```typescript
   interface IFileProcessor {
     processFile(file: SmartFile): Promise<SmartFile>;
     setProcessingMode(mode: 'full' | 'trim' | 'summarize'): void;
     getTokenCount(): number;
   }
   ```

3. **PrioritizationStrategy** - Ranks files by importance

   ```typescript
   interface IPrioritizationStrategy {
     rankFiles(files: SmartFile[], context: string): Promise<SmartFile[]>;
     setImportanceMetrics(metrics: IImportanceMetrics): void;
   }
   ```
4. **TaskContextFactory** - Creates optimized contexts for specific tasks

   ```typescript
   interface ITaskContextFactory {
     createContextForReadme(projectDir: string): Promise<string>;
     createContextForCommit(projectDir: string, diff: string): Promise<string>;
     createContextForDescription(projectDir: string): Promise<string>;
   }
   ```

### Configuration Options

The system will support configuration via a new section in `npmextra.json`:

```json
{
  "tsdoc": {
    "context": {
      "maxTokens": 190000,
      "defaultMode": "dynamic",
      "taskSpecificSettings": {
        "readme": {
          "mode": "full",
          "includePaths": ["src/", "lib/"],
          "excludePaths": ["test/", "examples/"]
        },
        "commit": {
          "mode": "trimmed",
          "focusOnChangedFiles": true
        },
        "description": {
          "mode": "summarized",
          "includePackageInfo": true
        }
      },
      "trimming": {
        "removeImplementations": true,
        "preserveInterfaces": true,
        "preserveTypeDefs": true,
        "preserveJSDoc": true,
        "maxFunctionLines": 5
      }
    }
  }
}
```

## Cost-Benefit Analysis

### Cost Considerations

1. **Development costs**
   - Initial implementation of foundational components (~30-40 hours)
   - Testing and validation across different project sizes (~10-15 hours)
   - Documentation and configuration examples (~5 hours)

2. **Operational costs**
   - Potentially increased processing time for context preparation
   - Additional API calls for summarization or embeddings approaches
   - Monitoring and maintenance of the system

### Benefits

1. **Scalability**
   - Support for projects of any size, up to and beyond o4-mini's 200K token limit
   - Future-proof design that can adapt to different models and token limits

2. **Quality improvements**
   - More focused contexts lead to better AI outputs
   - Task-specific optimization improves relevance
   - Consistent performance regardless of project size
3. **User experience**
   - Predictable behavior for all project sizes
   - Transparent token usage reporting
   - Configuration options for different usage patterns

## First Deliverable

For immediate improvements, we recommend implementing Dynamic Context Trimming and Task-Specific Contexts first, as these offer the best balance of impact and implementation complexity.

### Implementation Plan for Dynamic Context Trimming

1. Create a basic `ContextTrimmer` class that processes TypeScript files:
   - Remove function bodies but keep signatures
   - Preserve interface and type definitions
   - Keep imports and exports
   - Preserve JSDoc comments

2. Integrate with the existing ProjectContext class:
   - Add a trimming mode option
   - Apply trimming during the context-building process
   - Track and report token savings

3. Modify the CLI to support trimming options:
   - Add a `--trim` flag to enable trimming
   - Add a `--trim-level` option to control aggressiveness
   - Show token usage with and without trimming

This approach could reduce token usage by 40-70% while preserving the essential structure of the codebase, making it suitable for large projects while maintaining high-quality AI outputs.
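A minimal sketch of the `ContextTrimmer` idea, covering only the comment/whitespace rules. A production version would need the TypeScript compiler API to drop function bodies while keeping signatures; the line-based filter and the ~4-characters-per-token savings estimate here are simplifying assumptions.

```typescript
// Simplified trimming pass: strips `//` line comments and blank lines
// while leaving code and JSDoc (`/** ... */`) blocks untouched.
class ContextTrimmer {
  trim(source: string): string {
    return source
      .split('\n')
      .filter((line) => {
        const t = line.trim();
        if (t === '') return false; // drop blank lines
        if (t.startsWith('//')) return false; // drop line comments
        return true; // keep code and JSDoc blocks
      })
      .join('\n');
  }

  // Rough token-savings report, assuming ~4 characters per token.
  savings(original: string, trimmed: string): number {
    const tokens = (s: string) => Math.ceil(s.length / 4);
    return tokens(original) - tokens(trimmed);
  }
}
```

The savings figure is what the `--trim` CLI flag could surface when showing token usage with and without trimming.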