marksolly

This PR adds a max_line_length parameter to the search_code_advanced tool. It addresses an issue where search results containing very long lines (e.g., from minified JavaScript files) cause excessive token usage and unexpected behavior when the results are consumed by Large Language Models (LLMs).

Problem:

When using the search_code_advanced tool with context_lines, if a match is found in a file with extremely long lines (such as a .min.js file), the entire line is returned as context. This can result in a massive, unexpected output, leading to:

  • Token Flooding: The large output consumes a significant number of tokens, which can be costly and inefficient.
  • LLM Input Issues: The excessive input can confuse or overwhelm LLMs, leading to poor or irrelevant responses.
  • Performance Degradation: Processing and transmitting large amounts of unnecessary data can slow down the entire workflow.

Solution:

This PR introduces a max_line_length parameter to the search_code_advanced tool, which defaults to 200 characters. This parameter truncates any line in the search results that exceeds the specified length, appending ... (truncated) to indicate that the line has been shortened.
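The truncation rule described above could look roughly like the sketch below. The helper name truncate_line is hypothetical, and the choice to keep exactly the first max_line_length characters before appending the marker is an assumption; the actual change lives in parse_search_output and may differ in detail.

```python
from typing import Optional

# Marker text taken from the PR description.
TRUNCATION_SUFFIX = "... (truncated)"


def truncate_line(line: str, max_line_length: Optional[int] = 200) -> str:
    """Return `line` unchanged if it fits, else a shortened copy with a marker.

    Hypothetical helper: the name and exact cut-off behavior are assumptions.
    """
    if max_line_length is None or len(line) <= max_line_length:
        return line
    # Keep the first max_line_length characters, then flag the cut.
    return line[:max_line_length] + TRUNCATION_SUFFIX
```

With this sketch, a 250-character line from a .min.js file comes back as 200 characters plus the marker, while short lines pass through untouched.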

Key Changes:

  • src/code_index_mcp/server.py: The search_code_advanced tool now accepts a max_line_length parameter with a default value of 200.
  • src/code_index_mcp/services/search_service.py: The search_code method now accepts and passes the max_line_length parameter to the underlying search strategies.
  • src/code_index_mcp/search/base.py: The parse_search_output function now includes logic to truncate lines based on max_line_length. The SearchStrategy abstract base class has been updated to include this parameter in the search method.
  • src/code_index_mcp/search/*.py: All concrete SearchStrategy implementations (ugrep, ripgrep, ag, grep, and basic) have been updated to accept and utilize the max_line_length parameter.
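As a rough illustration of how the parameter might be threaded from a strategy down to the parser, here is a self-contained sketch. The signatures, the result shape, and the CannedStrategy class are assumptions made for the example and are not copied from the repository; only the names SearchStrategy, search, parse_search_output, and max_line_length come from the description above.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple

# Assumed result shape: file path -> list of (line number, line content).
SearchResults = Dict[str, List[Tuple[int, str]]]


def parse_search_output(output: str, max_line_length: Optional[int] = None) -> SearchResults:
    """Parse 'path:line_number:content' lines, truncating over-long content.

    Simplification: paths that contain ':' (e.g. Windows drive letters)
    are not handled in this sketch.
    """
    results: SearchResults = {}
    for raw in output.splitlines():
        parts = raw.split(":", 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # skip separators and malformed lines
        path, line_no, content = parts
        if max_line_length is not None and len(content) > max_line_length:
            content = content[:max_line_length] + "... (truncated)"
        results.setdefault(path, []).append((int(line_no), content))
    return results


class SearchStrategy(ABC):
    @abstractmethod
    def search(self, pattern: str, base_path: str, context_lines: int = 0,
               max_line_length: Optional[int] = None) -> SearchResults:
        """Run the search and return parsed, optionally truncated results."""


class CannedStrategy(SearchStrategy):
    """Hypothetical stand-in for a real backend such as ripgrep."""

    def __init__(self, raw_output: str) -> None:
        self.raw_output = raw_output

    def search(self, pattern, base_path, context_lines=0, max_line_length=None):
        # A real strategy would invoke its CLI tool here; this one only
        # forwards the limit to the shared parser.
        return parse_search_output(self.raw_output, max_line_length)
```

Keeping the truncation in the shared parser means each concrete strategy only has to forward the argument, which matches the division of work listed in the key changes.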

@johnhuang316
Owner

Suggestion: Default to No Limit for Backward Compatibility

Great work on this PR! The max_line_length parameter is a valuable addition that solves the token flooding issue with minified files.

However, I'd like to suggest changing the default value from 200 to None (no limit) for better backward compatibility:

Current PR: max_line_length: int = 200
Suggested: max_line_length: Optional[int] = None

Rationale:

  1. Preserves existing behavior - Users won't see unexpected truncation without explicitly opting in
  2. Follows principle of least surprise - Tools should behave predictably by default
  3. Optional enhancement - Users who need truncation can explicitly set the parameter
  4. Gradual adoption - Can be documented with usage recommendations rather than forced

Implementation:

I've implemented this change along with comprehensive unit tests. The modification ensures:

  • Default behavior remains unchanged (no truncation)
  • All search strategies properly handle max_line_length=None
  • Truncation works correctly when explicitly set
  • Full test coverage for all scenarios
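The scenarios listed above might be exercised with tests along these lines. This is only a sketch: truncate_line is a hypothetical helper standing in for the real truncation logic, here with the suggested default of None, and the test names are invented for illustration rather than taken from the actual test suite.

```python
from typing import Optional


def truncate_line(line: str, max_line_length: Optional[int] = None) -> str:
    """Hypothetical helper with the suggested opt-in default (None = no limit)."""
    if max_line_length is None or len(line) <= max_line_length:
        return line
    return line[:max_line_length] + "... (truncated)"


def test_default_does_not_truncate():
    long_line = "x" * 10_000
    # Default behavior stays unchanged: the full line is returned.
    assert truncate_line(long_line) == long_line


def test_explicit_none_does_not_truncate():
    assert truncate_line("x" * 500, max_line_length=None) == "x" * 500


def test_explicit_limit_truncates():
    result = truncate_line("x" * 500, max_line_length=200)
    assert result == "x" * 200 + "... (truncated)"


def test_line_at_limit_is_untouched():
    # A line exactly at the limit should not gain the marker.
    assert truncate_line("x" * 200, max_line_length=200) == "x" * 200
```

The functions follow pytest naming conventions, so a runner such as pytest would collect them automatically.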

Recommendation for Users:

The feature can be documented with a recommendation like:

Tip: When searching files that may contain very long lines (like minified JS), consider setting max_line_length=200 to prevent excessive token usage.

This approach gives users control while protecting them from unexpected behavior changes.

What do you think about this approach?
