Commit ecbf881

feat: Implement comprehensive code indexing system with multi-language support
## Overview

This commit introduces a complete structured code indexing system (version 3.0) that provides deep code analysis, relationship tracking, and enhanced search capabilities through a JSON-based index format. The system supports 8+ specialized language analyzers plus generic analysis for 20+ additional languages.

## Core Architecture

### IndexBuilder Pipeline

```
IndexBuilder
├── ProjectScanner (file discovery & categorization)
├── LanguageAnalyzerManager (parallel code analysis)
├── RelationshipTracker (cross-file relationships)
└── IndexStructureBuilder (final assembly)
```

### Data Models

- **FileInfo**: Basic file metadata with language detection
- **FunctionInfo**: Function definitions with parameters, decorators, async flags
- **ClassInfo**: Class definitions with inheritance and method lists
- **ImportInfo**: Import statements with module relationships
- **CodeIndex**: Complete structured index with lookups and relationships

## Language Analysis Support

### Specialized Analyzers (8 languages)

- **Python**: AST-based parsing with decorators, async functions, class inheritance
- **JavaScript/TypeScript**: ES6 features, arrow functions, imports/exports
- **Java**: Annotations, interfaces, package imports, method overrides
- **Go**: Structs, goroutines, interfaces, methods
- **C/C++**: Functions, classes, namespaces, templates, inheritance
- **C#**: Properties, LINQ, async/await, attributes
- **Objective-C**: Interfaces, implementations, protocols, categories

### Enhanced Generic Analyzer

- **Pattern-Based Analysis**: Extracts functions, classes, imports using common patterns
- **Language Detection**: Basic heuristics for Ruby, Rust, Swift, PHP, etc.
- **Graceful Fallback**: Provides meaningful analysis for unsupported languages
- **Error Resilience**: Continues processing when specific analyzers fail

### File Type Coverage

**Total Support**: 60+ file extensions including:

- Programming languages: .py, .js, .java, .go, .c, .cpp, .cs, .m, .rb, .php, .swift, .kt, .rs, .scala
- Web technologies: .html, .css, .scss, .vue, .svelte, .astro
- Configuration: .json, .yaml, .xml, .toml, .ini
- Documentation: .md, .rst, .txt
- Database: .sql, .cql, .cypher

## Index Structure

### Unified Extension Management

- Centralized `SUPPORTED_EXTENSIONS` in constants.py
- Consistent filtering across ProjectScanner and LanguageAnalyzerManager
- Only files with supported extensions are indexed
- Clean separation between code files and special files

### JSON Index Format

```json
{
  "project_metadata": {
    "name": "project-name",
    "indexed_at": "timestamp",
    "total_files": "count",
    "total_lines": "count"
  },
  "index_metadata": {
    "version": "3.0",
    "analysis_time_ms": "timing",
    "files_with_errors": [],
    "languages_analyzed": ["detected_languages"]
  },
  "directory_tree": { /* Only supported file types */ },
  "files": [ /* Detailed analysis results with line numbers */ ],
  "lookups": { /* Forward lookup tables */ },
  "reverse_lookups": { /* Relationship queries */ },
  "special_files": { /* File categorization */ }
}
```

## Advanced Features

### Relationship Tracking

- **Function Calls**: Tracks which functions call other functions
- **Class Usage**: Monitors class instantiation relationships
- **Import Dependencies**: Maps module import relationships
- **Reverse Lookups**: Efficient querying of "who calls this function"

### Performance Optimization

- **Parallel Processing**: Concurrent file analysis using ThreadPoolExecutor
- **Memory Management**: On-demand content reading, structured caching
- **Error Handling**: Graceful degradation with comprehensive error reporting
- **Incremental Analysis**: Efficient processing of large codebases

### Special File Detection

- **Entry Points**: main.py, index.js, Main.java, etc.
- **Configuration**: package.json, requirements.txt, pom.xml, etc.
- **Documentation**: README files, LICENSE, CHANGELOG
- **Build Files**: Dockerfile, Jenkinsfile, CI/CD configurations

## Implementation Details

### Key Components

- **constants.py**: Centralized supported extensions and configuration
- **ProjectScanner**: File discovery with strict extension filtering
- **LanguageAnalyzerManager**: Coordinates analyzers with fallback logic
- **GenericAnalyzer**: Pattern-matching for common language constructs
- **RelationshipTracker**: Cross-file relationship analysis
- **IndexBuilder**: Main pipeline coordinator with error handling

### Testing & Validation

- Comprehensive test suite for all language analyzers
- Integration tests with real-world multi-language projects
- Performance validation and error scenario testing
- Mock-based unit testing for complex components

## Backward Compatibility

- **API Consistency**: Existing server.py uses same extension list
- **Version Detection**: 3.0 format with automatic migration support
- **Graceful Degradation**: Enhanced error handling and fallback mechanisms
- **Legacy Support**: Maintains compatibility with existing MCP server functionality

This implementation provides a solid foundation for advanced code analysis and search capabilities while maintaining excellent performance, extensibility, and strict scope control over indexed content.
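
The "who calls this function" reverse lookup listed under Relationship Tracking can be pictured as an inversion of each function's outgoing call list. The following is a minimal sketch under stated assumptions, not the commit's `RelationshipTracker`: the helper `build_function_callers` and the sample data are hypothetical, and the `caller` key simply mirrors what the demo script reads from `reverse_lookups['function_callers']`.

```python
from collections import defaultdict


def build_function_callers(files):
    """Invert per-function call lists into a callee -> callers mapping."""
    callers = defaultdict(list)
    for file_info in files:
        for func in file_info["functions"]:
            for callee in func["calls"]:
                # Record who made the call and where the caller lives.
                callers[callee].append(
                    {"caller": func["name"], "file": file_info["path"]}
                )
    return dict(callers)


# Hypothetical sample data shaped like entries of the index's "files" list.
files = [
    {
        "path": "app.py",
        "functions": [
            {"name": "main", "calls": ["load", "run"]},
            {"name": "run", "calls": ["load"]},
        ],
    },
    {"path": "io.py", "functions": [{"name": "load", "calls": []}]},
]

function_callers = build_function_callers(files)
print([c["caller"] for c in function_callers["load"]])  # ['main', 'run']
```

Building the inverted table once at index time is what makes the later query a dictionary lookup rather than a scan over every function in the project.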
1 parent 15cd9b9 commit ecbf881

21 files changed: +37679 -288 lines changed

demo_index.json

Lines changed: 32633 additions & 0 deletions
Large diffs are not rendered by default.

demo_indexing.py

Lines changed: 228 additions & 0 deletions
@@ -0,0 +1,228 @@
+#!/usr/bin/env python3
+"""
+Demo script for the new code indexing system.
+
+This script demonstrates the capabilities of the new structured indexing system,
+showing how it analyzes code structure, relationships, and provides rich metadata.
+"""
+
+import sys
+import os
+import json
+from pathlib import Path
+
+# Add src to path
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'src'))
+
+from code_index_mcp.indexing import IndexBuilder
+
+
+def demo_indexing():
+    """Demonstrate the indexing system on the current project."""
+    print("🔍 Code Indexing System Demo")
+    print("=" * 50)
+
+    # Build index for current project
+    project_path = "."
+    print(f"📁 Analyzing project: {os.path.abspath(project_path)}")
+
+    builder = IndexBuilder()
+    index = builder.build_index(project_path)
+
+    # Display project metadata
+    print(f"\n📊 Project Metadata:")
+    print(f" Name: {index.project_metadata['name']}")
+    print(f" Total Files: {index.project_metadata['total_files']}")
+    print(f" Total Lines: {index.project_metadata['total_lines']}")
+    print(f" Indexed At: {index.project_metadata['indexed_at']}")
+
+    # Display index metadata
+    print(f"\n🔧 Index Metadata:")
+    print(f" Version: {index.index_metadata['version']}")
+    print(f" Analysis Time: {index.index_metadata['analysis_time_ms']}ms")
+    print(f" Languages: {', '.join(index.index_metadata['languages_analyzed'])}")
+    # Removed supports field as it was not useful
+
+    # Display file analysis
+    print(f"\n📄 File Analysis:")
+    python_files = [f for f in index.files if f['language'] == 'python']
+    print(f" Python files: {len(python_files)}")
+
+    # Show some Python files with their functions and classes
+    for file_info in python_files[:3]:  # Show first 3 Python files
+        print(f" 📝 {file_info['path']}:")
+        if file_info['functions']:
+            func_names = [f['name'] for f in file_info['functions']]
+            print(f" Functions: {', '.join(func_names[:5])}")  # Show first 5
+        if file_info['classes']:
+            class_names = [c['name'] for c in file_info['classes']]
+            print(f" Classes: {', '.join(class_names)}")
+
+    # Display special files
+    print(f"\n📋 Special Files:")
+    for category, files in index.special_files.items():
+        if files:
+            print(f" {category.replace('_', ' ').title()}: {len(files)} files")
+            for file_path in files[:3]:  # Show first 3 files in each category
+                print(f" - {file_path}")
+
+    # Display directory structure (simplified)
+    print(f"\n🌳 Directory Structure:")
+    def print_tree(tree, indent=0):
+        for name, subtree in tree.items():
+            print(" " * indent + f"├── {name}")
+            if isinstance(subtree, dict):
+                print_tree(subtree, indent + 1)
+
+    # Show only first level to avoid too much output
+    for name, subtree in list(index.directory_tree.items())[:5]:
+        print(f"├── {name}")
+        if isinstance(subtree, dict) and subtree:
+            for subname in list(subtree.keys())[:3]:
+                print(f"│ ├── {subname}")
+
+    # Display some lookup examples
+    print(f"\n🔍 Lookup Examples:")
+    print(f" Total path mappings: {len(index.lookups['path_to_id'])}")
+    print(f" Total function mappings: {len(index.lookups['function_to_file_id'])}")
+    print(f" Total class mappings: {len(index.lookups['class_to_file_id'])}")
+
+    # Show some function examples
+    if index.lookups['function_to_file_id']:
+        print(f" Sample functions:")
+        for func_name in list(index.lookups['function_to_file_id'].keys())[:5]:
+            file_id = index.lookups['function_to_file_id'][func_name]
+            file_path = next(f['path'] for f in index.files if f['id'] == file_id)
+            print(f" {func_name} → {file_path}")
+
+    # Display relationship examples
+    print(f"\n🔗 Relationships:")
+    reverse_lookups = index.reverse_lookups
+
+    if reverse_lookups.get('function_callers'):
+        print(f" Function call relationships: {len(reverse_lookups['function_callers'])}")
+        for func_name, callers in list(reverse_lookups['function_callers'].items())[:3]:
+            caller_names = [c['caller'] for c in callers]
+            print(f" {func_name} ← called by: {', '.join(caller_names)}")
+
+    if reverse_lookups.get('imports_module'):
+        print(f" Import relationships: {len(reverse_lookups['imports_module'])}")
+        for module, file_ids in list(reverse_lookups['imports_module'].items())[:3]:
+            print(f" {module} ← imported by {len(file_ids)} files")
+
+    # Show errors if any
+    if index.index_metadata.get('files_with_errors'):
+        print(f"\n⚠️ Files with errors: {len(index.index_metadata['files_with_errors'])}")
+        for error_file in index.index_metadata['files_with_errors'][:3]:
+            print(f" - {error_file}")
+
+    print(f"\n✅ Indexing complete! Index contains {len(index.files)} files.")
+
+    # Optionally save the index to a file
+    save_index = input("\n💾 Save index to file? (y/N): ").lower().strip()
+    if save_index == 'y':
+        output_file = "demo_index.json"
+        with open(output_file, 'w', encoding='utf-8') as f:
+            f.write(index.to_json())
+        print(f"📁 Index saved to {output_file}")
+        print(f" File size: {os.path.getsize(output_file)} bytes")
+
+
+def analyze_specific_file():
+    """Analyze a specific file in detail."""
+    print("\n🔬 Detailed File Analysis")
+    print("=" * 30)
+
+    # Let's analyze the main server file
+    server_file = "src/code_index_mcp/server.py"
+    if not os.path.exists(server_file):
+        print(f"❌ File not found: {server_file}")
+        return
+
+    # Build index and find the server file
+    builder = IndexBuilder()
+    index = builder.build_index(".")
+
+    server_info = None
+    for file_info in index.files:
+        if file_info['path'] == server_file.replace('\\', '/'):
+            server_info = file_info
+            break
+
+    if not server_info:
+        print(f"❌ File not found in index: {server_file}")
+        return
+
+    print(f"📄 File: {server_info['path']}")
+    print(f" Language: {server_info['language']}")
+    print(f" Size: {server_info['size']} bytes")
+    print(f" Lines: {server_info['line_count']}")
+
+    print(f"\n🔧 Functions ({len(server_info['functions'])}):")
+    for func in server_info['functions'][:10]:  # Show first 10 functions
+        params = ', '.join(func['parameters'][:3])  # Show first 3 params
+        if len(func['parameters']) > 3:
+            params += '...'
+        async_marker = "async " if func['is_async'] else ""
+        decorators = f"@{', @'.join(func['decorators'])} " if func['decorators'] else ""
+        print(f" {decorators}{async_marker}{func['name']}({params}) [lines {func['line_start']}-{func['line_end']}]")
+
+        if func['calls']:
+            print(f" → calls: {', '.join(func['calls'][:3])}")
+        if func['called_by']:
+            print(f" ← called by: {', '.join(func['called_by'][:3])}")
+
+    print(f"\n🏗️ Classes ({len(server_info['classes'])}):")
+    for cls in server_info['classes']:
+        inheritance = f" extends {cls['inherits_from']}" if cls['inherits_from'] else ""
+        print(f" {cls['name']}{inheritance} [lines {cls['line_start']}-{cls['line_end']}]")
+        if cls['methods']:
+            print(f" Methods: {', '.join(cls['methods'])}")
+        if cls['instantiated_by']:
+            print(f" Instantiated by: {', '.join(cls['instantiated_by'])}")
+
+    print(f"\n📦 Imports ({len(server_info['imports'])}):")
+    for imp in server_info['imports'][:10]:  # Show first 10 imports
+        if imp['imported_names']:
+            names = ', '.join(imp['imported_names'][:3])
+            if len(imp['imported_names']) > 3:
+                names += '...'
+            print(f" from {imp['module']} import {names}")
+        else:
+            print(f" import {imp['module']}")
+
+    # Show language-specific features
+    if server_info['language_specific']:
+        print(f"\n🐍 Python-specific features:")
+        python_features = server_info['language_specific'].get('python', {})
+
+        if python_features.get('decorators'):
+            print(f" Decorators:")
+            for func_name, decorators in python_features['decorators'].items():
+                print(f" {func_name}: {', '.join(decorators)}")
+
+        if python_features.get('async_functions'):
+            print(f" Async functions: {', '.join(python_features['async_functions'])}")
+
+        if python_features.get('class_inheritance'):
+            print(f" Class inheritance:")
+            for cls_name, base in python_features['class_inheritance'].items():
+                if base:
+                    print(f" {cls_name} → {base}")
+
+
+if __name__ == "__main__":
+    try:
+        demo_indexing()
+
+        # Ask if user wants detailed file analysis
+        detail_analysis = input("\n🔬 Run detailed file analysis? (y/N): ").lower().strip()
+        if detail_analysis == 'y':
+            analyze_specific_file()
+
+    except KeyboardInterrupt:
+        print("\n\n👋 Demo interrupted by user")
+    except Exception as e:
+        print(f"\n❌ Error during demo: {e}")
+        import traceback
+        traceback.print_exc()
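
The demo resolves each sampled function to its file by rescanning `index.files` with `next(...)`. For a saved index, the same lookup could be done with a one-time id-to-path table. This is a rough sketch only: `function_locations` is a hypothetical helper, the inline JSON is a hand-written stand-in for `demo_index.json`, and it assumes each function name maps to a single file id, which the real index format may not guarantee.

```python
import json


def function_locations(index_data):
    """Map each function name to the path of the file that defines it."""
    # One pass to build id -> path avoids rescanning "files" per function.
    id_to_path = {f["id"]: f["path"] for f in index_data["files"]}
    return {
        name: id_to_path[file_id]
        for name, file_id in index_data["lookups"]["function_to_file_id"].items()
        if file_id in id_to_path
    }


# Hypothetical, hand-written stand-in for the contents of demo_index.json.
raw = json.dumps({
    "files": [
        {"id": 0, "path": "demo_indexing.py"},
        {"id": 1, "path": "src/code_index_mcp/server.py"},
    ],
    "lookups": {
        "function_to_file_id": {"demo_indexing": 0, "analyze_specific_file": 0}
    },
})

index_data = json.loads(raw)
for name, path in function_locations(index_data).items():
    print(f"{name} -> {path}")
```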

src/code_index_mcp/constants.py

Lines changed: 69 additions & 2 deletions
@@ -5,5 +5,72 @@
 # Directory and file names
 SETTINGS_DIR = "code_indexer"
 CONFIG_FILE = "config.json"
-INDEX_FILE = "file_index.pickle"
-CACHE_FILE = "content_cache.pickle"
+INDEX_FILE = "index.json"
+# CACHE_FILE removed - no longer needed with new indexing system
+
+# Supported file extensions for code analysis
+# This is the authoritative list used by both old and new indexing systems
+SUPPORTED_EXTENSIONS = [
+    # Core programming languages
+    '.py', '.pyw',  # Python
+    '.js', '.jsx', '.ts', '.tsx',  # JavaScript/TypeScript
+    '.mjs', '.cjs',  # Modern JavaScript
+    '.java',  # Java
+    '.c', '.cpp', '.h', '.hpp',  # C/C++
+    '.cxx', '.cc', '.hxx', '.hh',  # C++ variants
+    '.cs',  # C#
+    '.go',  # Go
+    '.m', '.mm',  # Objective-C
+    '.rb',  # Ruby
+    '.php',  # PHP
+    '.swift',  # Swift
+    '.kt', '.kts',  # Kotlin
+    '.rs',  # Rust
+    '.scala',  # Scala
+    '.sh', '.bash', '.zsh',  # Shell scripts
+    '.ps1',  # PowerShell
+    '.bat', '.cmd',  # Windows batch
+    '.r', '.R',  # R
+    '.pl', '.pm',  # Perl
+    '.lua',  # Lua
+    '.dart',  # Dart
+    '.hs',  # Haskell
+    '.ml', '.mli',  # OCaml
+    '.fs', '.fsx',  # F#
+    '.clj', '.cljs',  # Clojure
+    '.vim',  # Vim script
+    '.zig',  # Zig
+
+    # Web and markup
+    '.html', '.htm',  # HTML
+    '.css', '.scss', '.sass',  # Stylesheets
+    '.less', '.stylus', '.styl',  # Style languages
+    '.md', '.mdx',  # Markdown
+    '.json', '.jsonc',  # JSON
+    '.xml',  # XML
+    '.yml', '.yaml',  # YAML
+
+    # Frontend frameworks
+    '.vue',  # Vue.js
+    '.svelte',  # Svelte
+    '.astro',  # Astro
+
+    # Template engines
+    '.hbs', '.handlebars',  # Handlebars
+    '.ejs',  # EJS
+    '.pug',  # Pug
+
+    # Database and SQL
+    '.sql', '.ddl', '.dml',  # SQL
+    '.mysql', '.postgresql', '.psql',  # Database-specific SQL
+    '.sqlite', '.mssql', '.oracle',  # More databases
+    '.ora', '.db2',  # Oracle and DB2
+    '.proc', '.procedure',  # Stored procedures
+    '.func', '.function',  # Functions
+    '.view', '.trigger', '.index',  # Database objects
+    '.migration', '.seed', '.fixture',  # Migration files
+    '.schema',  # Schema files
+    '.cql', '.cypher', '.sparql',  # NoSQL query languages
+    '.gql',  # GraphQL
+    '.liquibase', '.flyway',  # Migration tools
+]
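
A list like `SUPPORTED_EXTENSIONS` is typically consumed as a membership test on a file's suffix. The sketch below is an assumption about how such filtering might look, since the scanner's code is not shown in this part of the diff: `is_supported` is a hypothetical helper and the list is abbreviated. Listing both `'.r'` and `'.R'` hints that matching may be case-sensitive, so the sketch compares suffixes exactly.

```python
from pathlib import Path

# Abbreviated stand-in for SUPPORTED_EXTENSIONS; the full list is in constants.py.
SUPPORTED_EXTENSIONS = ['.py', '.js', '.r', '.R', '.md']

# A frozenset gives constant-time membership checks while scanning many files.
_SUPPORTED = frozenset(SUPPORTED_EXTENSIONS)


def is_supported(path):
    """Return True when the file's final suffix is in the supported set."""
    return Path(path).suffix in _SUPPORTED


print(is_supported("src/server.py"))   # True
print(is_supported("analysis.R"))      # True
print(is_supported("archive.tar.gz"))  # False: only the last suffix, ".gz", is checked
print(is_supported("Dockerfile"))      # False: no suffix at all
```

Extensionless files such as `Dockerfile` fail a suffix check, which fits the commit message's description of special files being categorized separately from code files.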
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+"""
+Code indexing system for the MCP server.
+
+This module provides structured code analysis, relationship tracking,
+and enhanced search capabilities through a JSON-based index format.
+"""
+
+from .models import (
+    FileInfo,
+    FunctionInfo,
+    ClassInfo,
+    ImportInfo,
+    FileAnalysisResult,
+    CodeIndex
+)
+
+from .builder import IndexBuilder
+from .scanner import ProjectScanner
+from .analyzers import LanguageAnalyzerManager
+
+__all__ = [
+    'FileInfo',
+    'FunctionInfo',
+    'ClassInfo',
+    'ImportInfo',
+    'FileAnalysisResult',
+    'CodeIndex',
+    'IndexBuilder',
+    'ProjectScanner',
+    'LanguageAnalyzerManager'
+]
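
The model classes exported here are defined in `models.py`, which is not shown in this part of the diff. As a rough illustration only, a `FunctionInfo` record could be pictured as a dataclass whose fields mirror the keys the demo script reads (`name`, `parameters`, `decorators`, `is_async`, `line_start`, `line_end`, `calls`, `called_by`). The field types and defaults below are assumptions, not the commit's definition.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class FunctionInfo:
    """Hypothetical shape, inferred from the keys demo_indexing.py reads."""
    name: str
    parameters: List[str] = field(default_factory=list)
    decorators: List[str] = field(default_factory=list)
    is_async: bool = False
    line_start: int = 0
    line_end: int = 0
    calls: List[str] = field(default_factory=list)
    called_by: List[str] = field(default_factory=list)


info = FunctionInfo(
    name="build_index",
    parameters=["self", "project_path"],
    line_start=10,
    line_end=42,
)
record = asdict(info)  # plain dict, ready to pass to json.dumps
print(record["name"], record["is_async"])
```

Using `default_factory` keeps each instance's lists independent, which matters when relationship data such as `called_by` is filled in after the initial analysis pass.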
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+"""
+Language analyzers for code structure extraction.
+
+This module provides language-specific analyzers that extract functions, classes,
+imports, and other code structures from source files.
+"""
+
+from .base import LanguageAnalyzer
+from .manager import LanguageAnalyzerManager
+from .python_analyzer import PythonAnalyzer
+from .javascript_analyzer import JavaScriptAnalyzer
+from .java_analyzer import JavaAnalyzer
+from .go_analyzer import GoAnalyzer
+from .c_analyzer import CAnalyzer
+from .cpp_analyzer import CppAnalyzer
+from .csharp_analyzer import CSharpAnalyzer
+from .objective_c_analyzer import ObjectiveCAnalyzer
+
+__all__ = [
+    'LanguageAnalyzer',
+    'LanguageAnalyzerManager',
+    'PythonAnalyzer',
+    'JavaScriptAnalyzer',
+    'JavaAnalyzer',
+    'GoAnalyzer',
+    'CAnalyzer',
+    'CppAnalyzer',
+    'CSharpAnalyzer',
+    'ObjectiveCAnalyzer'
+]
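
The commit message describes the Python analyzer as AST-based, covering decorators and async functions. The code of `PythonAnalyzer` itself is not shown here, so the following is only a sketch of that general technique using the standard library `ast` module. `extract_functions` is a hypothetical helper, and it relies on `ast.unparse`, which requires Python 3.9 or later.

```python
import ast


def extract_functions(source):
    """Collect name, parameters, decorators and async flag for each function."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            results.append({
                "name": node.name,
                # Plain positional parameters only; *args, **kwargs and
                # keyword-only parameters are ignored in this sketch.
                "parameters": [a.arg for a in node.args.args],
                "decorators": [ast.unparse(d) for d in node.decorator_list],
                "is_async": isinstance(node, ast.AsyncFunctionDef),
                "line_start": node.lineno,
                "line_end": node.end_lineno,
            })
    # ast.walk is breadth-first, so sort to report functions in source order.
    return sorted(results, key=lambda f: f["line_start"])


sample = '''
import functools

@functools.lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

async def fetch(url, timeout):
    return url
'''

for func in extract_functions(sample):
    prefix = "async " if func["is_async"] else ""
    print(f"{prefix}{func['name']}({', '.join(func['parameters'])}) {func['decorators']}")
```

A source file that fails to parse raises `SyntaxError` from `ast.parse`; a caller could catch it and fall back to pattern-based analysis, in the spirit of the fallback behavior the commit message describes.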
