Algorithm Recommender System

A research-grade algorithm recommendation system that automatically selects the optimal algorithm for sorting, array searching, graph searching, string searching, and tree searching problems. The system combines algorithmic theory (correctness constraints) with machine learning (data-driven selection) to provide intelligent, explainable recommendations.

🎯 Project Overview

This system addresses the algorithm selection problem: given a problem instance, which algorithm should be used for optimal performance?

Key Innovation

Traditional approaches either:

  1. Use rule-based systems (rigid, don't adapt to data)
  2. Use pure ML (may violate correctness constraints)

Our approach is a hybrid system that:

  • Enforces strict correctness constraints (e.g., never recommends Binary Search for unsorted arrays)
  • Uses ML for optimization among valid algorithms
  • Provides human-readable explanations for all recommendations
  • Includes automatic structure inference to detect input type
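The core decision loop can be sketched in a few lines (hypothetical names; the real recommender classes live in the per-domain modules). Constraints act as a hard filter, and ML scores only rank the survivors:

```python
def recommend(features, constraints, model_scores):
    """Pick the best-scoring algorithm among those that satisfy all constraints."""
    valid = [algo for algo, is_valid in constraints.items() if is_valid(features)]
    if not valid:
        raise ValueError("no correct algorithm for this input")
    # ML only ranks algorithms that are provably correct for the input
    return max(valid, key=lambda algo: model_scores.get(algo, 0.0))

constraints = {
    "linear_search": lambda f: True,            # correct on any array
    "binary_search": lambda f: f["is_sorted"],  # correct only on sorted arrays
}
scores = {"linear_search": 0.2, "binary_search": 0.8}  # e.g. from a classifier

# Unsorted input: binary_search is filtered out despite its higher ML score.
recommend({"is_sorted": False}, constraints, scores)  # -> 'linear_search'
```

The same two-stage shape (filter, then rank) applies in every domain below.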

πŸ“Š Domains Supported

1. Sorting Algorithms

| Algorithm | Best Case | Average | Worst | Space | When to Use |
|---|---|---|---|---|---|
| Insertion Sort | O(n) | O(n²) | O(n²) | O(1) | Small/nearly sorted arrays |
| Selection Sort | O(n²) | O(n²) | O(n²) | O(1) | Very small arrays |
| Merge Sort | O(n log n) | O(n log n) | O(n log n) | O(n) | Large arrays, stability needed |
| Quick Sort | O(n log n) | O(n log n) | O(n²) | O(log n) | General purpose |
| Heap Sort | O(n log n) | O(n log n) | O(n log n) | O(1) | Guaranteed performance + in-place |

Features Extracted:

  • Input size (log-normalized)
  • Sortedness measure (inversions ratio)
  • Duplicate ratio
  • Distribution characteristics

2. Array Search Algorithms

| Algorithm | Complexity | Constraint | When to Use |
|---|---|---|---|
| Linear Search | O(n) | None | Unsorted arrays, small arrays |
| Binary Search | O(log n) | Sorted only | Large sorted arrays |
| Jump Search | O(√n) | Sorted only | Sorted arrays, forward-only access |
| Exponential Search | O(log n) | Sorted + Uniform | Large sorted uniform arrays |

Correctness Constraints (STRICT):

  • ❌ Binary/Jump/Exponential search NEVER recommended for unsorted arrays
  • ❌ Exponential search NEVER recommended for non-uniform distributions
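To see why the constraint must be strict, here is a standard binary search (built on the stdlib bisect module) run on an unsorted array: it can silently report a present element as missing, rather than failing loudly.

```python
from bisect import bisect_left

def binary_search(arr, target):
    """Standard binary search; only correct when arr is sorted ascending."""
    i = bisect_left(arr, target)
    return i if i < len(arr) and arr[i] == target else -1

binary_search([1, 2, 3, 4, 5], 3)   # -> 2 (correct: input is sorted)
binary_search([5, 2, 8, 1, 9], 1)   # -> -1, although 1 is present at index 3
```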

3. Graph Search Algorithms

| Algorithm | Complexity | Constraint | When to Use |
|---|---|---|---|
| BFS | O(V + E) | Unweighted only | Shortest path in unweighted graphs |
| Dijkstra | O((V+E) log V) | Non-negative weights | General weighted graphs |
| Bellman-Ford | O(VE) | Handles negatives | Graphs with negative weights |
| A* | O(E) best case | Requires heuristic | When good heuristic available |

Correctness Constraints (STRICT):

  • ❌ BFS NEVER recommended for weighted graphs
  • ❌ Dijkstra NEVER recommended when negative weights exist
  • ❌ A* NEVER recommended without heuristic
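A sketch of how such a filter could look for graphs (illustrative only; the project's feature extractor derives is_weighted and has_negative_weights as described in the methodology section, and the "any weight ≠ 1" test below is a simplifying assumption):

```python
def valid_graph_algorithms(graph, has_heuristic=False):
    """Filter graph-search algorithms by the correctness rules above.

    `graph` is an adjacency list: node -> list of (neighbor, weight).
    """
    weights = [w for edges in graph.values() for _, w in edges]
    is_weighted = any(w != 1 for w in weights)  # simple proxy for "weighted"
    has_negative = any(w < 0 for w in weights)
    valid = ["bellman_ford"]                    # correct on any graph here
    if not has_negative:
        valid.append("dijkstra")
    if not is_weighted:
        valid.append("bfs")                     # unweighted graphs only
    if has_heuristic and not has_negative:
        valid.append("a_star")
    return valid

valid_graph_algorithms({0: [(1, -2.0)], 1: []})
# -> ['bellman_ford']: a negative weight rules out Dijkstra, BFS, and A*
```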

4. String Search Algorithms (NEW)

| Algorithm | Complexity | Constraint | When to Use |
|---|---|---|---|
| Naive String Search | O(nm) | None | Single pattern, short text |
| Trie Search | O(m) per query | Multiple patterns OR prefix queries | Multiple patterns, autocomplete |

Correctness Constraints (STRICT):

  • ❌ Trie search NEVER recommended for single exact pattern (preprocessing overhead not justified)
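A minimal sketch of trie-based multi-pattern matching (not the project's implementation): the patterns are preprocessed into a trie once, then the text is scanned with one trie walk per start position, so adding more patterns does not add more passes over the text.

```python
def build_trie(patterns):
    """Nested-dict trie; the multi-char key '_end' marks a complete pattern."""
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node["_end"] = p
    return root

def trie_search(text, patterns):
    """All (start_index, pattern) matches, one trie walk per start position."""
    root = build_trie(patterns)
    hits = []
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.get(ch)  # '_end' can never collide with a single char
            if node is None:
                break
            if "_end" in node:
                hits.append((i, node["_end"]))
    return hits

trie_search("hello world hello", ["hello", "world", "foo"])
# -> [(0, 'hello'), (6, 'world'), (12, 'hello')]
```

The build_trie preprocessing is exactly the overhead that makes trie search a poor fit for a single exact pattern.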

5. Tree Search Algorithms (NEW)

| Algorithm | Complexity | Constraint | When to Use |
|---|---|---|---|
| BST Search | O(h) | None | Any BST, unknown balance |
| Balanced Tree Search | O(log n) | Balanced tree only | AVL/Red-Black trees |

Correctness Constraints (STRICT):

  • ❌ Balanced tree search NEVER recommended for unbalanced trees (loses O(log n) guarantee)

πŸ—οΈ Architecture

algorithm_recommender/
β”œβ”€β”€ sorting/
β”‚   β”œβ”€β”€ algorithms.py      # Sorting algorithm implementations
β”‚   β”œβ”€β”€ features.py        # Feature extraction
β”‚   β”œβ”€β”€ dataset.py         # Training data generation
β”‚   β”œβ”€β”€ recommender.py     # ML-based recommender
β”‚   β”œβ”€β”€ train.py           # Training script
β”‚   └── test.py            # Test script
β”‚
β”œβ”€β”€ searching/
β”‚   β”œβ”€β”€ array_search/
β”‚   β”‚   β”œβ”€β”€ algorithms.py  # Search algorithm implementations
β”‚   β”‚   β”œβ”€β”€ features.py    # Feature extraction with correctness
β”‚   β”‚   β”œβ”€β”€ dataset.py     # Training data (respects constraints)
β”‚   β”‚   β”œβ”€β”€ recommender.py # Recommender with constraint enforcement
β”‚   β”‚   β”œβ”€β”€ train.py       # Training script
β”‚   β”‚   └── test.py        # Correctness-focused tests
β”‚   β”‚
β”‚   β”œβ”€β”€ graph_search/
β”‚   β”‚   β”œβ”€β”€ algorithms.py  # BFS, Dijkstra, Bellman-Ford, A*
β”‚   β”‚   β”œβ”€β”€ features.py    # Graph feature extraction
β”‚   β”‚   β”œβ”€β”€ dataset.py     # Training data generation
β”‚   β”‚   β”œβ”€β”€ recommender.py # Graph search recommender
β”‚   β”‚   β”œβ”€β”€ train.py       # Training script
β”‚   β”‚   └── test.py        # Path correctness tests
β”‚   β”‚
β”‚   β”œβ”€β”€ string_search/     # NEW: String/pattern matching
β”‚   β”‚   β”œβ”€β”€ algorithms.py  # Naive search, Trie search
β”‚   β”‚   β”œβ”€β”€ features.py    # Text/pattern feature extraction
β”‚   β”‚   β”œβ”€β”€ constraints.py # Trie validity constraints
β”‚   β”‚   β”œβ”€β”€ dataset.py     # Training data generation
β”‚   β”‚   β”œβ”€β”€ recommender.py # String search recommender
β”‚   β”‚   β”œβ”€β”€ train.py       # Training script
β”‚   β”‚   └── test.py        # Correctness tests
β”‚   β”‚
β”‚   └── tree_search/       # NEW: Tree-based searching
β”‚       β”œβ”€β”€ algorithms.py  # BST, AVL tree, search functions
β”‚       β”œβ”€β”€ features.py    # Tree feature extraction (balance, height)
β”‚       β”œβ”€β”€ constraints.py # Balance threshold constraints
β”‚       β”œβ”€β”€ dataset.py     # Training data generation
β”‚       β”œβ”€β”€ recommender.py # Tree search recommender
β”‚       β”œβ”€β”€ train.py       # Training script
β”‚       └── test.py        # Balance-aware tests
β”‚
β”œβ”€β”€ models/                 # Saved trained models
β”‚   β”œβ”€β”€ sorting_model.pkl
β”‚   β”œβ”€β”€ array_search_model.pkl
β”‚   β”œβ”€β”€ graph_search_model.pkl
β”‚   β”œβ”€β”€ string_search_model.pkl  # NEW
β”‚   └── tree_search_model.pkl    # NEW
β”‚
β”œβ”€β”€ utils/                  # Shared utilities
β”‚   β”œβ”€β”€ timing.py          # Benchmarking utilities
β”‚   β”œβ”€β”€ statistics.py      # Statistical helpers
β”‚   β”œβ”€β”€ validation.py      # Input validation
β”‚   └── structure_inference.py  # NEW: Auto-detect input type
β”‚
β”œβ”€β”€ web/                    # Web interface
β”‚   β”œβ”€β”€ app.py             # Flask application
β”‚   β”œβ”€β”€ templates/         # HTML templates
β”‚   └── static/            # CSS/JS assets
β”‚
β”œβ”€β”€ unified_interface.py    # Single entry point for all recommendations
β”œβ”€β”€ __init__.py            # Package initialization
└── README.md              # This file

πŸš€ Quick Start

Installation

# Clone or download the project
cd algorithm_recommender

# Install dependencies
pip install numpy scikit-learn

Basic Usage

from algorithm_recommender import AlgorithmRecommender

# Initialize
recommender = AlgorithmRecommender()

# Train models (first time) or load existing models
recommender.train_all()  # Generates training data and trains models
# OR
recommender.load_models()  # Load pre-trained models

# ===== SORTING =====
result = recommender.recommend_sorting([5, 2, 8, 1, 9, 3, 7])
print(f"Algorithm: {result['algorithm']}")
print(f"Why: {result['explanation']}")
# Output: Algorithm: insertion_sort
# Output: Why: Array is small array, highly unsorted β†’ Insertion Sort selected.

# Actually sort the array
sorted_arr, algo, why = recommender.sort([5, 2, 8, 1, 9])
print(sorted_arr)  # [1, 2, 5, 8, 9]

# ===== ARRAY SEARCH =====
result = recommender.recommend_array_search([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(f"Algorithm: {result['algorithm']}")
print(f"Valid options: {result['valid_algorithms']}")
# Output: Algorithm: binary_search
# Output: Valid options: ['linear_search', 'binary_search', 'jump_search', 'exponential_search']

# Search for a value
idx, algo, why = recommender.search_array([1, 2, 3, 4, 5], 3)
print(f"Found at index {idx} using {algo}")

# ===== GRAPH SEARCH =====
graph = {
    0: [(1, 4.0), (2, 1.0)],
    1: [(3, 1.0)],
    2: [(1, 2.0), (3, 5.0)],
    3: []
}
result = recommender.recommend_graph_search(graph)
print(f"Algorithm: {result['algorithm']}")
# Output: Algorithm: dijkstra

# Find shortest path
path, cost, algo, why = recommender.search_graph(graph, start=0, goal=3)
print(f"Path: {path}, Cost: {cost}")
# Output: Path: [0, 2, 1, 3], Cost: 4.0

# ===== STRING SEARCH (NEW) =====
# Single pattern - only naive search valid
result = recommender.recommend_string_search("hello world", ["world"])
print(f"Algorithm: {result['algorithm']}")
# Output: Algorithm: naive_string_search

# Multiple patterns - trie search valid and optimal
result = recommender.recommend_string_search("hello world hello", ["hello", "world", "foo"])
print(f"Algorithm: {result['algorithm']}")
print(f"Valid options: {result['valid_algorithms']}")
# Output: Algorithm: trie_search
# Output: Valid options: ['naive_string_search', 'trie_search']

# ===== TREE SEARCH (NEW) =====
from searching.tree_search.algorithms import BST, AVLTree

# Balanced AVL tree - balanced search valid
avl = AVLTree()
for k in [5, 3, 7, 1, 9, 4, 6]:
    avl.insert(k)
result = recommender.recommend_tree_search(avl)
print(f"Algorithm: {result['algorithm']}")
# Output: Algorithm: balanced_tree_search

# Skewed BST - only bst_search valid
bst = BST()
for k in range(1, 20):  # Sorted insertion creates skew
    bst.insert(k)
result = recommender.recommend_tree_search(bst)
print(f"Algorithm: {result['algorithm']}")
# Output: Algorithm: bst_search

# ===== AUTO-DETECTION (NEW) =====
# The system can automatically detect what kind of problem you have
result = recommender.recommend_auto([5, 2, 8, 1, 9], operation_hint="sort")
print(f"Task: {result['task_type']}, Algorithm: {result['algorithm']}")
# Output: Task: sorting, Algorithm: insertion_sort

πŸ§ͺ Training and Testing

Train All Models

cd algorithm_recommender
python train_all.py  # Trains all 5 models
# OR
python unified_interface.py  # Runs demo which trains if needed

Train Individual Models

# Sorting
python -m sorting.train --num_instances 1000

# Array Search
python -m searching.array_search.train --num_instances 500

# Graph Search
python -m searching.graph_search.train --num_instances 400

# String Search (NEW)
python -m searching.string_search.train --num_instances 300

# Tree Search (NEW)
python -m searching.tree_search.train --num_instances 300

Run Tests

# Sorting tests
python -m sorting.test

# Array search tests (correctness-focused)
python -m searching.array_search.test

# Graph search tests
python -m searching.graph_search.test

# String search tests (NEW)
python -m searching.string_search.test

# Tree search tests (NEW)
python -m searching.tree_search.test

🌐 Web Application

The system includes a professional web interface built with Flask.

Running the Web Application

cd algorithm_recommender

# Option 1: Run directly
python web/app.py

# Option 2: Use the batch file (Windows)
web\run_web.bat

Then open http://localhost:5000 in your browser.

Web Features

  • Task Detection: Automatically detects sorting, array search, graph search, string search, or tree search from natural language
  • Dynamic Input Forms: Different inputs for each task type
  • Feature Visualization: Shows extracted features in real-time
  • Algorithm Recommendation: Displays the recommended algorithm with explanation
  • Confidence Scores: Visual bar chart of algorithm probabilities
  • Algorithm Execution: Execute the recommended algorithm and see results
  • Failure Mode Analysis: Educational feature showing constraint enforcement (see below)

Failure Mode Analysis (Optional Feature)

The web app includes an optional, read-only "Failure Mode Analysis" feature that demonstrates why correctness constraints are essential:

What It Shows:

  1. Algorithm WITHOUT Constraints: What the raw ML model would predict
  2. Why It's Invalid: Explanation of the correctness constraint violated
  3. Corrected Algorithm: The valid algorithm after constraint enforcement

How to Use:

  1. Get a recommendation for any task
  2. In the results section, click "Analyze Constraints" button
  3. View the side-by-side comparison of unconstrained vs constrained choices

Example (Array Search on unsorted array):

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Failure Mode Analysis                                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Input: Array [5, 2, 8, 1, 9] (UNSORTED)                      β”‚
β”‚                                                               β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
β”‚ β”‚ WITHOUT Constraints β”‚  β”‚ WITH Constraints    β”‚             β”‚
β”‚ β”‚ Binary Search ❌    β”‚  β”‚ Linear Search βœ…    β”‚             β”‚
β”‚ β”‚ [INVALID]           β”‚  β”‚ [GUARANTEED CORRECT]β”‚             β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
β”‚                                                               β”‚
β”‚ ⚠️ Why Invalid: Binary Search requires sorted array.         β”‚
β”‚    The input array is NOT sorted, so Binary Search would     β”‚
β”‚    produce incorrect results.                                 β”‚
β”‚                                                               β”‚
β”‚ βœ… Constraint System: Overrode ML choice to Linear Search    β”‚
β”‚    which works correctly on any array.                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This feature helps users understand:

  • Why pure ML is insufficient for algorithm selection
  • How correctness constraints prevent invalid recommendations
  • The value of the hybrid ML + theory approach

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| / | GET | Main web interface |
| /api/detect-task | POST | Detect task from problem statement |
| /api/recommend/sorting | POST | Get sorting recommendation |
| /api/recommend/array-search | POST | Get array search recommendation |
| /api/recommend/graph-search | POST | Get graph search recommendation |
| /api/recommend/string-search | POST | Get string search recommendation |
| /api/recommend/tree-search | POST | Get tree search recommendation |
| /api/execute/sorting | POST | Execute sorting algorithm |
| /api/execute/array-search | POST | Execute search algorithm |
| /api/execute/graph-search | POST | Execute graph search algorithm |
| /api/failure-mode | POST | Failure mode analysis (unconstrained vs constrained) |
| /api/algorithms | GET | List all supported algorithms |

Legacy Streamlit App

A Streamlit version is also available:

streamlit run app.py

πŸ”¬ Research Methodology

Feature Extraction

Each domain has a dedicated feature extractor that computes relevant characteristics:

Sorting Features:

  • size_log: Normalized log of array size
  • sortedness: 1 - (inversions / max_inversions)
  • duplicate_ratio: Fraction of duplicate elements
  • is_nearly_sorted: Binary indicator for high sortedness
  • is_reverse_sorted: Binary indicator for descending order
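The sortedness feature can be computed directly from the inversion count. An O(n²) reference version for clarity (a merge-based O(n log n) count would be the practical choice for large inputs):

```python
def sortedness(arr):
    """1 - inversions / max_inversions; 1.0 = sorted, 0.0 = reverse sorted."""
    n = len(arr)
    if n < 2:
        return 1.0
    inversions = sum(
        1 for i in range(n) for j in range(i + 1, n) if arr[i] > arr[j]
    )
    return 1.0 - inversions / (n * (n - 1) / 2)

sortedness([1, 2, 3, 4])   # -> 1.0
sortedness([4, 3, 2, 1])   # -> 0.0
sortedness([1, 2, 4, 3])   # -> 0.833... (nearly sorted)
```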

Array Search Features:

  • size_log: Normalized input size
  • is_sorted: Critical for algorithm validity
  • is_uniform: Distribution uniformity measure
  • gap_consistency: Measure of value spacing regularity

Graph Search Features:

  • num_nodes_log, num_edges_log: Size metrics
  • edge_density: E / (V * (V-1))
  • is_weighted: Whether edges have varying weights
  • has_negative_weights: Critical constraint feature
  • has_heuristic: Enables A* selection

Labeling Strategy

Training labels are determined using a hybrid empirical + theoretical approach:

  1. Empirical: Benchmark all valid algorithms on generated instances
  2. Theoretical: Apply domain-knowledge priors to adjust scores
  3. Hybrid Score: (1-Ξ±) * empirical + Ξ± * theoretical

This prevents dominance bias where one algorithm (e.g., QuickSort) always wins empirically but isn't theoretically optimal for all cases.
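The blend is a single convex combination per algorithm. A small illustration (alpha and the scores below are made up to show the mechanism):

```python
def hybrid_scores(empirical, theoretical, alpha=0.5):
    """Score = (1 - alpha) * empirical + alpha * theoretical, per algorithm."""
    return {
        algo: (1 - alpha) * emp + alpha * theoretical.get(algo, 0.0)
        for algo, emp in empirical.items()
    }

# QuickSort wins the raw benchmark, but a theoretical prior favoring
# Insertion Sort on nearly sorted input shifts the training label:
scores = hybrid_scores(
    empirical={"quick_sort": 0.9, "insertion_sort": 0.7},
    theoretical={"quick_sort": 0.1, "insertion_sort": 0.9},
)
max(scores, key=scores.get)   # -> 'insertion_sort'
```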

ML Model

  • Algorithm: RandomForestClassifier
  • Why RandomForest:
    • Handles non-linear relationships
    • Provides feature importance
    • Robust to overfitting
    • Works well with small-medium datasets
  • Hyperparameters: 100 trees, max_depth=12-15, balanced class weights
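In scikit-learn terms, that configuration looks roughly like this (illustrative values within the stated ranges; the toy data exists only to show the feature-vector-to-label shape):

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,         # 100 trees
    max_depth=12,             # within the stated 12-15 range
    class_weight="balanced",  # compensate for uneven algorithm-label counts
    random_state=42,          # fixed seed for reproducibility
)

# Toy fit: feature vectors (e.g. [size_log, is_sorted]) -> best-algorithm labels.
X = [[0.1, 1.0], [0.9, 0.0], [0.2, 1.0], [0.8, 0.0]]
y = ["binary_search", "linear_search", "binary_search", "linear_search"]
model.fit(X, y)
model.predict([[0.15, 1.0]])  # returns one of the trained labels
```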

πŸ“ˆ Evaluation Metrics

During training, we evaluate:

  • Cross-validation accuracy (5-fold)
  • Test accuracy on held-out data
  • Classification report per algorithm
  • Feature importance for interpretability

During testing, we verify:

  • Correctness constraints are never violated
  • Algorithm diversity - different algorithms for different regimes
  • Path/sort correctness - results are actually correct

πŸŽ“ Academic Rigor

This project follows research-level expectations:

  1. Clear separation between rule-based constraints and ML-based optimization
  2. Extensive documentation with docstrings and comments
  3. Reproducibility through random seed control
  4. Theoretical grounding with complexity analysis
  5. Explainability - every recommendation includes justification
  6. Extensibility - easy to add new algorithms

πŸ“š Extending the System

Adding a New Sorting Algorithm

  1. Implement in sorting/algorithms.py:

def tim_sort(arr: List[float]) -> List[float]:
    """Tim Sort implementation."""
    # ... implementation
    return result

SORTING_ALGORITHMS['tim_sort'] = tim_sort

  2. Add theoretical priors in sorting/dataset.py:

PRIORS['tim_sort'] = {
    'is_nearly_sorted': 0.4,
    'size_large': 0.3,
}

  3. Retrain the model.

Adding a New Search Algorithm

Similar process, but ensure you also define correctness constraints in ALGORITHM_CONSTRAINTS.
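The constraint entry could take the same predicate-per-algorithm shape as the filtering logic (hypothetical structure; check the array_search module for the actual ALGORITHM_CONSTRAINTS format). For example, a new interpolation search valid only on sorted, uniformly distributed arrays:

```python
# Hypothetical constraint entries; feature names mirror the array-search
# features described above (is_sorted, is_uniform).
ALGORITHM_CONSTRAINTS = {
    "linear_search": lambda f: True,
    "interpolation_search": lambda f: f["is_sorted"] and f["is_uniform"],
}

features = {"is_sorted": True, "is_uniform": False}
valid = [a for a, ok in ALGORITHM_CONSTRAINTS.items() if ok(features)]
# -> ['linear_search']: interpolation_search excluded on non-uniform data
```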

πŸ”§ Requirements

  • Python 3.8+
  • NumPy
  • scikit-learn

πŸ“„ License

MIT License

πŸ‘₯ Authors

Algorithm Recommender Research Team

πŸ™ Acknowledgments

This project draws inspiration from:

  • Algorithm Selection literature (Rice, 1976)
  • AutoML and meta-learning research
  • Classic algorithms textbooks (CLRS, Sedgewick)
