Skip to content

Feature Selection via Genetic Algorithm - Project Status & Roadmap

Implementation Status

Phase 1: Core GA Engine (COMPLETE)

  • Core GA engine (genetic_algorithm.py, population.py)
  • Base classes (Evaluator, Selector, Crossover, Mutation)
  • Basic operators (UniformCrossover, BitFlipMutation, TournamentSelector)
  • ML integration (ModelWrapper, AccuracyEvaluator, dataset loader)
  • Integration tests passing on multiple datasets

Phase 2: Extended Operators & Evaluators (COMPLETE)

Crossover Operators (5 implemented): - Uniform crossover - Single-point crossover - Two-point crossover - Multi-point crossover - Base crossover abstract class

Selection Strategies (5 implemented): - Tournament selector - Roulette selector - Ranking selector - Elitism selector - Base selector abstract class

Evaluators (3 implemented): - Accuracy evaluator - F1-score evaluator - Balanced accuracy evaluator

Mutations (1 implemented): - BitFlip mutation - Dynamic mutation (future enhancement)

Phase 4: Benchmarking & Comparison (COMPLETE)

Baseline Methods (6 implemented): - RFE (Recursive Feature Elimination) - LASSO (L1 regularization) - Mutual Information filter - Chi-squared filter - ANOVA F-value - All Features baseline

Statistical Analysis (All implemented): - Wilcoxon signed-rank test - Mann-Whitney U test - Jaccard stability index - Cohen's d effect size - Population diversity metrics

Experiment Framework: - ExperimentRunner class with full functionality - Unified experiment script (run_experiment.py) - Results serialization (pickle format) - Summary reports

Phase 5: Visualization (COMPLETE)

Visualization Functions (9 implemented): - Fitness evolution plot (best/avg/worst) - Diversity evolution plot - Convergence detection visualization - Feature frequency bar chart - Method comparison (box plots) - Feature count comparison (sparsity) - Multi-metric comparison (2×2 grid) - Accuracy vs sparsity (Pareto frontier) - Combined dashboard (3-panel GA view)

Phase 6: Testing & Documentation (COMPLETE)

Testing (280+ tests, 82% coverage): - Unit tests for all operators - Unit tests for evaluators - Integration tests for complete workflows - Edge case tests (empty features, single sample, etc.) - Coverage reports (HTML format)

Documentation: - Main README (concise, updated) - Getting Started guide - Architecture documentation - Tutorial (700 lines, 6 sections) - Module READMEs (11 modules) - Developer guides (core module) - API documentation in docstrings - Sphinx-generated API reference (future)

Experiments (COMPLETE)

Experiment Scripts: - Unified experiment runner with CLI args - GA vs baselines comparison - Multiple datasets (Iris, Wine, Breast Cancer) - Statistical significance testing - Full visualization suite generation

Results Achieved: - Breast Cancer: 98.3% accuracy with 60% feature reduction - Iris: 98.3% accuracy with 50% feature reduction - Wine: 100% accuracy with 50% feature reduction


Phase 3: Advanced Features (PARTIALLY COMPLETE)

Multi-Objective Optimization (Not Implemented)

  • NSGA-II selector implementation
  • Explicit Pareto frontier tracking
  • Crowding distance calculation
  • Note: Accuracy vs sparsity visualization provides basic multi-objective view

Adaptive Mechanisms (Not Implemented)

  • Adaptive crossover (adjusts rate based on diversity)
  • Feature-aware mutation (uses correlation matrix)
  • Dynamic mutation rate (decreases over generations)

Analysis Tools (Partially Implemented)

  • Cross-validation in evaluators
  • Feature correlation analyzer (standalone tool)
  • Preprocessor utilities (normalization, scaling)

Future Enhancements

High Priority Extensions

1. Performance Optimizations

Status: Not implemented Value: High (scalability for large datasets)

  • Parallel fitness evaluation (multiprocessing)
  • Evaluate population in parallel across CPU cores
  • Expected 4-8× speedup on multi-core systems
  • Implementation: multiprocessing.Pool for population evaluation

  • Fitness caching (memoization)

  • Hash chromosome → fitness mapping
  • Avoid re-evaluating duplicate chromosomes
  • Expected 20-30% speedup

  • Incremental feature selection

  • Start with small feature set, grow gradually
  • Faster convergence for high-dimensional data

2. NSGA-II Multi-Objective Optimization

Status: Not implemented Value: Medium-High (better accuracy-sparsity trade-offs)

  • NSGA-II selector with Pareto ranking
  • Crowding distance calculation
  • Pareto frontier tracking and visualization
  • Interactive Pareto solution selection

Use case: Find optimal balance between accuracy and feature count

3. Adaptive Operators

Status: Not implemented Value: Medium (improved convergence)

  • Adaptive mutation rate
  • High rate early (exploration)
  • Low rate late (exploitation)
  • Formula: mutation_rate = initial_rate * (1 - generation/max_generations)

  • Diversity-based crossover

  • Increase crossover rate when diversity is low
  • Prevent premature convergence

  • Feature-aware mutation

  • Higher mutation probability for correlated features
  • Uses feature correlation matrix

Medium Priority Extensions

4. Regression Support

Status: Not implemented Value: Medium (expands use cases)

  • Regression evaluators (MSE, RMSE, R²)
  • Regression model wrappers
  • Regression-specific metrics
  • Example regression datasets

Implementation effort: Low (2-3 hours)

5. Deep Learning Integration

Status: Not implemented Value: Medium (modern ML pipelines)

  • Keras/TensorFlow model wrapper
  • PyTorch model wrapper
  • GPU-accelerated training
  • Neural architecture search integration

Challenges: - Slow fitness evaluation (neural net training) - Requires GPU for practical use

6. Advanced Visualizations

Status: Partially implemented Value: Low-Medium (better insights)

  • Core visualizations complete
  • Feature stability heatmap (across 50+ runs)
  • 3D Pareto frontier (3+ objectives)
  • Interactive Plotly dashboards
  • Animation of GA evolution over generations

Low Priority Extensions

7. Interactive Dashboard

Status: Not implemented Value: Low (nice-to-have)

  • Streamlit web interface
  • Real-time GA visualization
  • Parameter tuning interface
  • Result comparison tool

Implementation effort: High (8-12 hours)

8. AutoML Comparison

Status: Not implemented Value: Low (research/benchmarking)

  • TPOT integration
  • Auto-sklearn comparison
  • Hyperparameter optimization (Optuna)
  • Benchmark against AutoML libraries

9. Distributed Computing

Status: Not implemented Value: Low (for very large-scale experiments)

  • Ray integration for distributed GA
  • Dask for distributed fitness evaluation
  • Cluster deployment scripts

Performance Improvements

  1. Parallel fitness evaluation
  2. Evaluate population across CPU cores
  3. Use Python's multiprocessing.Pool

  4. Fitness caching

  5. Hash chromosome → fitness mapping
  6. Avoid re-evaluating duplicates

  7. Regression evaluators

  8. Add MSE, RMSE, R² evaluators

Further Work

  1. NSGA-II multi-objective optimization
  2. Pareto ranking for accuracy vs sparsity trade-offs

  3. Scalability experiments

  4. Test on high-dimensional datasets (1000+ features)

  5. Operator comparison study

  6. Which crossover/selector works best for which dataset?

Nice to Have

  1. Jupyter notebooks
  2. Interactive tutorial versions

  3. Sphinx documentation

  4. Generated API reference

References

  • Implementation details: See docs/ARCHITECTURE.md
  • Usage guide: See docs/GETTING_STARTED.md
  • Tutorial: See docs/TUTORIAL.md
  • Module docs: See fsga/*/README.md
  • Developer guide: See fsga/core/README_DEV.md