| Title: | Selection of Training Populations by Genetic Algorithm |
|---|---|
| Description: | Advanced genetic algorithms for optimal subset selection in high-dimensional prediction problems. Provides efficient single and multi-objective optimization for training population selection with comprehensive criteria including A-, D-, E-optimality, prediction error variance (PEV), Cook's distance (CD), and distance-based measures. Features multi-criteria convergence detection, restart mechanisms, rank-based selection with pressure control, adaptive mutation, diversity preservation, and numerically stable matrix operations. Includes convergence diagnostics, configurable optimization windows, and both modern clean interfaces with legacy compatibility functions. |
| Authors: | Deniz Akdemir |
| Maintainer: | Deniz Akdemir <[email protected]> |
| License: | GPL-3 |
| Version: | 7.0.2 |
| Built: | 2026-05-28 11:02:05 UTC |
| Source: | https://github.com/denizakdemir/stpga |
Computes the A-optimality criterion, which minimizes the trace of the inverse covariance matrix. Lower values indicate better designs.
a_optimality(train, test = NULL, P, lambda = 1e-06, C = NULL)a_optimality(train, test = NULL, P, lambda = 1e-06, C = NULL)
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and variables as columns |
lambda |
Ridge regularization parameter for numerical stability (default: 1e-6) |
C |
Contrast matrix for specific linear combinations (default: NULL) |
A-optimality minimizes the average prediction variance. It is equivalent to minimizing the trace of (X'X)^(-1) where X is the design matrix.
Mathematical formula: A = tr((X'X)^(-1))
A-optimality value (trace of inverse covariance matrix)
Fedorov, V.V. (1972). Theory of Optimal Experiments. Academic Press.
# Load wheat genomic data data(WheatData) # Create a subset for faster computation set.seed(123) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:20] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define candidate and test sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute A-optimality (minimizes average prediction variance) aopt_value <- a_optimality(train_set, test_set, PC_subset) print(paste("A-optimality value:", round(aopt_value, 6))) # Compare different training set sizes small_train <- sample(candidates, 10) large_train <- sample(candidates, 30) aopt_small <- a_optimality(small_train, test_set, PC_subset) aopt_large <- a_optimality(large_train, test_set, PC_subset) print(paste("Small training (n=10):", round(aopt_small, 6))) print(paste("Large training (n=30):", round(aopt_large, 6))) print(paste("Improvement factor:", round(aopt_small / aopt_large, 2)))# Load wheat genomic data data(WheatData) # Create a subset for faster computation set.seed(123) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:20] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define candidate and test sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute A-optimality (minimizes average prediction variance) aopt_value <- a_optimality(train_set, test_set, PC_subset) print(paste("A-optimality value:", round(aopt_value, 6))) # Compare different training set sizes small_train <- sample(candidates, 10) large_train <- sample(candidates, 30) aopt_small <- a_optimality(small_train, test_set, PC_subset) aopt_large <- a_optimality(large_train, test_set, PC_subset) print(paste("Small training (n=10):", round(aopt_small, 6))) print(paste("Large training (n=30):", round(aopt_large, 6))) print(paste("Improvement factor:", round(aopt_small / aopt_large, 2)))
Adaptive mutation rate based on population diversity
adaptive_mutation_rate( base_mutprob, diversity, generation, max_generations, min_rate = 0.01, max_rate = 0.95 )adaptive_mutation_rate( base_mutprob, diversity, generation, max_generations, min_rate = 0.01, max_rate = 0.95 )
base_mutprob |
Base mutation probability |
diversity |
Current population diversity (0-1) |
generation |
Current generation number |
max_generations |
Maximum generations |
min_rate |
Minimum mutation rate (default: 0.01) |
max_rate |
Maximum mutation rate (default: 0.95) |
Adjusted mutation probability
Efficient block matrix multiplication
block_matrix_mult(X, Y, block_size = 500)block_matrix_mult(X, Y, block_size = 500)
X |
First matrix |
Y |
Second matrix |
block_size |
Block size for computation (default: 500) |
Matrix product X
Get cache statistics
cache_stats(cache_env = NULL)cache_stats(cache_env = NULL)
cache_env |
Cache environment |
List with cache statistics
Calculate crowding distance for diversity preservation
calculate_crowding_distance(fitness)calculate_crowding_distance(fitness)
fitness |
Fitness matrix for a single front |
Vector of crowding distances
Calculate population diversity
calculate_diversity(population)calculate_diversity(population)
population |
List of solutions in the population |
Diversity measure (0 to 1, higher = more diverse)
Computes the coefficient of determination as the proportion of variance explained by the model, using the hat matrix diagonal (leverage values).
cd_mean(train, test = NULL, P, lambda = 1e-06, C = NULL, normalized = FALSE)cd_mean(train, test = NULL, P, lambda = 1e-06, C = NULL, normalized = FALSE)
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and variables as columns |
lambda |
Ridge regularization parameter for numerical stability (default: 1e-6) |
C |
Contrast matrix for specific linear combinations (default: NULL) |
normalized |
Whether to normalize by trace for scale-invariance (default: FALSE) |
Computes R² as the average leverage from the hat matrix, which represents the proportion of variance explained by the model in experimental design.
Formula: R^2 = tr(H)/n where H = X(X'X)^(-1)X' is the hat matrix
This is the standard definition used when response data is not available.
Mean R² value (proportion of variance explained)
Fedorov, V.V. (1972). Theory of Optimal Experiments. Academic Press. Atkinson, A.C., Donev, A.N., Tobias, R.D. (2007). Optimum Experimental Designs.
Computes the coefficient of determination for mixed models, measuring the expected prediction accuracy of random effects (breeding values).
cd_mean_mm(train, test = NULL, P, K, lambda = 1e-06, C = NULL, Vg = 1, Ve = 1)cd_mean_mm(train, test = NULL, P, K, lambda = 1e-06, C = NULL, Vg = 1, Ve = 1)
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and markers as columns |
K |
Kinship/relationship matrix (symmetric, positive definite) |
lambda |
Ridge regularization parameter for numerical stability (default: 1e-6) |
C |
Contrast matrix for specific linear combinations (default: NULL) |
Vg |
Genetic variance component (default: 1) |
Ve |
Error variance component (default: 1) |
The Coefficient of Determination for BLUP of random effects is defined as: R^2 = 1 - PEV/sigma_u^2
This represents the proportion of genetic variance that can be predicted, or the reliability of breeding value prediction.
Mean R² value for mixed models (higher indicates better prediction accuracy)
Henderson, C.R. (1984). Applications of Linear Models in Animal Breeding.
# Load wheat genomic data data(WheatData) # Create a subset for mixed model demonstration set.seed(159) subset_indices <- sample(1:nrow(Wheat.M), 60) M_subset <- Wheat.M[subset_indices, 1:50] K_subset <- Wheat.K[subset_indices, subset_indices] # Define train and test sets all_individuals <- rownames(M_subset) test_set <- sample(all_individuals, 12) train_set <- sample(setdiff(all_individuals, test_set), 18) # Mixed model R² (reliability) with different heritability scenarios h2_values <- c(0.2, 0.5, 0.8) print("R² (Coefficient of Determination) - Different Heritabilities:") for (h2 in h2_values) { var_comps <- h2_to_variances(h2) cd_mm <- cd_mean_mm(train_set, test_set, M_subset, K_subset, Vg = var_comps$Vg, Ve = var_comps$Ve) print(paste("h² =", h2, "-> R² =", round(cd_mm, 4))) } # Compare training set sizes small_train <- sample(setdiff(all_individuals, test_set), 10) large_train <- sample(setdiff(all_individuals, test_set), 25) var_comps <- h2_to_variances(0.5) # Medium heritability cd_small <- cd_mean_mm(small_train, test_set, M_subset, K_subset, Vg = var_comps$Vg, Ve = var_comps$Ve) cd_large <- cd_mean_mm(large_train, test_set, M_subset, K_subset, Vg = var_comps$Vg, Ve = var_comps$Ve) print(paste("Small training set (n=10): R² =", round(cd_small, 4))) print(paste("Large training set (n=25): R² =", round(cd_large, 4))) print(paste("Improvement with larger set:", round(cd_large - cd_small, 4)))# Load wheat genomic data data(WheatData) # Create a subset for mixed model demonstration set.seed(159) subset_indices <- sample(1:nrow(Wheat.M), 60) M_subset <- Wheat.M[subset_indices, 1:50] K_subset <- Wheat.K[subset_indices, subset_indices] # Define train and test sets all_individuals <- rownames(M_subset) test_set <- sample(all_individuals, 12) train_set <- sample(setdiff(all_individuals, test_set), 18) # Mixed model R² (reliability) with different heritability scenarios h2_values <- c(0.2, 0.5, 0.8) print("R² (Coefficient of Determination) - Different Heritabilities:") for (h2 in h2_values) { var_comps <- h2_to_variances(h2) cd_mm <- cd_mean_mm(train_set, test_set, M_subset, K_subset, Vg = var_comps$Vg, Ve = var_comps$Ve) print(paste("h² =", h2, "-> R² =", round(cd_mm, 4))) } # Compare training set sizes small_train <- sample(setdiff(all_individuals, test_set), 10) large_train <- sample(setdiff(all_individuals, test_set), 25) var_comps <- h2_to_variances(0.5) # Medium heritability cd_small <- cd_mean_mm(small_train, test_set, M_subset, K_subset, Vg = var_comps$Vg, Ve = var_comps$Ve) cd_large <- cd_mean_mm(large_train, test_set, M_subset, K_subset, Vg = var_comps$Vg, Ve = var_comps$Ve) print(paste("Small training set (n=10): R² =", round(cd_small, 4))) print(paste("Large training set (n=25): R² =", round(cd_large, 4))) print(paste("Improvement with larger set:", round(cd_large - cd_small, 4)))
Coefficient of Determination for Mixed Models using heritability
cd_mean_mm_h2( train, test = NULL, P, K, h2, lambda = 1e-06, C = NULL, total_var = 1 )cd_mean_mm_h2( train, test = NULL, P, K, h2, lambda = 1e-06, C = NULL, total_var = 1 )
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and markers as columns |
K |
Kinship/relationship matrix (symmetric, positive definite) |
h2 |
Heritability (between 0 and 1) |
lambda |
Ridge regularization parameter (default: 1e-6) |
C |
Contrast matrix (default: NULL) |
total_var |
Total variance (default: 1) |
Mean R² value
Check convergence based on convergence history
check_convergence( convergence_history, current_generation, window_size = 20, tolerance = 1e-06 )check_convergence( convergence_history, current_generation, window_size = 20, tolerance = 1e-06 )
convergence_history |
Matrix of convergence metrics over generations |
current_generation |
Current generation number |
window_size |
Window size for checking convergence (default: 20) |
tolerance |
Convergence tolerance (default: 1e-6) |
TRUE if converged, FALSE otherwise
Check dominance between two solutions
check_dominance(solution1, solution2, criteria_types)check_dominance(solution1, solution2, criteria_types)
solution1 |
First solution fitness |
solution2 |
Second solution fitness |
criteria_types |
Optimization types |
1 if solution1 dominates, -1 if solution2 dominates, 0 if non-dominated
Check matrix dimensions for operations
check_matrix_dimensions(...)check_matrix_dimensions(...)
... |
Named matrices to check |
NULL if dimensions are compatible, throws error otherwise
Check memory usage and warn for large computations
check_memory_usage(n_individuals, n_variables, operation = "matrix_mult")check_memory_usage(n_individuals, n_variables, operation = "matrix_mult")
n_individuals |
Number of individuals |
n_variables |
Number of variables/markers |
operation |
Type of operation ("matrix_mult", "eigen", "svd", "chol") |
NULL, but issues warnings for large computations
Multi-criteria convergence detection
check_multi_criteria_convergence( convergence_history, generation, window_size, tolerance )check_multi_criteria_convergence( convergence_history, generation, window_size, tolerance )
convergence_history |
List containing convergence metrics history |
generation |
Current generation number |
window_size |
Window size for convergence check |
tolerance |
Convergence tolerance |
List with convergence status and reason
Check if parameter names follow conventions
check_parameter_naming(param_names)check_parameter_naming(param_names)
param_names |
Vector of parameter names |
List with standardized names and warnings
Memory-efficient crossprod computation
chunked_crossprod(X, Y = NULL, chunk_size = 1000)chunked_crossprod(X, Y = NULL, chunk_size = 1000)
X |
Matrix |
Y |
Optional second matrix (default: X) |
chunk_size |
Chunk size for computation (default: 1000) |
t(X)
Compute matrix operations in chunks for memory efficiency
chunked_matrix_operation(X, FUN, chunk_size = 1000, ...)chunked_matrix_operation(X, FUN, chunk_size = 1000, ...)
X |
Matrix to process |
FUN |
Function to apply to each chunk |
chunk_size |
Maximum chunk size (default: 1000) |
... |
Additional arguments to FUN |
Combined result from all chunks
Clear computation cache
clear_cache(cache_env = NULL)clear_cache(cache_env = NULL)
cache_env |
Cache environment to clear |
Matrix utilities for genetic relationship computations
compute_amatrix(M, pieces = 10, mc.cores = 1)compute_amatrix(M, pieces = 10, mc.cores = 1)
M |
Marker matrix coded as -1,0,1 (individuals x markers) |
pieces |
Number of pieces to split computation (default: 10) |
mc.cores |
Number of cores for parallel processing (default: 1) |
Combined genomic relationship matrix (A matrix)
Deniz Akdemir Compute A matrix in pieces for large marker datasets
Compute convergence metrics for multi-objective optimization
compute_convergence_metrics(fitness)compute_convergence_metrics(fitness)
fitness |
Archive fitness matrix |
Vector of convergence metrics
Compute generation statistics for multi-objective optimization
compute_generation_stats(population, fitness, archive_fitness)compute_generation_stats(population, fitness, archive_fitness)
population |
Current population |
fitness |
Current fitness matrix |
archive_fitness |
Archive fitness matrix |
List with generation statistics
Population statistics computation
compute_population_stats(population, fitness)compute_population_stats(population, fitness)
population |
List of solutions |
fitness |
Vector of fitness values |
List with population statistics
This function computes the core matrix operations used by all optimization criteria. It handles numerical stability and implements the correct PEV formula from literature.
compute_prediction_core( p_train, p_test = NULL, lambda = 1e-06, C = NULL, adaptive_ridge = TRUE )compute_prediction_core( p_train, p_test = NULL, lambda = 1e-06, C = NULL, adaptive_ridge = TRUE )
p_train |
Training prediction matrix (n_train x p) |
p_test |
Test prediction matrix (n_test x p). If NULL, uses training set |
lambda |
Ridge regularization parameter for numerical stability (default: 1e-6) |
C |
Contrast matrix for specific linear combinations (default: NULL) |
adaptive_ridge |
Whether to use adaptive ridge parameter (default: TRUE) |
List containing core matrices and computations
Get convergence history diagnostics from GA results
convergence_diagnostics(ga_result, plot = TRUE)convergence_diagnostics(ga_result, plot = TRUE)
ga_result |
Result from subset_ga function |
plot |
Whether to create diagnostic plots (default: TRUE) |
List with convergence diagnostics
# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(456) subset_indices <- sample(1:nrow(Wheat.M), 100) M_subset <- Wheat.M[subset_indices, 1:30] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define candidate and test sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 20) candidates <- setdiff(all_individuals, test_set) # Run genetic algorithm with convergence tracking ga_result <- subset_ga( P = PC_subset, Candidates = candidates, Test = test_set, ntoselect = 15, npop = 25, niterations = 40, criterion = "pev_mean", enable_restart = TRUE, convergence_window_multiplier = 4, verbose = FALSE ) # Generate convergence diagnostics diags <- convergence_diagnostics(ga_result, plot = FALSE) print("Convergence Diagnostics:") print(paste("Converged:", diags$convergence_status$converged)) print(paste("Reason:", diags$convergence_status$reason)) print(paste("Window size used:", diags$window_size)) print(paste("Data points analyzed:", diags$data_points)) print("Stability Metrics:") print(paste("Fitness stability:", round(diags$metrics$fitness_stability, 8))) print(paste("Diversity stability:", round(diags$metrics$diversity_stability, 6))) print(paste("Mean improvement rate:", round(diags$metrics$mean_improvement_rate, 8))) if (length(diags$recommendations) > 0) { print("Recommendations:") for (rec in diags$recommendations) { print(paste("-", rec)) } }# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(456) subset_indices <- sample(1:nrow(Wheat.M), 100) M_subset <- Wheat.M[subset_indices, 1:30] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define candidate and test sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 20) candidates <- setdiff(all_individuals, test_set) # Run genetic algorithm with convergence tracking ga_result <- subset_ga( P = PC_subset, Candidates = candidates, Test = test_set, ntoselect = 15, npop = 25, niterations = 40, criterion = "pev_mean", enable_restart = TRUE, convergence_window_multiplier = 4, verbose = FALSE ) # Generate convergence diagnostics diags <- convergence_diagnostics(ga_result, plot = FALSE) print("Convergence Diagnostics:") print(paste("Converged:", diags$convergence_status$converged)) print(paste("Reason:", diags$convergence_status$reason)) print(paste("Window size used:", diags$window_size)) print(paste("Data points analyzed:", diags$data_points)) print("Stability Metrics:") print(paste("Fitness stability:", round(diags$metrics$fitness_stability, 8))) print(paste("Diversity stability:", round(diags$metrics$diversity_stability, 6))) print(paste("Mean improvement rate:", round(diags$metrics$mean_improvement_rate, 8))) if (length(diags$recommendations) > 0) { print("Recommendations:") for (rec in diags$recommendations) { print(paste("-", rec)) } }
Create evaluation cache for storing computed fitness values
create_evaluation_cache()create_evaluation_cache()
Cache object with get and store methods
Create function alias for backward compatibility
create_function_alias(old_name, new_name, envir = .GlobalEnv)create_function_alias(old_name, new_name, envir = .GlobalEnv)
old_name |
Old function name |
new_name |
New standardized function name |
envir |
Environment to create alias in |
Functions to improve computational performance for large datasets
create_progress_bar(total, title = "Processing", verbose = TRUE)create_progress_bar(total, title = "Processing", verbose = TRUE)
total |
Total number of iterations |
title |
Title for the progress bar |
verbose |
Whether to show progress bar |
Progress bar object or NULL
Deniz Akdemir Add progress bar for long-running computations
A unified interface to all optimization criteria. This function automatically dispatches to the appropriate criterion based on the criterion name and parameters.
criterion( train, test = NULL, P, lambda = 1e-06, C = NULL, criterion = "pev_mean", K = NULL, Vg = NULL, Ve = NULL )criterion( train, test = NULL, P, lambda = 1e-06, C = NULL, criterion = "pev_mean", K = NULL, Vg = NULL, Ve = NULL )
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and variables as columns |
lambda |
Ridge regularization parameter for numerical stability (default: 1e-6) |
C |
Contrast matrix for specific linear combinations (default: NULL) |
criterion |
Character string specifying the criterion to compute |
K |
Kinship matrix for mixed models (required for mixed model criteria) |
Vg |
Genetic variance component for mixed models (default: 1 if K provided) |
Ve |
Error variance component for mixed models (default: 1 if K provided) |
Available criteria:
**Classical Optimality:** - "a_optimality": A-optimality (trace of inverse covariance) - "d_optimality": D-optimality (log determinant) - "e_optimality": E-optimality (minimum eigenvalue) - "g_optimality": G-optimality (maximum prediction variance) - "i_optimality": I-optimality (average prediction variance)
**Prediction-Based:** - "pev_mean": Mean prediction error variance - "pev_mean_normalized": Normalized mean PEV - "pev_max": Maximum prediction error variance - "pev_max_normalized": Normalized maximum PEV - "cd_mean": Coefficient of determination (R²) - "cd_mean_normalized": Normalized R²
**Mixed Models:** - "pev_mean_mm": Mean PEV for mixed models (requires K) - "cd_mean_mm": R² for mixed models (requires K)
**Legacy (backward compatibility):** - "AOPT", "DOPT", "EOPT": Classical optimality criteria - "PEVMEAN", "PEVMEAN2": PEV (normalized version) - "PEVMAX", "PEVMAX2": Maximum PEV (normalized version) - "CDMEAN", "CDMEAN2": R² (normalized version) - "PEVMEANMM", "CDMEANMM": Mixed model versions
Criterion value (interpretation depends on specific criterion)
Crossover, mutation, and selection operators for genetic algorithms
crossover( parent1, parent2, ntoselect, candidates, method = "uniform", bias = 0.5 )crossover( parent1, parent2, ntoselect, candidates, method = "uniform", bias = 0.5 )
parent1 |
First parent solution |
parent2 |
Second parent solution |
ntoselect |
Target size of offspring |
candidates |
Pool of candidate individuals |
method |
Crossover method: "uniform", "twopoint", "proportional" (default: "uniform") |
bias |
Bias toward parent1 (0.5 = equal, >0.5 favors parent1) (default: 0.5) |
Offspring solution
Deniz Akdemir Crossover operation for subset selection
Crowding replacement for diversity preservation
crowding_replacement( offspring, offspring_fitness, population, population_fitness, crowding_factor = 3 )crowding_replacement( offspring, offspring_fitness, population, population_fitness, crowding_factor = 3 )
offspring |
New offspring solutions |
offspring_fitness |
Fitness of offspring |
population |
Current population |
population_fitness |
Current population fitness |
crowding_factor |
Number of candidates to compare for replacement |
Updated population after crowding replacement
Computes the D-optimality criterion, which minimizes the determinant of the inverse covariance matrix. Lower values indicate better designs.
d_optimality(train, test = NULL, P, lambda = 1e-06, C = NULL)d_optimality(train, test = NULL, P, lambda = 1e-06, C = NULL)
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and variables as columns |
lambda |
Ridge regularization parameter for numerical stability (default: 1e-6) |
C |
Contrast matrix for specific linear combinations (default: NULL) |
D-optimality minimizes the generalized variance of parameter estimates. It is equivalent to maximizing the determinant of X'X where X is the design matrix.
Mathematical formula: D = log(det(X'X))
D-optimality value (negative log determinant)
Kiefer, J. (1959). Optimum experimental designs. Journal of the Royal Statistical Society B.
# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(456) subset_indices <- sample(1:nrow(Wheat.M), 60) M_subset <- Wheat.M[subset_indices, 1:15] # Extract principal components for D-optimality pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:4] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 10) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 15) # Compute D-optimality (minimizes generalized variance) dopt_value <- d_optimality(train_set, test_set, PC_subset) print(paste("D-optimality value:", round(dopt_value, 4))) # Compare with different training sets train_set2 <- sample(candidates, 15) dopt_value2 <- d_optimality(train_set2, test_set, PC_subset) print(paste("Training set 1 D-opt:", round(dopt_value, 4))) print(paste("Training set 2 D-opt:", round(dopt_value2, 4))) if (dopt_value < dopt_value2) { print("Training set 1 is better (lower D-optimality)") } else { print("Training set 2 is better (lower D-optimality)") }# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(456) subset_indices <- sample(1:nrow(Wheat.M), 60) M_subset <- Wheat.M[subset_indices, 1:15] # Extract principal components for D-optimality pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:4] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 10) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 15) # Compute D-optimality (minimizes generalized variance) dopt_value <- d_optimality(train_set, test_set, PC_subset) print(paste("D-optimality value:", round(dopt_value, 4))) # Compare with different training sets train_set2 <- sample(candidates, 15) dopt_value2 <- d_optimality(train_set2, test_set, PC_subset) print(paste("Training set 1 D-opt:", round(dopt_value, 4))) print(paste("Training set 2 D-opt:", round(dopt_value2, 4))) if (dopt_value < dopt_value2) { print("Training set 1 is better (lower D-optimality)") } else { print("Training set 2 is better (lower D-optimality)") }
Determine optimal batch size for evaluation
determine_optimal_batch_size(n_solutions, mc.cores)determine_optimal_batch_size(n_solutions, mc.cores)
n_solutions |
Number of solutions |
mc.cores |
Number of cores |
Optimal batch size
Determine primary matrix for optimization
determine_primary_matrix(Pcs, Dist, Kernel)determine_primary_matrix(Pcs, Dist, Kernel)
Pcs |
Principal components matrix |
Dist |
Distance matrix |
Kernel |
Kernel matrix |
Primary matrix to use for optimization
Unified distance criterion function
distance_criterion( train, test, distance_matrix, criterion = c("max_to_test", "mean_to_test", "neg_min_internal", "neg_mean_internal"), lambda = NULL, C = NULL )distance_criterion( train, test, distance_matrix, criterion = c("max_to_test", "mean_to_test", "neg_min_internal", "neg_mean_internal"), lambda = NULL, C = NULL )
train |
Training set individual names |
test |
Test set individual names |
distance_matrix |
Distance matrix |
criterion |
Type of distance criterion |
lambda |
Ridge parameter (for compatibility) |
C |
Contrast matrix (for compatibility) |
Distance criterion value
Negative mean distance within training set
distance_internal_mean( train, test = NULL, distance_matrix, lambda = NULL, C = NULL )distance_internal_mean( train, test = NULL, distance_matrix, lambda = NULL, C = NULL )
train |
Training set individual names |
test |
Test set individual names (unused, kept for compatibility) |
distance_matrix |
Distance matrix |
lambda |
Ridge parameter (unused, kept for compatibility) |
C |
Contrast matrix (unused, kept for compatibility) |
Negative mean internal distance
Negative minimum distance within training set
distance_internal_min( train, test = NULL, distance_matrix, lambda = NULL, C = NULL )distance_internal_min( train, test = NULL, distance_matrix, lambda = NULL, C = NULL )
train |
Training set individual names |
test |
Test set individual names (unused, kept for compatibility) |
distance_matrix |
Distance matrix |
lambda |
Ridge parameter (unused, kept for compatibility) |
C |
Contrast matrix (unused, kept for compatibility) |
Negative minimum internal distance
Distance-based criteria and utilities for subset selection
distance_to_ideal(X, method = "euclidean", handle_zeros = "warning")distance_to_ideal(X, method = "euclidean", handle_zeros = "warning")
X |
Matrix of solutions (rows) by objectives (columns) |
method |
Distance method: "euclidean", "manhattan", "chebyshev" (default: "euclidean") |
handle_zeros |
How to handle zero ranges: "warning", "error", "ignore" (default: "warning") |
Vector of distances to ideal point (0,0,...,0)
Deniz Akdemir Distance to ideal point calculation
Maximum distance from training to test set
distance_train_to_test_max( train, test, distance_matrix, lambda = NULL, C = NULL )distance_train_to_test_max( train, test, distance_matrix, lambda = NULL, C = NULL )
train |
Training set individual names |
test |
Test set individual names |
distance_matrix |
Distance matrix |
lambda |
Ridge parameter (unused, kept for compatibility) |
C |
Contrast matrix (unused, kept for compatibility) |
Maximum distance value
Mean distance from training to test set
distance_train_to_test_mean( train, test, distance_matrix, lambda = NULL, C = NULL )distance_train_to_test_mean( train, test, distance_matrix, lambda = NULL, C = NULL )
train |
Training set individual names |
test |
Test set individual names |
distance_matrix |
Distance matrix |
lambda |
Ridge parameter (unused, kept for compatibility) |
C |
Contrast matrix (unused, kept for compatibility) |
Mean distance value
Diversity-preserving selection for environmental selection
diversity_preserving_selection( population, fitness, target_size, criteria_types )diversity_preserving_selection( population, fitness, target_size, criteria_types )
population |
Combined population |
fitness |
Combined fitness matrix |
target_size |
Target population size |
criteria_types |
Optimization types |
Indices of selected solutions
Population diversity summary
diversity_summary(solutions, method = "jaccard")diversity_summary(solutions, method = "jaccard")
solutions |
List of solutions |
method |
Distance method for calculation |
List with diversity statistics
Generate consistent parameter documentation
document_parameter(param_name, param_type, default_value = NULL, description)document_parameter(param_name, param_type, default_value = NULL, description)
param_name |
Parameter name |
param_type |
Parameter type |
default_value |
Default value |
description |
Parameter description |
Formatted roxygen2 parameter documentation
Computes the E-optimality criterion, which minimizes the maximum eigenvalue of the inverse covariance matrix. Lower values indicate better designs.
e_optimality(train, test = NULL, P, lambda = 1e-06, C = NULL)e_optimality(train, test = NULL, P, lambda = 1e-06, C = NULL)
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and variables as columns |
lambda |
Ridge regularization parameter for numerical stability (default: 1e-6) |
C |
Contrast matrix for specific linear combinations (default: NULL) |
E-optimality minimizes the maximum variance of any linear combination of parameters. It is equivalent to maximizing the minimum eigenvalue of X'X where X is the design matrix.
Mathematical formula: E = log(λ_max((X'X)^(-1)))
E-optimality value (negative log of minimum eigenvalue)
Atkinson, A.C., Donev, A.N., Tobias, R.D. (2007). Optimum Experimental Designs. Oxford.
# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(321) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:20] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute E-optimality (minimizes maximum variance) eopt_value <- e_optimality(train_set, test_set, PC_subset) print(paste("E-optimality value:", round(eopt_value, 6))) # Compare with A and D optimality for same training set aopt_value <- a_optimality(train_set, test_set, PC_subset) dopt_value <- d_optimality(train_set, test_set, PC_subset) print(paste("A-optimality:", round(aopt_value, 6))) print(paste("D-optimality:", round(dopt_value, 6))) print(paste("E-optimality:", round(eopt_value, 6)))# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(321) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:20] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute E-optimality (minimizes maximum variance) eopt_value <- e_optimality(train_set, test_set, PC_subset) print(paste("E-optimality value:", round(eopt_value, 6))) # Compare with A and D optimality for same training set aopt_value <- a_optimality(train_set, test_set, PC_subset) dopt_value <- d_optimality(train_set, test_set, PC_subset) print(paste("A-optimality:", round(aopt_value, 6))) print(paste("D-optimality:", round(dopt_value, 6))) print(paste("E-optimality:", round(eopt_value, 6)))
Elite selection
elite_selection(population, fitness, n_select)elite_selection(population, fitness, n_select)
population |
List of solutions |
fitness |
Vector of fitness values |
n_select |
Number of elites to select |
Indices of selected elite individuals
Environmental selection for NSGA-II
environmental_selection(fitness, ranks, target_size)environmental_selection(fitness, ranks, target_size)
fitness |
Combined fitness matrix |
ranks |
Dominance ranks |
target_size |
Target population size |
Indices of selected solutions
Evaluate population for multiple objectives
evaluate_multiobjective_population( population, P, test, criteria, lambda, C, mc.cores )evaluate_multiobjective_population( population, P, test, criteria, lambda, C, mc.cores )
population |
List of solutions |
P |
Primary matrix |
test |
Test set |
criteria |
Vector of criteria names |
lambda |
Ridge parameter |
C |
Contrast matrix |
mc.cores |
Number of cores |
Matrix of fitness values (solutions x criteria)
Efficient population evaluation and fitness computation
evaluate_population( population, P, test = NULL, criterion = "PEVMEAN2", lambda = 1e-06, C = NULL, K = NULL, Vg = NULL, Ve = NULL, mc.cores = 1, use_cache = TRUE )evaluate_population( population, P, test = NULL, criterion = "PEVMEAN2", lambda = 1e-06, C = NULL, K = NULL, Vg = NULL, Ve = NULL, mc.cores = 1, use_cache = TRUE )
population |
List of candidate solutions (training sets) |
P |
Prediction matrix |
test |
Test set individual names (can be NULL) |
criterion |
Optimization criterion name (default: "PEVMEAN2") |
lambda |
Ridge regression parameter (default: 1e-6) |
C |
Contrast matrix (default: NULL) |
K |
Kinship matrix for mixed models (default: NULL) |
Vg |
Genetic variance matrix (default: NULL) |
Ve |
Error variance matrix (default: NULL) |
mc.cores |
Number of cores for parallel processing (default: 1) |
use_cache |
Whether to use computation cache (default: TRUE) |
Vector of fitness values for population
Deniz Akdemir Optimized population evaluation with smart caching
Smart population evaluation with adaptive batch processing
evaluate_population_smart( population, P, test = NULL, criterion = "PEVMEAN2", lambda = 1e-06, C = NULL, K = NULL, Vg = NULL, Ve = NULL, mc.cores = 1, batch_size = NULL )evaluate_population_smart( population, P, test = NULL, criterion = "PEVMEAN2", lambda = 1e-06, C = NULL, K = NULL, Vg = NULL, Ve = NULL, mc.cores = 1, batch_size = NULL )
population |
List of solutions |
P |
Prediction matrix |
test |
Test set names |
criterion |
Criterion name |
lambda |
Ridge parameter |
C |
Contrast matrix |
K |
Kinship matrix |
Vg |
Genetic variance |
Ve |
Error variance |
mc.cores |
Number of cores |
batch_size |
Batch size for processing (default: NULL for automatic) |
Vector of fitness values
Evaluate a batch of solutions efficiently
evaluate_solution_batch( solutions, P, test, criterion, lambda, C, K, Vg, Ve, mc.cores )evaluate_solution_batch( solutions, P, test, criterion, lambda, C, K, Vg, Ve, mc.cores )
solutions |
List of solutions in the batch |
P |
Prediction matrix |
test |
Test set |
criterion |
Criterion name |
lambda |
Ridge parameter |
C |
Contrast matrix |
K |
Kinship matrix |
Vg |
Genetic variance |
Ve |
Error variance |
mc.cores |
Number of cores |
Vector of fitness values
Extract final Pareto front
extract_pareto_front(archive, fitness, criteria_types)extract_pareto_front(archive, fitness, criteria_types)
archive |
Archive of solutions |
fitness |
Archive fitness matrix |
criteria_types |
Optimization types |
List with Pareto solutions and fitness
Find non-dominated solutions
find_non_dominated(fitness, criteria_types)find_non_dominated(fitness, criteria_types)
fitness |
Fitness matrix |
criteria_types |
Optimization types |
Indices of non-dominated solutions
Finish progress bar
finish_progress(pb)finish_progress(pb)
pb |
Progress bar object |
Fitness sharing for diversity preservation
fitness_sharing(fitness, population, sharing_radius = 0.1, alpha = 1)fitness_sharing(fitness, population, sharing_radius = 0.1, alpha = 1)
fitness |
Vector of fitness values |
population |
List of solutions |
sharing_radius |
Radius for fitness sharing (default: 0.1) |
alpha |
Sharing function exponent (default: 1) |
Adjusted fitness values
G-optimality criterion (minimizes maximum prediction variance)
g_optimality(train, test = NULL, P, lambda = 1e-06)g_optimality(train, test = NULL, P, lambda = 1e-06)
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and variables as columns |
lambda |
Ridge regularization parameter (default: 1e-6) |
G-optimality minimizes the maximum entry in the diagonal of the hat matrix, providing the best worst-case prediction precision.
G-optimality value
Kiefer, J. (1975). Construction and optimality of generalized Youden designs.
# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(753) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:20] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute G-optimality (minimizes maximum prediction variance) gopt_value <- g_optimality(train_set, test_set, PC_subset) print(paste("G-optimality value:", round(gopt_value, 6))) # Compare with other optimality criteria aopt <- a_optimality(train_set, test_set, PC_subset) dopt <- d_optimality(train_set, test_set, PC_subset) eopt <- e_optimality(train_set, test_set, PC_subset) print("Optimality Comparison:") print(paste("A-optimality (average variance):", round(aopt, 6))) print(paste("D-optimality (volume):", round(dopt, 6))) print(paste("E-optimality (max eigenvalue):", round(eopt, 6))) print(paste("G-optimality (max hat diagonal):", round(gopt_value, 6)))# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(753) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:20] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute G-optimality (minimizes maximum prediction variance) gopt_value <- g_optimality(train_set, test_set, PC_subset) print(paste("G-optimality value:", round(gopt_value, 6))) # Compare with other optimality criteria aopt <- a_optimality(train_set, test_set, PC_subset) dopt <- d_optimality(train_set, test_set, PC_subset) eopt <- e_optimality(train_set, test_set, PC_subset) print("Optimality Comparison:") print(paste("A-optimality (average variance):", round(aopt, 6))) print(paste("D-optimality (volume):", round(dopt, 6))) print(paste("E-optimality (max eigenvalue):", round(eopt, 6))) print(paste("G-optimality (max hat diagonal):", round(gopt_value, 6)))
Generate standard function documentation template
generate_doc_template(func_name, purpose, category = "General")generate_doc_template(func_name, purpose, category = "General")
func_name |
Function name |
purpose |
Function purpose |
category |
Function category |
Roxygen2 documentation template
Generate offspring for multi-objective optimization
generate_multiobjective_offspring( parents, candidates, ntoselect, n_offspring, mutprob, mutintensity, mc.cores )generate_multiobjective_offspring( parents, candidates, ntoselect, n_offspring, mutprob, mutintensity, mc.cores )
parents |
Parent solutions |
candidates |
Candidate pool |
ntoselect |
Solution size |
n_offspring |
Number of offspring to generate |
mutprob |
Mutation probability |
mutintensity |
Mutation intensity |
mc.cores |
Number of cores |
List of offspring solutions
Generate offspring from elite solutions
generate_offspring( elites, candidates, npop, mutprob, mc.cores = 1, mutintensity = 1, tabu_memory = NULL, ntoselect = NULL, crossover_method = "uniform", parallel_threshold = 50 )generate_offspring( elites, candidates, npop, mutprob, mc.cores = 1, mutintensity = 1, tabu_memory = NULL, ntoselect = NULL, crossover_method = "uniform", parallel_threshold = 50 )
elites |
List of elite solutions |
candidates |
Pool of candidate individuals |
npop |
Population size to generate |
mutprob |
Mutation probability |
mc.cores |
Number of cores for parallel processing (default: 1) |
mutintensity |
Mutation intensity (default: 1) |
tabu_memory |
Tabu memory for avoiding revisits (default: NULL) |
ntoselect |
Target size of solutions (default: NULL, inferred from elites) |
crossover_method |
Method for crossover operations (default: "uniform") |
parallel_threshold |
Minimum population size for parallel processing (default: 50) |
List of offspring solutions
Compute genomic relationship matrix using different methods
genomic_relationship_matrix(M, method = "vanraden", pieces = 10, mc.cores = 1)genomic_relationship_matrix(M, method = "vanraden", pieces = 10, mc.cores = 1)
M |
Marker matrix coded as -1,0,1 |
method |
Method for computation: "vanraden", "pieces", "standard" (default: "vanraden") |
pieces |
Number of pieces for piece-wise computation (default: 10) |
mc.cores |
Number of cores for parallel processing (default: 1) |
Genomic relationship matrix
Get adaptive ridge parameter based on condition number
get_adaptive_ridge( X, target_condition = 1e+08, min_lambda = 1e-10, max_lambda = 0.01 )get_adaptive_ridge( X, target_condition = 1e+08, min_lambda = 1e-10, max_lambda = 0.01 )
X |
Matrix to analyze |
target_condition |
Target condition number (default: 1e8) |
min_lambda |
Minimum lambda value (default: 1e-10) |
max_lambda |
Maximum lambda value (default: 1e-2) |
Adaptive ridge parameter
Get consistent function name variants
get_function_variants(base_name)get_function_variants(base_name)
base_name |
Base function name |
List with name variants for different contexts
Get optimal number of cores for computation
get_optimal_cores(data_size, operation_type = "mixed")get_optimal_cores(data_size, operation_type = "mixed")
data_size |
Approximate size of data |
operation_type |
Type of operation ("cpu", "memory", "mixed") |
Recommended number of cores
Group similar solutions for batch processing
group_similar_solutions(population, target_batch_size)group_similar_solutions(population, target_batch_size)
population |
List of solutions |
target_batch_size |
Target size for each batch |
List of solution groups
Literature-corrected optimization criteria for subset selection in experimental design
h2_to_variances(h2, total_var = 1)h2_to_variances(h2, total_var = 1)
h2 |
Heritability (proportion of variance due to genetics) |
total_var |
Total variance (default: 1) |
List with Vg (genetic variance) and Ve (error variance)
Deniz Akdemir
Fedorov, V.V. (1972). Theory of Optimal Experiments. Academic Press. Henderson, C.R. (1984). Applications of Linear Models in Animal Breeding. Kiefer, J. (1959). Optimum experimental designs. Journal of the Royal Statistical Society B. Searle, S.R., Casella, G., McCulloch, C.E. (1992). Variance Components. Wiley. Atkinson, A.C., Donev, A.N., Tobias, R.D. (2007). Optimum Experimental Designs. Oxford. Convert between variance components and heritability
# Convert heritability to variance components h2_values <- c(0.1, 0.3, 0.5, 0.7, 0.9) for (h2 in h2_values) { var_comps <- h2_to_variances(h2) print(paste("h² =", h2, "-> Vg =", var_comps$Vg, ", Ve =", var_comps$Ve)) } # Example with custom total variance var_comps_scaled <- h2_to_variances(0.6, total_var = 2.5) print(paste("h² = 0.6, total_var = 2.5 -> Vg =", var_comps_scaled$Vg, ", Ve =", var_comps_scaled$Ve))# Convert heritability to variance components h2_values <- c(0.1, 0.3, 0.5, 0.7, 0.9) for (h2 in h2_values) { var_comps <- h2_to_variances(h2) print(paste("h² =", h2, "-> Vg =", var_comps$Vg, ", Ve =", var_comps$Ve)) } # Example with custom total variance var_comps_scaled <- h2_to_variances(0.6, total_var = 2.5) print(paste("h² = 0.6, total_var = 2.5 -> Vg =", var_comps_scaled$Vg, ", Ve =", var_comps_scaled$Ve))
I-optimality criterion (minimizes average prediction variance)
i_optimality(train, test = NULL, P, lambda = 1e-06)i_optimality(train, test = NULL, P, lambda = 1e-06)
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and variables as columns |
lambda |
Ridge regularization parameter (default: 1e-6) |
I-optimality minimizes the average prediction variance over the design space.
I-optimality value
Fedorov, V.V. (1972). Theory of Optimal Experiments.
# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(951) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:20] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute I-optimality (average prediction variance over design space) iopt_value <- i_optimality(train_set, test_set, PC_subset) print(paste("I-optimality value:", round(iopt_value, 6))) # Compare I-optimality with A-optimality (both are average-based) aopt_value <- a_optimality(train_set, test_set, PC_subset) print(paste("A-optimality (parameter space):", round(aopt_value, 6))) print(paste("I-optimality (design space):", round(iopt_value, 6))) # Test different training set sizes small_train <- sample(candidates, 12) large_train <- sample(candidates, 28) iopt_small <- i_optimality(small_train, test_set, PC_subset) iopt_large <- i_optimality(large_train, test_set, PC_subset) print(paste("Small training (n=12):", round(iopt_small, 6))) print(paste("Large training (n=28):", round(iopt_large, 6))) print(paste("Improvement factor:", round(iopt_small / iopt_large, 2)))# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(951) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:20] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute I-optimality (average prediction variance over design space) iopt_value <- i_optimality(train_set, test_set, PC_subset) print(paste("I-optimality value:", round(iopt_value, 6))) # Compare I-optimality with A-optimality (both are average-based) aopt_value <- a_optimality(train_set, test_set, PC_subset) print(paste("A-optimality (parameter space):", round(aopt_value, 6))) print(paste("I-optimality (design space):", round(iopt_value, 6))) # Test different training set sizes small_train <- sample(candidates, 12) large_train <- sample(candidates, 28) iopt_small <- i_optimality(small_train, test_set, PC_subset) iopt_large <- i_optimality(large_train, test_set, PC_subset) print(paste("Small training (n=12):", round(iopt_small, 6))) print(paste("Large training (n=28):", round(iopt_large, 6))) print(paste("Improvement factor:", round(iopt_small / iopt_large, 2)))
Computes the leverage-based influence measure that was incorrectly labeled as "coefficient of determination" in the original implementation.
influence_measure_legacy( train, test = NULL, P, lambda = 1e-06, C = NULL, normalized = FALSE )influence_measure_legacy( train, test = NULL, P, lambda = 1e-06, C = NULL, normalized = FALSE )
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and variables as columns |
lambda |
Ridge regularization parameter (default: 1e-6) |
C |
Contrast matrix (default: NULL) |
normalized |
Whether to normalize (default: FALSE) |
This function computes what the original cd_mean actually calculated: Influence = (leverage_i / (1 - leverage_i)²) × p
This is related to Cook's distance without the residual component. For proper R², use cd_mean() which computes average leverage.
Mean influence measure value
Initialize random population
initialize_population(candidates, ntoselect, npop)initialize_population(candidates, ntoselect, npop)
candidates |
Candidate individuals |
ntoselect |
Number to select per solution |
npop |
Population size |
List of initial solutions
Deprecated or compatibility wrappers retained for older STPGA code.
New code should use the corresponding modern functions documented elsewhere,
including compute_amatrix(), population_distances(),
distance_train_to_test_max(), distance_train_to_test_mean(),
distance_internal_min(), distance_internal_mean(),
distance_criterion(), evaluate_population(),
subset_ga_multiobjective(), and
subset_ga_multiobjective_single().
Amat.pieces(M, pieces = 10, mc.cores = 1) calculate_population_distances(solutions, method = c("hamming", "jaccard", "sorensen")) dist_to_test(train, test, Dst, lambda = NULL, C = NULL) dist_to_test2(train, test, Dst, lambda = NULL, C = NULL) disttoideal(X, method = "euclidean", handle_zeros = "warning") evaluate_population_legacy( population, errorstat, Test, P, K = NULL, lambda = 1e-06, C = NULL, mc.cores = 1, Vg = NULL, Ve = NULL ) evaluate_population_optimized( population, P, Test, errorstat = "PEVMEAN2", lambda = 1e-06, C = NULL, mc.cores = 1, K = NULL, Vg = NULL, Ve = NULL ) GenAlgForSubsetSelectionMO(...) GenAlgForSubsetSelectionMONoTest(...) neg_dist_in_train(train, test = NULL, Dst, lambda = NULL, C = NULL) neg_dist_in_train2(train, test = NULL, Dst, lambda = NULL, C = NULL) unified_distance_criterion( train, test, Dst, criterion = c("max_to_test", "mean_to_test", "neg_min_internal", "neg_mean_internal"), lambda = NULL, C = NULL )Amat.pieces(M, pieces = 10, mc.cores = 1) calculate_population_distances(solutions, method = c("hamming", "jaccard", "sorensen")) dist_to_test(train, test, Dst, lambda = NULL, C = NULL) dist_to_test2(train, test, Dst, lambda = NULL, C = NULL) disttoideal(X, method = "euclidean", handle_zeros = "warning") evaluate_population_legacy( population, errorstat, Test, P, K = NULL, lambda = 1e-06, C = NULL, mc.cores = 1, Vg = NULL, Ve = NULL ) evaluate_population_optimized( population, P, Test, errorstat = "PEVMEAN2", lambda = 1e-06, C = NULL, mc.cores = 1, K = NULL, Vg = NULL, Ve = NULL ) GenAlgForSubsetSelectionMO(...) GenAlgForSubsetSelectionMONoTest(...) neg_dist_in_train(train, test = NULL, Dst, lambda = NULL, C = NULL) neg_dist_in_train2(train, test = NULL, Dst, lambda = NULL, C = NULL) unified_distance_criterion( train, test, Dst, criterion = c("max_to_test", "mean_to_test", "neg_min_internal", "neg_mean_internal"), lambda = NULL, C = NULL )
M, P, X
|
Input matrix. |
pieces |
Number of marker chunks for relationship matrix computation. |
mc.cores |
Number of parallel worker cores. |
solutions, population
|
Candidate solutions to evaluate. |
method |
Distance method. |
train, test, Test
|
Training and test set identifiers. |
Dst |
Distance matrix. |
lambda |
Ridge parameter. |
C |
Optional contrast matrix. |
handle_zeros |
How zero ranges are handled. |
errorstat |
Legacy criterion name. |
K |
Optional kinship matrix. |
Vg, Ve
|
Genetic and residual variance components. |
criterion |
Distance criterion name. |
... |
Arguments passed to the corresponding modern implementation. |
The value returned by the corresponding modern implementation.
Compute condition number and numerical stability metrics
matrix_stability_check(X)matrix_stability_check(X)
X |
Matrix to analyze |
List with stability metrics
Multi-objective selection based on dominance ranks
multiobjective_selection(fitness, ranks, n_select)multiobjective_selection(fitness, ranks, n_select)
fitness |
Fitness matrix |
ranks |
Dominance ranks |
n_select |
Number of solutions to select |
Indices of selected solutions
Mutate a solution
mutate_solution(solution, candidates, intensity, ntoselect)mutate_solution(solution, candidates, intensity, ntoselect)
solution |
Current solution to mutate |
candidates |
Pool of candidates |
intensity |
Mutation intensity |
ntoselect |
Target solution size |
Mutated solution
Non-dominated sorting for multi-objective optimization
non_dominated_sorting(fitness, criteria_types)non_dominated_sorting(fitness, criteria_types)
fitness |
Fitness matrix |
criteria_types |
Vector indicating minimize (TRUE) or maximize (FALSE) |
Vector of dominance ranks
Parallel matrix operations with proper error handling
parallel_apply(X, FUN, mc.cores = parallel::detectCores() - 1, ...)parallel_apply(X, FUN, mc.cores = parallel::detectCores() - 1, ...)
X |
List of matrices or data to process |
FUN |
Function to apply |
mc.cores |
Number of cores to use |
... |
Additional arguments to FUN |
List of results
Perform genetic algorithm restart with population diversification
perform_restart( current_population, best_solution, candidates, ntoselect, npop, generation )perform_restart( current_population, best_solution, candidates, ntoselect, npop, generation )
current_population |
Current population |
best_solution |
Current best solution |
candidates |
Candidate pool |
ntoselect |
Number to select per solution |
npop |
Population size |
generation |
Current generation (for restart strategies) |
List with new diversified population
Maximum Prediction Error Variance (PEV)
pev_max(train, test = NULL, P, lambda = 1e-06, C = NULL, normalized = FALSE)pev_max(train, test = NULL, P, lambda = 1e-06, C = NULL, normalized = FALSE)
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and variables as columns |
lambda |
Ridge regularization parameter for numerical stability (default: 1e-6) |
C |
Contrast matrix for specific linear combinations (default: NULL) |
normalized |
Whether to normalize by trace for scale-invariance (default: FALSE) |
Maximum PEV value (lower is better)
# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(654) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:25] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute maximum PEV (worst-case prediction uncertainty) pev_max_value <- pev_max(train_set, test_set, PC_subset) pev_mean_value <- pev_mean(train_set, test_set, PC_subset) print(paste("Maximum PEV:", round(pev_max_value, 6))) print(paste("Mean PEV:", round(pev_mean_value, 6))) print(paste("Max/Mean ratio:", round(pev_max_value / pev_mean_value, 2))) # Compare normalized versions pev_max_norm <- pev_max(train_set, test_set, PC_subset, normalized = TRUE) pev_mean_norm <- pev_mean(train_set, test_set, PC_subset, normalized = TRUE) print(paste("Normalized Max PEV:", round(pev_max_norm, 6))) print(paste("Normalized Mean PEV:", round(pev_mean_norm, 6)))# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(654) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:25] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute maximum PEV (worst-case prediction uncertainty) pev_max_value <- pev_max(train_set, test_set, PC_subset) pev_mean_value <- pev_mean(train_set, test_set, PC_subset) print(paste("Maximum PEV:", round(pev_max_value, 6))) print(paste("Mean PEV:", round(pev_mean_value, 6))) print(paste("Max/Mean ratio:", round(pev_max_value / pev_mean_value, 2))) # Compare normalized versions pev_max_norm <- pev_max(train_set, test_set, PC_subset, normalized = TRUE) pev_mean_norm <- pev_mean(train_set, test_set, PC_subset, normalized = TRUE) print(paste("Normalized Max PEV:", round(pev_max_norm, 6))) print(paste("Normalized Mean PEV:", round(pev_mean_norm, 6)))
Computes the mean prediction error variance for the test set given a training set. This measures the average uncertainty in predictions.
pev_mean(train, test = NULL, P, lambda = 1e-06, C = NULL, normalized = FALSE)pev_mean(train, test = NULL, P, lambda = 1e-06, C = NULL, normalized = FALSE)
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and variables as columns |
lambda |
Ridge regularization parameter for numerical stability (default: 1e-6) |
C |
Contrast matrix for specific linear combinations (default: NULL) |
normalized |
Whether to normalize by trace for scale-invariance (default: FALSE) |
Corrected prediction error variance formula: PEV = Var(y - ŷ) = σ²[I + X_test(X_train'X_train)^(-1)X_test']
The identity matrix accounts for the inherent variability of new observations. This is the correct formula according to experimental design literature.
Mean PEV value (lower is better)
# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(789) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:25] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute mean PEV (prediction accuracy measure) pev_value <- pev_mean(train_set, test_set, PC_subset) print(paste("Mean PEV:", round(pev_value, 6))) # Compare normalized vs non-normalized pev_norm <- pev_mean(train_set, test_set, PC_subset, normalized = TRUE) print(paste("Normalized PEV:", round(pev_norm, 6))) # Effect of training set size on PEV small_train <- sample(candidates, 10) large_train <- sample(candidates, 30) pev_small <- pev_mean(small_train, test_set, PC_subset) pev_large <- pev_mean(large_train, test_set, PC_subset) print(paste("Small training PEV:", round(pev_small, 6))) print(paste("Large training PEV:", round(pev_large, 6))) print(paste("PEV reduction factor:", round(pev_small / pev_large, 2)))# Load wheat genomic data data(WheatData) # Create a subset for demonstration set.seed(789) subset_indices <- sample(1:nrow(Wheat.M), 80) M_subset <- Wheat.M[subset_indices, 1:25] # Extract principal components pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 15) candidates <- setdiff(all_individuals, test_set) train_set <- sample(candidates, 20) # Compute mean PEV (prediction accuracy measure) pev_value <- pev_mean(train_set, test_set, PC_subset) print(paste("Mean PEV:", round(pev_value, 6))) # Compare normalized vs non-normalized pev_norm <- pev_mean(train_set, test_set, PC_subset, normalized = TRUE) print(paste("Normalized PEV:", round(pev_norm, 6))) # Effect of training set size on PEV small_train <- sample(candidates, 10) large_train <- sample(candidates, 30) pev_small <- pev_mean(small_train, test_set, PC_subset) pev_large <- pev_mean(large_train, test_set, PC_subset) print(paste("Small training PEV:", round(pev_small, 6))) print(paste("Large training PEV:", round(pev_large, 6))) print(paste("PEV reduction factor:", round(pev_small / pev_large, 2)))
Computes the mean prediction error variance for mixed models accounting for genetic relationships through a kinship matrix. Uses correct BLUP theory.
pev_mean_mm(train, test = NULL, P, K, lambda = 1e-06, C = NULL, Vg = 1, Ve = 1)pev_mean_mm(train, test = NULL, P, K, lambda = 1e-06, C = NULL, Vg = 1, Ve = 1)
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and markers as columns |
K |
Kinship/relationship matrix (symmetric, positive definite) |
lambda |
Ridge regularization parameter for numerical stability (default: 1e-6) |
C |
Contrast matrix for specific linear combinations (default: NULL) |
Vg |
Genetic variance component (default: 1) |
Ve |
Error variance component (default: 1) |
Implements the correct BLUP prediction error variance formula from Henderson (1984): PEV = sigma_u^2[G_22 - G_21 V_11^(-1) G_12 - G_21 V_11^(-1) X_1 (X_1' V_11^(-1) X_1)^(-1) X_1' V_11^(-1) G_12]
where V_11 = sigma_u^2 G_11 + sigma_e^2 I and both correction terms are SUBTRACTED.
Mean PEV value for mixed models (lower is better)
Henderson, C.R. (1984). Applications of Linear Models in Animal Breeding. Searle, S.R., Casella, G., McCulloch, C.E. (1992). Variance Components.
# Load wheat genomic data data(WheatData) # Create a subset for mixed model demonstration set.seed(987) subset_indices <- sample(1:nrow(Wheat.M), 60) M_subset <- Wheat.M[subset_indices, 1:50] K_subset <- Wheat.K[subset_indices, subset_indices] # Define train and test sets all_individuals <- rownames(M_subset) test_set <- sample(all_individuals, 12) train_set <- sample(setdiff(all_individuals, test_set), 18) # Mixed model PEV with different heritability scenarios h2_values <- c(0.2, 0.5, 0.8) print("PEV Mean MM - Different Heritabilities:") for (h2 in h2_values) { var_comps <- h2_to_variances(h2) pev_mm <- pev_mean_mm(train_set, test_set, M_subset, K_subset, Vg = var_comps$Vg, Ve = var_comps$Ve) print(paste("h² =", h2, "-> PEV =", round(pev_mm, 6))) } # Compare with non-mixed model PEV using PCA pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) pev_regular <- pev_mean(train_set, test_set, PC_subset) print(paste("Regular PEV (PCA):", round(pev_regular, 6))) print("Mixed model accounts for genetic relationships via kinship matrix")# Load wheat genomic data data(WheatData) # Create a subset for mixed model demonstration set.seed(987) subset_indices <- sample(1:nrow(Wheat.M), 60) M_subset <- Wheat.M[subset_indices, 1:50] K_subset <- Wheat.K[subset_indices, subset_indices] # Define train and test sets all_individuals <- rownames(M_subset) test_set <- sample(all_individuals, 12) train_set <- sample(setdiff(all_individuals, test_set), 18) # Mixed model PEV with different heritability scenarios h2_values <- c(0.2, 0.5, 0.8) print("PEV Mean MM - Different Heritabilities:") for (h2 in h2_values) { var_comps <- h2_to_variances(h2) pev_mm <- pev_mean_mm(train_set, test_set, M_subset, K_subset, Vg = var_comps$Vg, Ve = var_comps$Ve) print(paste("h² =", h2, "-> PEV =", round(pev_mm, 6))) } # Compare with non-mixed model PEV using PCA pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) pev_regular <- pev_mean(train_set, test_set, PC_subset) print(paste("Regular PEV (PCA):", round(pev_regular, 6))) print("Mixed model accounts for genetic relationships via kinship matrix")
Mean Prediction Error Variance for Mixed Models using heritability
pev_mean_mm_h2( train, test = NULL, P, K, h2, lambda = 1e-06, C = NULL, total_var = 1, normalized = FALSE )pev_mean_mm_h2( train, test = NULL, P, K, h2, lambda = 1e-06, C = NULL, total_var = 1, normalized = FALSE )
train |
Vector of training set individual names |
test |
Vector of test set individual names (if NULL, uses training set) |
P |
Prediction matrix with individuals as rows and markers as columns |
K |
Kinship/relationship matrix (symmetric, positive definite) |
h2 |
Heritability (between 0 and 1) |
lambda |
Ridge regularization parameter (default: 1e-6) |
C |
Contrast matrix (default: NULL) |
total_var |
Total variance (default: 1) |
normalized |
If TRUE, return PEV as proportion of total variance |
Mean PEV value
Plot Pareto front progress
plot_pareto_progress(fitness, criteria, directions, generation, axes_labels)plot_pareto_progress(fitness, criteria, directions, generation, axes_labels)
fitness |
Fitness matrix |
criteria |
Criteria names |
directions |
Plot directions |
generation |
Current generation |
axes_labels |
Axis labels |
Calculate population diversity metrics
population_distances(solutions, method = c("hamming", "jaccard", "sorensen"))population_distances(solutions, method = c("hamming", "jaccard", "sorensen"))
solutions |
List of solutions in the population |
method |
Distance calculation method: "hamming", "jaccard", "sorensen" (default: "jaccard") |
Distance matrix or diversity summary statistics
Rank-based selection with selection pressure control
rank_selection( population, fitness, n_select, selection_pressure = 1.5, method = "linear" )rank_selection( population, fitness, n_select, selection_pressure = 1.5, method = "linear" )
population |
List of solutions |
fitness |
Vector of fitness values |
n_select |
Number of individuals to select |
selection_pressure |
Selection pressure parameter (1.0-2.0, default: 1.5) |
method |
Rank selection method: "linear", "exponential" (default: "linear") |
Indices of selected individuals
# Create a sample population and fitness values set.seed(123) population <- list( c("ind1", "ind2", "ind3"), c("ind2", "ind4", "ind5"), c("ind1", "ind3", "ind6"), c("ind4", "ind5", "ind6"), c("ind1", "ind4", "ind7") ) # Fitness values (lower is better for minimization) fitness <- c(10.5, 8.2, 12.1, 9.8, 7.5) # Rank-based selection with moderate pressure selected_indices <- rank_selection(population, fitness, n_select = 3, selection_pressure = 1.4, method = "linear") print("Selected indices:") print(selected_indices) print("Corresponding fitness values:") print(fitness[selected_indices]) # Compare different selection pressures low_pressure <- rank_selection(population, fitness, 3, 1.1, "linear") high_pressure <- rank_selection(population, fitness, 3, 1.8, "linear") print("Low pressure selection:") print(fitness[low_pressure]) print("High pressure selection:") print(fitness[high_pressure]) # Exponential ranking exp_selection <- rank_selection(population, fitness, 3, 1.3, "exponential") print("Exponential ranking selection:") print(fitness[exp_selection])# Create a sample population and fitness values set.seed(123) population <- list( c("ind1", "ind2", "ind3"), c("ind2", "ind4", "ind5"), c("ind1", "ind3", "ind6"), c("ind4", "ind5", "ind6"), c("ind1", "ind4", "ind7") ) # Fitness values (lower is better for minimization) fitness <- c(10.5, 8.2, 12.1, 9.8, 7.5) # Rank-based selection with moderate pressure selected_indices <- rank_selection(population, fitness, n_select = 3, selection_pressure = 1.4, method = "linear") print("Selected indices:") print(selected_indices) print("Corresponding fitness values:") print(fitness[selected_indices]) # Compare different selection pressures low_pressure <- rank_selection(population, fitness, 3, 1.1, "linear") high_pressure <- rank_selection(population, fitness, 3, 1.8, "linear") print("Low pressure selection:") print(fitness[low_pressure]) print("High pressure selection:") print(fitness[high_pressure]) # Exponential ranking exp_selection <- rank_selection(population, fitness, 3, 1.3, "exponential") print("Exponential ranking selection:") print(fitness[exp_selection])
Ridge regression with optimal lambda selection
ridge_regression_cv(y, X, lambda_seq = NULL, cv_folds = 5, criterion = "mse")ridge_regression_cv(y, X, lambda_seq = NULL, cv_folds = 5, criterion = "mse")
y |
Response vector |
X |
Design matrix |
lambda_seq |
Sequence of lambda values to test (default: NULL for automatic) |
cv_folds |
Number of cross-validation folds (default: 5) |
criterion |
Selection criterion: "mse", "aic", "bic" (default: "mse") |
List with optimal lambda and fitted model
Efficient matrix inversion with numerical stability
safe_matrix_inverse(X, lambda = 0, method = "cholesky")safe_matrix_inverse(X, lambda = 0, method = "cholesky")
X |
Matrix to invert |
lambda |
Ridge parameter for regularization (default: 0) |
method |
Inversion method: "cholesky", "svd", "eigen" (default: "cholesky") |
Inverse matrix
Select diverse subset using crowding distance
select_diverse_subset(fitness, max_size)select_diverse_subset(fitness, max_size)
fitness |
Fitness matrix |
max_size |
Maximum number of solutions to select |
Indices of selected solutions
Simple hash-based cache for computation results
simple_cache(key, value = NULL, cache_env = NULL)simple_cache(key, value = NULL, cache_env = NULL)
key |
Cache key (will be hashed) |
value |
Value to cache (only if not retrieving) |
cache_env |
Environment to store cache (default: private environment) |
Cached value if exists, NULL otherwise
Functions to ensure consistent naming throughout the package
standardize_name(name)standardize_name(name)
name |
Function or variable name |
Standardized snake_case name
Deniz Akdemir Convert function names to snake_case
Standardize common parameter names across functions
standardize_parameters(...)standardize_parameters(...)
... |
Named parameters to standardize |
List with standardized parameter names and values
Clean, well-organized genetic algorithm functions for subset selection
subset_ga( P, Candidates, Test = NULL, ntoselect, npop = 100, nelite = 5, keepbest = TRUE, tabu = TRUE, tabumemsize = 1, mutprob = 0.8, mutintensity = 1, niterations = 500, minitbefstop = 200, niterreg = 5, convergence_window_multiplier = 4, enable_restart = TRUE, restart_threshold = 0.5, max_restarts = 2, lambda = 1e-06, plotiters = FALSE, plottype = 1, criterion = "PEVMEAN2", C = NULL, mc.cores = 1, InitPop = NULL, tolconv = 1e-07, Vg = NULL, Ve = NULL, Fedorov = FALSE, adaptive_mutation = TRUE, selection_method = "tournament", tournament_size = 3, selection_pressure = 1.5, diversity_preservation = TRUE, diversity_method = "crowding", crowding_factor = 3, sharing_radius = 0.1, crossover_method = "adaptive", diversity_target = 0.3, verbose = FALSE )subset_ga( P, Candidates, Test = NULL, ntoselect, npop = 100, nelite = 5, keepbest = TRUE, tabu = TRUE, tabumemsize = 1, mutprob = 0.8, mutintensity = 1, niterations = 500, minitbefstop = 200, niterreg = 5, convergence_window_multiplier = 4, enable_restart = TRUE, restart_threshold = 0.5, max_restarts = 2, lambda = 1e-06, plotiters = FALSE, plottype = 1, criterion = "PEVMEAN2", C = NULL, mc.cores = 1, InitPop = NULL, tolconv = 1e-07, Vg = NULL, Ve = NULL, Fedorov = FALSE, adaptive_mutation = TRUE, selection_method = "tournament", tournament_size = 3, selection_pressure = 1.5, diversity_preservation = TRUE, diversity_method = "crowding", crowding_factor = 3, sharing_radius = 0.1, crossover_method = "adaptive", diversity_target = 0.3, verbose = FALSE )
P |
Prediction matrix (individuals x markers/predictors) |
Candidates |
Vector of candidate individual names |
Test |
Vector of test individual names |
ntoselect |
Number of individuals to select for training |
npop |
Population size (default: 100) |
nelite |
Number of elite individuals to keep (default: 5) |
keepbest |
Whether to keep best solution across generations (default: TRUE) |
tabu |
Whether to use tabu search (default: TRUE) |
tabumemsize |
Tabu memory size (default: 1) |
mutprob |
Mutation probability (default: 0.8) |
mutintensity |
Mutation intensity (default: 1) |
niterations |
Maximum iterations (default: 500) |
minitbefstop |
Minimum iterations before stopping (default: 200) |
niterreg |
Iterations for convergence check (default: 5) |
convergence_window_multiplier |
Multiplier for convergence window size (default: 4) |
enable_restart |
Whether to enable restart mechanism for premature convergence (default: TRUE) |
restart_threshold |
Threshold for triggering restart (default: 0.5) |
max_restarts |
Maximum number of restarts allowed (default: 2) |
lambda |
Ridge regression parameter (default: 1e-6) |
plotiters |
Whether to plot during iterations (default: FALSE) |
plottype |
Type of plot (default: 1) |
criterion |
Optimization criterion (default: "PEVMEAN2") |
C |
Contrast matrix (default: NULL) |
mc.cores |
Number of cores for parallel processing (default: 1) |
InitPop |
Initial population (default: NULL) |
tolconv |
Convergence tolerance (default: 1e-7) |
Vg |
Genetic variance matrix for mixed models (default: NULL) |
Ve |
Error variance matrix for mixed models (default: NULL) |
Fedorov |
Whether to use Fedorov exchange (default: FALSE) |
adaptive_mutation |
Whether to use adaptive mutation rates (default: TRUE) |
selection_method |
Selection method: "tournament", "elite", "rank", "hybrid" (default: "tournament") |
tournament_size |
Tournament size for tournament selection (default: 3) |
selection_pressure |
Selection pressure for rank-based selection (default: 1.5) |
diversity_preservation |
Whether to preserve diversity (default: TRUE) |
diversity_method |
Diversity method: "crowding", "sharing", "both" (default: "crowding") |
crowding_factor |
Factor for crowding replacement (default: 3) |
sharing_radius |
Radius for fitness sharing (default: 0.1) |
crossover_method |
Crossover method (default: "adaptive") |
diversity_target |
Target diversity level (default: 0.3) |
verbose |
Whether to print verbose output (default: FALSE) |
List containing best solution, fitness history, and statistics
Deniz Akdemir Primary genetic algorithm for subset selection with training/test split
# Load wheat genomic data data(WheatData) # Create a manageable subset for demonstration set.seed(123) subset_indices <- sample(1:nrow(Wheat.M), 100) M_subset <- Wheat.M[subset_indices, 1:30] # Extract principal components for genetic algorithm pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define candidate and test sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 20) candidates <- setdiff(all_individuals, test_set) # Basic genetic algorithm run ga_result <- subset_ga( P = PC_subset, Candidates = candidates, Test = test_set, ntoselect = 15, npop = 20, niterations = 30, criterion = "pev_mean", verbose = FALSE ) print(paste("Best fitness:", round(ga_result$best_fitness, 6))) print(paste("Convergence generation:", ga_result$convergence_generation)) print(paste("Selected individuals:", length(ga_result$best_solution))) # Enhanced genetic algorithm with rank-based selection and restart ga_enhanced <- subset_ga( P = PC_subset, Candidates = candidates, Test = test_set, ntoselect = 15, npop = 20, niterations = 30, criterion = "pev_mean", selection_method = "rank", selection_pressure = 1.4, enable_restart = TRUE, restart_threshold = 0.6, max_restarts = 1, verbose = FALSE ) print(paste("Enhanced GA fitness:", round(ga_enhanced$best_fitness, 6))) print(paste("Restart count:", ga_enhanced$restart_count)) # Generate convergence diagnostics diags <- convergence_diagnostics(ga_enhanced, plot = FALSE) print(paste("Converged:", diags$convergence_status$converged)) print(paste("Convergence reason:", diags$convergence_status$reason))# Load wheat genomic data data(WheatData) # Create a manageable subset for demonstration set.seed(123) subset_indices <- sample(1:nrow(Wheat.M), 100) M_subset <- Wheat.M[subset_indices, 1:30] # Extract principal components for genetic algorithm pca_result <- prcomp(M_subset, center = TRUE, scale. = TRUE) PC_subset <- pca_result$x[, 1:5] rownames(PC_subset) <- rownames(M_subset) # Define candidate and test sets all_individuals <- rownames(PC_subset) test_set <- sample(all_individuals, 20) candidates <- setdiff(all_individuals, test_set) # Basic genetic algorithm run ga_result <- subset_ga( P = PC_subset, Candidates = candidates, Test = test_set, ntoselect = 15, npop = 20, niterations = 30, criterion = "pev_mean", verbose = FALSE ) print(paste("Best fitness:", round(ga_result$best_fitness, 6))) print(paste("Convergence generation:", ga_result$convergence_generation)) print(paste("Selected individuals:", length(ga_result$best_solution))) # Enhanced genetic algorithm with rank-based selection and restart ga_enhanced <- subset_ga( P = PC_subset, Candidates = candidates, Test = test_set, ntoselect = 15, npop = 20, niterations = 30, criterion = "pev_mean", selection_method = "rank", selection_pressure = 1.4, enable_restart = TRUE, restart_threshold = 0.6, max_restarts = 1, verbose = FALSE ) print(paste("Enhanced GA fitness:", round(ga_enhanced$best_fitness, 6))) print(paste("Restart count:", ga_enhanced$restart_count)) # Generate convergence diagnostics diags <- convergence_diagnostics(ga_enhanced, plot = FALSE) print(paste("Converged:", diags$convergence_status$converged)) print(paste("Convergence reason:", diags$convergence_status$reason))
Multi-objective optimization for subset selection
subset_ga_multiobjective( Pcs = NULL, Dist = NULL, Kernel = NULL, candidates, test, ntoselect, criteria, criteria_types, plot_directions, npop = 100, mutprob = 0.8, mutintensity = 1, niterations = 500, lambda = 1e-06, plot_iterations = FALSE, mc.cores = 1, initial_population = NULL, C = NULL, axes_labels = NULL, archive_size = 200, diversity_maintenance = TRUE )subset_ga_multiobjective( Pcs = NULL, Dist = NULL, Kernel = NULL, candidates, test, ntoselect, criteria, criteria_types, plot_directions, npop = 100, mutprob = 0.8, mutintensity = 1, niterations = 500, lambda = 1e-06, plot_iterations = FALSE, mc.cores = 1, initial_population = NULL, C = NULL, axes_labels = NULL, archive_size = 200, diversity_maintenance = TRUE )
Pcs |
Principal components matrix (can be NULL) |
Dist |
Distance matrix (can be NULL) |
Kernel |
Kernel matrix (can be NULL) |
candidates |
Vector of candidate individual names |
test |
Vector of test individual names |
ntoselect |
Number of individuals to select for training |
criteria |
Vector of optimization criteria names |
criteria_types |
Vector indicating minimize/maximize for each criterion |
plot_directions |
Vector for plotting directions (1 or -1 for each criterion) |
npop |
Population size (default: 100) |
mutprob |
Mutation probability (default: 0.8) |
mutintensity |
Mutation intensity (default: 1) |
niterations |
Maximum iterations (default: 500) |
lambda |
Ridge regression parameter (default: 1e-6) |
plot_iterations |
Whether to plot during iterations (default: FALSE) |
mc.cores |
Number of cores for parallel processing (default: 1) |
initial_population |
Initial population (default: NULL) |
C |
Contrast matrix (default: NULL) |
axes_labels |
Labels for plot axes (default: NULL) |
archive_size |
Maximum size of Pareto archive (default: 200) |
diversity_maintenance |
Whether to maintain diversity (default: TRUE) |
List containing Pareto front and optimization results
Deniz Akdemir Multi-objective genetic algorithm for subset selection with test set
Multi-objective GA without test set (single population)
subset_ga_multiobjective_single( Pcs = NULL, Dist = NULL, Kernel = NULL, candidates, ntoselect, criteria, criteria_types, plot_directions, ... )subset_ga_multiobjective_single( Pcs = NULL, Dist = NULL, Kernel = NULL, candidates, ntoselect, criteria, criteria_types, plot_directions, ... )
Pcs |
Principal components matrix |
Dist |
Distance matrix |
Kernel |
Kernel matrix |
candidates |
Candidate individuals |
ntoselect |
Number to select |
criteria |
Optimization criteria |
criteria_types |
Minimize/maximize indicators |
plot_directions |
Plot directions |
... |
Additional parameters |
Multi-objective optimization results
Genetic algorithm for subset selection without test set (single objective)
subset_ga_single(P, ntoselect, ...)subset_ga_single(P, ntoselect, ...)
P |
Prediction matrix (individuals x markers/predictors) |
ntoselect |
Number of individuals to select |
... |
Additional parameters (same as subset_ga but without Test parameter) |
List containing best solution and statistics
Tournament selection
tournament_selection(population, fitness, tournament_size = 3, n_select = 1)tournament_selection(population, fitness, tournament_size = 3, n_select = 1)
population |
List of solutions |
fitness |
Vector of fitness values |
tournament_size |
Size of tournament (default: 3) |
n_select |
Number of individuals to select (default: 1) |
Indices of selected individuals
Fitness transformation and normalization utilities
transform_fitness(fitness, method = "linear", scaling_factor = 2)transform_fitness(fitness, method = "linear", scaling_factor = 2)
fitness |
Vector of fitness values |
method |
Transformation method: "linear", "rank", "tournament" (default: "linear") |
scaling_factor |
Scaling factor for transformations (default: 2) |
Transformed fitness values
Update Pareto archive with new solutions
update_pareto_archive( population, fitness, archive, archive_fitness, max_size, criteria_types )update_pareto_archive( population, fitness, archive, archive_fitness, max_size, criteria_types )
population |
Current population |
fitness |
Fitness matrix |
archive |
Current archive |
archive_fitness |
Archive fitness matrix |
max_size |
Maximum archive size |
criteria_types |
Optimization types (minimize/maximize) |
Updated archive and fitness
Update progress bar
update_progress(pb, increment = 1)update_progress(pb, increment = 1)
pb |
Progress bar object |
increment |
How much to increment (default: 1) |
Validate kinship matrix
validate_kinship_matrix(K, individual_names)validate_kinship_matrix(K, individual_names)
K |
Kinship matrix |
individual_names |
Names of individuals to check against |
NULL if valid, throws error if invalid
Comprehensive validation functions for STPGA package
validate_matrix_params( P, K = NULL, lambda = 1e-06, train = NULL, test = NULL, C = NULL )validate_matrix_params( P, K = NULL, lambda = 1e-06, train = NULL, test = NULL, C = NULL )
P |
Prediction matrix (individuals x variables) |
K |
Kinship matrix (optional, for mixed models) |
lambda |
Ridge regularization parameter |
train |
Training set names |
test |
Test set names (optional) |
C |
Contrast matrix (optional) |
NULL if valid, throws error if invalid
Deniz Akdemir Validate matrix parameters for optimization criteria
Validate and adjust selection parameters
validate_selection_parameters( selection_method, tournament_size = 3, selection_pressure = 1.5, population_size )validate_selection_parameters( selection_method, tournament_size = 3, selection_pressure = 1.5, population_size )
selection_method |
Selection method name |
tournament_size |
Tournament size (for tournament selection) |
selection_pressure |
Selection pressure (for rank selection) |
population_size |
Population size |
List with validated parameters and warnings
# Test parameter validation with valid parameters validation_good <- validate_selection_parameters( selection_method = "rank", tournament_size = 3, selection_pressure = 1.5, population_size = 50 ) print("Valid parameters:") print(paste("Method:", validation_good$selection_method)) print(paste("Tournament size:", validation_good$tournament_size)) print(paste("Selection pressure:", validation_good$selection_pressure)) print(paste("Warnings:", length(validation_good$warnings))) # Test parameter validation with invalid parameters validation_bad <- validate_selection_parameters( selection_method = "invalid_method", tournament_size = 100, # Too large selection_pressure = 3.0, # Out of range population_size = 25 ) print("Invalid parameters corrected:") print(paste("Method corrected to:", validation_bad$selection_method)) print(paste("Tournament size corrected to:", validation_bad$tournament_size)) print(paste("Selection pressure corrected to:", validation_bad$selection_pressure)) print(paste("Number of warnings:", length(validation_bad$warnings))) for (warning in validation_bad$warnings) { print(paste("Warning:", warning)) }# Test parameter validation with valid parameters validation_good <- validate_selection_parameters( selection_method = "rank", tournament_size = 3, selection_pressure = 1.5, population_size = 50 ) print("Valid parameters:") print(paste("Method:", validation_good$selection_method)) print(paste("Tournament size:", validation_good$tournament_size)) print(paste("Selection pressure:", validation_good$selection_pressure)) print(paste("Warnings:", length(validation_good$warnings))) # Test parameter validation with invalid parameters validation_bad <- validate_selection_parameters( selection_method = "invalid_method", tournament_size = 100, # Too large selection_pressure = 3.0, # Out of range population_size = 25 ) print("Invalid parameters corrected:") print(paste("Method corrected to:", validation_bad$selection_method)) print(paste("Tournament size corrected to:", validation_bad$tournament_size)) print(paste("Selection pressure corrected to:", validation_bad$selection_pressure)) print(paste("Number of warnings:", length(validation_bad$warnings))) for (warning in validation_bad$warnings) { print(paste("Warning:", warning)) }
Validate variance components
validate_variance_components(Vg, Ve, h2 = NULL)validate_variance_components(Vg, Ve, h2 = NULL)
Vg |
Genetic variance component |
Ve |
Error variance component |
h2 |
Heritability (optional, for cross-validation) |
NULL if valid, throws error if invalid
Containing the phenotypic observations 'Wheat.Y', markers 'Wheat.M' and genetic relationships 'Wheat.K'. The 4670 markers available for these 200 genotypes were pre-porecessed for missingness and minor ellele frequencies, coded numerically as -1, 0, and 1; the relationship genomic relationship matrix was calculated from these markers.
This dataset was obtained from https://triticeaetoolbox.org/.