The Must-Know Interview Questions for Evaluating ML Algorithms
How interviewers reason about loss functions, assumptions, and failure modes
Introduction
If you spend enough time preparing for machine learning interviews, something odd starts to happen. No matter which algorithm you study (linear regression, decision trees, SVMs, kNN, XGBoost), the questions begin to repeat.
You are asked about loss functions, missing data, imbalance, assumptions, overfitting, and interpretability. Interviewers are not testing whether you remember algorithms. They are testing whether you understand how to reason about models.
Instead of explaining algorithms one by one, we walk through the exact questions interviewers mentally apply to every model. For each question, we analyze how common algorithms behave, where they work well, where they break, and why.
Questions that we will answer:
Q1. What loss function does the algorithm optimize?
Q2. How does the algorithm handle missing data?
Q3. How does the algorithm handle imbalanced data?
Q4. What assumptions does the algorithm make about the data?
Q5. Where does the algorithm lie on the bias–variance spectrum?
Q6. How does the algorithm handle overfitting and regularization?
Q7. How sensitive is the algorithm to feature scaling and outliers?
Q8. How does the algorithm behave in high-dimensional data?
Q9. How interpretable is the model?
Q10. How does the model handle sparse features?
Q11. How does the algorithm handle correlated features?
Q12. When should you NOT use a model?
If you can answer these questions confidently, you can reason about any classical machine learning model, even ones you haven’t seen before. That is the level interviewers look for at senior applied scientist and data scientist roles.
Q1. What loss function does the algorithm optimize?
Every machine learning algorithm optimizes an objective function, either explicitly (via a defined loss) or implicitly (via greedy or heuristic criteria). The choice of loss determines what the model considers an error and how strongly different mistakes are penalized.
Below are the most commonly asked algorithms and the exact objectives they optimize.
Linear Regression: Mean Squared Error (MSE)
Penalizes large errors quadratically. Convex objective with a closed-form solution.
What this means in words:
• The model is penalized more for large errors than small ones
• Squaring the error makes outliers very influential
• The model tries to fit the average relationship in the data
Logistic Regression: Log Loss (Negative Log-Likelihood)
Strongly penalizes confident wrong predictions. Convex objective.
What this means in words:
• Confident wrong predictions are punished heavily
• Correct but uncertain predictions are still penalized
• The model is encouraged to output well-calibrated probabilities
Support Vector Machine (SVM): Hinge Loss
Focuses on margin violations rather than probabilities. Convex objective.
What this means in words:
• Only points near the decision boundary matter
• Correctly classified points far from the margin are ignored
• The model focuses on maximizing separation between classes
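The three convex losses described so far can be compared on single predictions. The following is a minimal sketch in plain Python; the function names and the label conventions (0/1 for log loss, -1/+1 for hinge loss) are choices made for this illustration, not part of any particular library.

```python
import math

def squared_error(y, y_hat):
    # Per-sample term of MSE: grows quadratically with the residual.
    return (y - y_hat) ** 2

def log_loss(y, p, eps=1e-12):
    # y is 0 or 1, p is the predicted probability of class 1.
    # Clipping avoids log(0) for extreme predictions.
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def hinge_loss(y, score):
    # y is -1 or +1, score is the raw decision value (not a probability).
    # Points with y * score >= 1 lie outside the margin and cost nothing.
    return max(0.0, 1.0 - y * score)

# A residual 10x larger costs 100x more under squared error.
print(squared_error(0.0, 1.0), squared_error(0.0, 10.0))
# A confident wrong prediction is punished far more than an uncertain correct one.
print(log_loss(1, 0.6), log_loss(1, 0.01))
# Far from the margin, inside the margin, and on the wrong side.
print(hinge_loss(+1, 3.0), hinge_loss(+1, 0.5), hinge_loss(+1, -1.0))
```

Note that log loss is never exactly zero for a probability below 1, while hinge loss is exactly zero once a point clears the margin. That difference is why logistic regression keeps adjusting to every point and an SVM depends only on points near the boundary.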
k-Nearest Neighbors (kNN): No Global Loss
kNN does not optimize a global objective function.
Predictions are made using local distance-based voting at inference time.
Naive Bayes: Maximum Likelihood / Posterior Maximization
Equivalent to maximizing likelihood under the conditional independence assumption.
What this means in words:
• Each feature contributes independently to the prediction
• The model combines evidence multiplicatively
• Strong independence assumptions simplify learning
Decision Tree: Impurity Minimization (Greedy)
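Gini Impurity: G = 1 − Σ_k p_k², where p_k is the fraction of samples in the node that belong to class k
Entropy: H = −Σ_k p_k log₂ p_k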
Optimized greedily at each split. No global loss function.
What this means in words:
• Each split tries to make child nodes purer than the parent
• The model learns simple, rule-based decisions
• Decisions are made greedily, not globally
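To make the split criterion concrete, here is a small sketch of both impurity measures and of the weighted child impurity that a greedy split tries to minimize. The helper names are hypothetical and chosen for this example.

```python
import math
from collections import Counter

def class_probs(labels):
    # Fraction of samples per class within a node.
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    # 1 - sum of squared class fractions; 0 for a pure node.
    return 1.0 - sum(p * p for p in class_probs(labels))

def entropy(labels):
    # Shannon entropy in bits; 0 for a pure node.
    return -sum(p * math.log2(p) for p in class_probs(labels))

def weighted_impurity(left, right, impurity=gini):
    # Size-weighted impurity of the two children of a candidate split.
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

parent = [0, 0, 1, 1]
print(gini(parent), entropy(parent))
# A split that separates the classes perfectly drives child impurity to zero.
print(weighted_impurity([0, 0], [1, 1]))
```

A tree learner evaluates this quantity for every candidate threshold at the current node and keeps the best one, which is what "greedy, not global" means in practice.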
Random Forest: Ensemble of Greedy Trees
No single global objective across the forest.
Each tree independently minimizes impurity; the ensemble reduces variance via averaging.
Gradient Boosting: Additive Loss Minimization
Sequentially adds weak learners to minimize a user-defined differentiable loss.
What this means in words:
• Each new model focuses on correcting past mistakes
• Errors are reduced step by step
• Weak learners combine into a strong model
XGBClassifier: Regularized Boosting Objective
Adds explicit regularization to control tree complexity and prevent overfitting.
XGBRegressor: Regularized Regression Objective
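Both objectives share the form Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k), with Ω(f) = γT + ½λ‖w‖² for a tree with T leaves and leaf weights w.
What this means in words:
• The first term in both measures prediction error (typically log loss for the classifier, squared error by default for the regressor)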
• The second term penalizes tree complexity
• γ controls the cost of adding new leaves
• λ controls leaf weight magnitude
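The complexity penalty can be sketched in a few lines. This assumes the commonly cited form Ω(f) = γT + ½λ Σ w_j², where T is the number of leaves and w_j are the leaf weights; the function names are hypothetical and this is not the library's internal code.

```python
def tree_penalty(leaf_weights, gamma, lam):
    # gamma charges a fixed cost per leaf; lam penalizes large leaf weights.
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)

def regularized_objective(per_sample_losses, trees, gamma=0.0, lam=1.0):
    # Total objective: data-fit term plus one penalty per tree in the ensemble.
    fit = sum(per_sample_losses)
    complexity = sum(tree_penalty(w, gamma, lam) for w in trees)
    return fit + complexity

# One tree with two leaves: 0.5 * 2 leaves + 0.5 * 2.0 * (1 + 4) = 6.0
print(tree_penalty([1.0, -2.0], gamma=0.5, lam=2.0))
print(regularized_objective([0.5, 0.5], [[1.0, -2.0]], gamma=0.5, lam=2.0))
```

Raising γ makes a split worthwhile only if it reduces the loss by more than the cost of the extra leaf, while raising λ pulls every leaf prediction toward zero.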
LightGBM: Histogram-based Gradient Boosting
Optimizes a regularized gradient boosting objective similar to that of XGBoost, but uses histogram-based splits and leaf-wise tree growth for efficiency.
AdaBoost: Exponential Loss
What this means in words:
• Misclassified points become increasingly important
• The model aggressively focuses on hard examples
• Noisy labels can dominate learning if not controlled
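The reweighting behind these points can be sketched for one boosting round. This follows the textbook discrete AdaBoost update with labels in {-1, +1} and assumes the incoming weights already sum to 1; the function name is hypothetical.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    # Weighted error of the current weak learner.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    err = min(max(err, 1e-12), 1 - 1e-12)  # guard against division by zero
    # Learner weight: larger when the weak learner is more accurate.
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified points (y * h = -1) are scaled up, correct ones scaled down.
    raw = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [+1, +1, -1, -1]
y_pred = [+1, +1, -1, +1]  # only the last point is misclassified
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(alpha, new_weights)
```

After one round the single misclassified point carries half of the total weight. If that point is a mislabeled example, later learners will keep chasing it, which is the noise sensitivity described above.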
Q2. How does the algorithm handle missing data?
Handling of missing data depends on whether the algorithm’s mathematical formulation can operate on incomplete feature vectors. Some models require explicit preprocessing, while others can incorporate missingness directly into training.
Linear Regression
• Cannot handle missing values directly
• Loss computation requires complete feature vectors
• Missing values must be imputed or rows dropped
• Missingness information is lost during preprocessing
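One common way to keep the missingness signal is to add an indicator column alongside the imputed values. A minimal sketch, assuming missing entries are represented as None and that mean imputation is acceptable for the feature:

```python
def impute_with_indicator(column):
    # Mean of the observed values only.
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    # Imputed feature plus a 0/1 flag recording where values were missing.
    filled = [mean if v is None else v for v in column]
    was_missing = [1 if v is None else 0 for v in column]
    return filled, was_missing

filled, was_missing = impute_with_indicator([1.0, None, 3.0])
print(filled, was_missing)
```

The linear model then receives two columns: the filled feature and the flag. The flag gets its own coefficient, so the model can still learn from the fact that a value was absent.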
Logistic Regression
• Same behavior as linear regression
• Probability computation breaks with missing inputs
• Requires imputation before training and inference
• Poor imputation can shift the decision boundary
Support Vector Machine (SVM)
• Does not support missing values natively
• Margin and kernel computations require complete data
• Missing values distort geometric relationships
• Imputation is mandatory
k-Nearest Neighbors (kNN)
• Extremely sensitive to missing values
• Distance metrics become undefined with missing components
• Partial-distance heuristics are unreliable
• Performance degrades rapidly with poor imputation
Naive Bayes
• Can naturally handle missing values
• Likelihood computed using only observed features
• Missing features contribute no evidence
• Works due to conditional independence assumption
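Skipping unobserved features is straightforward to sketch for a categorical Naive Bayes model. The probability tables below are made-up numbers for illustration, and None marks a missing feature.

```python
import math

def nb_log_posterior(x, log_prior, log_likelihood):
    # log_likelihood[c][j][v] = log P(feature j = v | class c)
    scores = {}
    for c, lp in log_prior.items():
        score = lp
        for j, v in enumerate(x):
            if v is None:
                continue  # missing feature: contributes no evidence
            score += log_likelihood[c][j][v]
        scores[c] = score
    return scores  # unnormalized log posteriors

log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
log_likelihood = {
    "spam": [{0: math.log(0.2), 1: math.log(0.8)}, {0: math.log(0.5), 1: math.log(0.5)}],
    "ham": [{0: math.log(0.9), 1: math.log(0.1)}, {0: math.log(0.3), 1: math.log(0.7)}],
}

# With nothing observed, the scores fall back to the class priors.
print(nb_log_posterior([None, None], log_prior, log_likelihood))
# With only the first feature observed, that feature alone shifts the decision.
print(nb_log_posterior([1, None], log_prior, log_likelihood))
```

This works only because the likelihood factorizes per feature. A model with interacting features has no equally clean way to drop a term.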
Decision Tree
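• Support depends on the implementation, not on the algorithm itself
• Classic CART uses surrogate splits; some libraries learn a default direction for missing values instead
• Missingness itself can be predictive
• Where supported, no explicit imputation is required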
Random Forest
• Inherits missing data handling from trees
• Different trees may route missing values differently
• Ensemble averaging stabilizes predictions
• Robust to moderate missingness
Gradient Boosting (GBM)
• Missing value handling depends on implementation
• Many implementations support default split directions
• Missingness patterns can be learned across iterations
• Should not assume native support blindly
XGBoost (Classifier)
• Handles missing values natively
• Learns optimal default direction at each split
• Missing values treated as informative signals
• Imputation often unnecessary
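The idea of a learned default direction can be illustrated with a simplified sketch: at a given split, try routing all missing rows left, then right, and keep whichever gives the lower error. This uses the sum of squared errors as a stand-in; the real library scores splits with gradient and Hessian statistics, so treat this as an analogy rather than its actual algorithm.

```python
def sse(values):
    # Sum of squared deviations from the mean (0 for an empty node).
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_default_direction(x, y, threshold):
    # Partition targets by whether the feature is below, above, or missing.
    left = [yi for xi, yi in zip(x, y) if xi is not None and xi < threshold]
    right = [yi for xi, yi in zip(x, y) if xi is not None and xi >= threshold]
    missing = [yi for xi, yi in zip(x, y) if xi is None]
    cost_if_left = sse(left + missing) + sse(right)
    cost_if_right = sse(left) + sse(right + missing)
    return "left" if cost_if_left <= cost_if_right else "right"

x = [1.0, 2.0, None, 8.0, 9.0, None]
y = [1.0, 1.2, 5.1, 5.0, 5.2, 4.9]
# Rows with a missing feature have targets resembling the right-hand group.
print(best_default_direction(x, y, threshold=5.0))
```

Because the direction is chosen from the training targets, a systematic missingness pattern ends up encoded in the tree rather than erased by imputation.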
XGBRegressor
• Same missing value handling as XGBoost classifier
• Regression trees learn optimal routing paths
• Minimizes error even with incomplete inputs
• Very effective for real-world tabular regression
LightGBM
• Handles missing values natively
• Treats missing values as a separate histogram bin
• Efficient for large-scale data
• Learns missingness patterns directly
AdaBoost
• Does not support missing values natively
• Weak learners assume complete data
• Sample reweighting amplifies noise from missing values
• Imputation required before training
Q3. How does the algorithm handle imbalanced data?
Imbalanced data affects how errors are perceived during training. Many algorithms implicitly optimize accuracy, which biases them toward the majority class unless corrective mechanisms such as reweighting, resampling, or loss modification are applied.
Logistic Regression
• Naturally biased toward majority class
• Optimizes log loss without class awareness by default
• Supports class-weighted loss
• Class weights increase penalty for minority misclassification
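Class weighting amounts to multiplying each sample's loss by a per-class factor. The sketch below uses the widely used "balanced" heuristic n_samples / (n_classes × class_count); the function names are hypothetical.

```python
import math
from collections import Counter

def balanced_class_weights(y):
    # Rare classes receive proportionally larger weights.
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}

def weighted_log_loss(y, p, weights, eps=1e-12):
    # y holds 0/1 labels, p holds predicted probabilities of class 1.
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)
        loss = -math.log(pi) if yi == 1 else -math.log(1 - pi)
        total += weights[yi] * loss
    return total / len(y)

y = [0] * 9 + [1]   # 9:1 imbalance
p = [0.1] * 10      # a model that always leans toward the majority class
weights = balanced_class_weights(y)
print(weights)
print(weighted_log_loss(y, p, {0: 1.0, 1: 1.0}), weighted_log_loss(y, p, weights))
```

Under uniform weights the lazy majority-leaning model looks acceptable. With balanced weights the single missed minority example dominates the loss, so the optimizer is pushed to fix it.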
Support Vector Machine (SVM)
• Margin influenced by majority class density
• Minority class points may be ignored
• Supports class-specific penalty parameters
• Higher penalty forces better minority separation
k-Nearest Neighbors (kNN)
• Strongly biased toward majority class
• Majority class dominates neighborhood counts
• No intrinsic imbalance correction
• Can use distance weighting, balanced sampling, or a different k per class
Naive Bayes
• Sensitive to class prior probabilities
• Majority class prior dominates posterior
• Can rebalance by modifying class priors
• Works better when likelihoods are highly informative
Decision Tree
• Impurity measures favor majority class
• Minority splits may be ignored early
• Supports class-weighted impurity
• Also supports balanced class sampling
Random Forest
• Same imbalance issues as decision trees
• Ensemble reduces variance, not bias
• Common fixes: class-weighted trees, balanced bootstrap sampling, adjusted decision thresholds
Gradient Boosting (GBM)
• Optimizes loss sequentially
• Minority errors persist longer across iterations
• Supports weighted loss functions
• Sensitive to noisy minority labels
XGBoost (Classifier)
• Explicit support for class imbalance
• Uses scale_pos_weight to rebalance gradients
• Affects gradient and Hessian computation
• More stable than resampling for large datasets
LightGBM
• Native support for class weights
• Efficient handling of large imbalanced datasets
• Leaf-wise growth may amplify imbalance if unchecked
• Requires careful regularization
AdaBoost
• Naturally emphasizes misclassified samples
• Minority samples gain weight quickly
• Can overfit noisy minority labels
• Requires early stopping or weight clipping
Q4. What assumptions does the algorithm make about the data?
Every machine learning algorithm encodes assumptions about how data is generated. These assumptions act as inductive bias. When they align with reality, the model performs well; when they are violated, performance degrades.
Linear Regression
• Assumes a linear relationship between features and target
• Assumes additive effects of features
• Assumes independent and identically distributed (i.i.d.) errors
• Assumes homoscedasticity (constant error variance)
• Assumes low multicollinearity among features
Violation effects:
• Biased coefficients
• Unstable estimates
• Poor extrapolation
Logistic Regression
• Assumes linear decision boundary in feature space
• Assumes log-odds are linear in features
• Assumes independent observations
• Assumes no strong multicollinearity
Violation effects:
• Underfitting on non-linear data
• Poor probability calibration
• Inflated coefficient variance
Support Vector Machine (SVM)
• Assumes data is separable (or nearly separable) in some feature space
• Kernel choice encodes assumptions about similarity
• Assumes margin-based separation is meaningful
Violation effects:
• Poor kernel choice leads to underfitting or overfitting
• Sensitive to noise near the margin
k-Nearest Neighbors (kNN)
• Assumes local smoothness of the target function
• Assumes nearby points have similar labels
• Assumes distance metric reflects true similarity
Violation effects:
• Curse of dimensionality
• Sensitivity to irrelevant features
• Poor performance in sparse spaces
Naive Bayes
• Assumes conditional independence of features given the class
• Assumes correct parametric form for feature distributions
Violation effects:
• Often still works surprisingly well
• Probability estimates become poorly calibrated
• Relative class ranking may remain accurate
Decision Tree
• Assumes data can be partitioned using axis-aligned rules
• Assumes hierarchical feature interactions
• No assumption of linearity or smoothness
Violation effects:
• High variance
• Unstable splits with small data changes
• Poor extrapolation beyond training range
Random Forest
• Same assumptions as decision trees
• Assumes variance can be reduced through averaging
• Assumes randomness decorrelates trees
Violation effects:
• Bias remains unchanged
• Interpretability decreases
• Poor performance on extrapolation tasks
Gradient Boosting (GBM)
• Assumes weak learners can iteratively reduce error
• Assumes additive model structure
• Sensitive to noise and outliers
Violation effects:
• Overfitting noisy patterns
• Slow convergence with poorly chosen loss
XGBoost (Classifier)
• Same assumptions as gradient boosting
• Assumes regularization controls complexity effectively
• Assumes tree-based feature interactions
Violation effects:
• Overfitting if regularization is weak
• Instability with extreme class noise
XGBRegressor
• Assumes regression function can be approximated by additive trees
• Assumes squared error (by default) is appropriate
• Captures non-linear, non-monotonic relationships
Violation effects:
• Poor performance on extreme extrapolation
• Sensitive to target outliers
LightGBM
• Same assumptions as boosting trees
• Assumes leaf-wise growth improves efficiency
• Assumes sufficient data to support deep leaves
Violation effects:
• Overfitting on small datasets
• Requires strong regularization
AdaBoost
• Assumes weak learners perform slightly better than random
• Assumes errors are informative
• Extremely sensitive to label noise
Violation effects:
• Exponential focus on noisy samples
• Rapid overfitting
Q5. Where does the algorithm lie on the bias–variance spectrum?
The bias–variance tradeoff describes how a model balances simplicity against flexibility. High-bias models make strong assumptions and underfit, while high-variance models are flexible but sensitive to noise. Interviewers ask this to test whether you understand generalization, not just training accuracy.
Linear Regression
• High bias, low variance
• Strong linearity assumptions limit flexibility
• Stable predictions across datasets
• Underfits complex, non-linear relationships
Implication:
• Performs well with small data and simple patterns
• Fails when true relationships are complex
Logistic Regression
• High bias, low variance
• Linear decision boundary restricts expressiveness
• Stable probability estimates with sufficient data
Implication:
• Good baseline classifier
• Underfits non-linearly separable data
Support Vector Machine (SVM)
• Bias–variance depends on kernel and regularization
• Linear SVM → higher bias, lower variance
• RBF / polynomial kernels → lower bias, higher variance
Implication:
• Flexible but sensitive to kernel choice
• Can overfit with complex kernels
k-Nearest Neighbors (kNN)
• Low bias, high variance for small k
• Bias increases as k increases
• Variance decreases as neighborhoods grow
Implication:
• Small k: fits noise
• Large k: oversmooths decision boundary
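The effect of k is easy to see on a toy one-dimensional dataset. In this sketch the data are invented: class "A" lives on the left, class "B" on the right, and the point at 2.5 is a deliberately mislabeled "B" inside the "A" region.

```python
from collections import Counter

def knn_predict(train, query, k):
    # train is a list of (x, label) pairs with scalar x.
    nearest = sorted(train, key=lambda pt: abs(pt[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [(0.0, "A"), (1.0, "A"), (2.0, "A"), (2.5, "B"),
         (3.0, "A"), (6.0, "B"), (7.0, "B")]

# k = 1 copies the noisy neighbor; k = 3 outvotes it.
print(knn_predict(train, 2.6, k=1), knn_predict(train, 2.6, k=3))
# k = 3 respects the local "B" region; k = 7 uses every point and
# always returns the global majority class.
print(knn_predict(train, 6.5, k=3), knn_predict(train, 6.5, k=7))
```

With k = 1 the prediction near 2.6 follows the noisy label (high variance). With k equal to the dataset size the model ignores location entirely (high bias).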
Naive Bayes
• High bias, very low variance
• Strong independence assumptions dominate behavior
• Extremely stable across datasets
Implication:
• Works surprisingly well with limited data
• Rarely overfits, often underfits
Decision Tree
• Low bias, high variance
• Highly flexible and expressive
• Small data changes lead to different trees
Implication:
• Fits training data very well
• Prone to overfitting without constraints
Random Forest
• Lower variance than decision trees
• Bias similar to individual trees
• Variance reduced through averaging
Implication:
• Strong generalization on tabular data
• Adding more trees does not by itself cause overfitting, although very deep trees on noisy data still can
Gradient Boosting (GBM)
• Low bias, potentially high variance
• Sequential error correction increases flexibility
• Sensitive to noise and learning rate
Implication:
• Excellent accuracy when tuned
• Requires careful regularization
XGBoost (Classifier)
• Low bias, controlled variance
• Explicit regularization stabilizes boosting
• Better bias–variance balance than vanilla GBM
Implication:
• Strong performance across many datasets
• Can still overfit if regularization is weak
XGBRegressor
• Low bias, controlled variance
• Models complex non-linear regression functions
• Sensitive to outliers due to squared loss
Implication:
• Excellent interpolation
• Requires regularization for noisy targets
LightGBM
• Very low bias, higher variance risk
• Leaf-wise growth increases model complexity
• Fast convergence amplifies overfitting risk
Implication:
• Very powerful on large datasets
• Dangerous on small datasets without tuning
AdaBoost
• Bias decreases rapidly, variance can explode
• Focuses aggressively on hard examples
• Extremely sensitive to noise
Implication:
• Strong on clean data
• Fails quickly with label noise
Q6. How does the algorithm handle overfitting and regularization?
Overfitting occurs when a model captures noise instead of signal. Different algorithms control overfitting in different ways: some through explicit penalties in the objective, others through structural constraints or implicit regularization.
Linear Regression
• Overfits when features are noisy or highly correlated
• Uses explicit regularization
L2 regularization (Ridge): adds λ Σ_j w_j² to the loss
L1 regularization (Lasso): adds λ Σ_j |w_j| to the loss
• L2 shrinks coefficients
• L1 induces sparsity and feature selection
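The difference between shrinking and zeroing can be seen in the simplest possible case: a single coefficient whose unregularized estimate is b. The sketch assumes the one-dimensional objectives written in the comments, which have the closed-form solutions shown; real solvers handle many correlated coefficients at once.

```python
def ridge_shrink(b, lam):
    # Minimizer of 0.5 * (w - b)**2 + 0.5 * lam * w**2:
    # proportional shrinkage that never reaches exactly zero for b != 0.
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    # Minimizer of 0.5 * (w - b)**2 + lam * abs(w) (soft-thresholding):
    # estimates smaller than lam in magnitude are set exactly to zero.
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

for b in (2.0, 0.3, -2.0):
    print(b, ridge_shrink(b, lam=1.0), lasso_shrink(b, lam=1.0))
```

The weak coefficient 0.3 survives ridge as a smaller non-zero value but is removed entirely by lasso, which is the feature-selection behavior described above.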
Logistic Regression
• Overfits with many features or weak signals
• Uses the same L1 / L2 penalties as linear regression
• Regularization directly controls decision boundary complexity
Implication:
• Regularization strength determines bias–variance tradeoff
Support Vector Machine (SVM)
• Uses margin maximization as implicit regularization
• Controlled by penalty parameter C
• Large C → low bias, high variance
• Small C → high bias, low variance
k-Nearest Neighbors (kNN)
• No explicit regularization term
• Regularization is controlled by choice of k
• Small k → overfitting
• Large k → underfitting
This makes kNN an example of implicit regularization.
Naive Bayes
• Rarely overfits due to strong independence assumptions
• Bias acts as implicit regularizer
• No explicit regularization parameter
Result:
• Stable but often underfit
Decision Tree
• Extremely prone to overfitting
• Uses structural regularization
Common controls:
• Maximum depth
• Minimum samples per leaf
• Minimum impurity decrease
• Post-pruning
Implication:
• Tree size directly controls variance
Random Forest
• Overfitting reduced through bagging
• Feature subsampling decorrelates trees
• Number of trees does not cause overfitting
Key controls:
• Tree depth
• Minimum samples per leaf
• Number of features per split
Gradient Boosting (GBM)
• High risk of overfitting without constraints
• Uses multiple regularization mechanisms
Common controls:
• Learning rate (shrinkage)
• Number of boosting rounds
• Tree depth
• Early stopping
Implication:
• A small learning rate combined with many trees usually generalizes better, at the cost of longer training
XGBoost (Classifier)
• Uses explicit regularization in the objective
• Penalizes number of leaves and leaf weights
• Supports early stopping
• Highly tunable regularization
Result:
• Strong control over overfitting
XGBRegressor
• Same regularization mechanisms as XGBoost classifier
• Particularly important due to squared error sensitivity
Controls:
• Tree depth
• Learning rate
• Regularization parameters (λ,γ)
LightGBM
• Uses similar regularization to XGBoost
• Leaf-wise growth increases overfitting risk
Key controls:
• Maximum depth
• Minimum data in leaf
• Feature fraction
• Bagging fraction
AdaBoost
• Overfitting controlled indirectly
• Early stopping is critical
• No explicit regularization term
Risk:
• Overfits rapidly with noisy data
Q7. How sensitive is the algorithm to feature scaling and outliers?
Feature scaling and outliers affect algorithms differently depending on whether they rely on distances, dot products, or ordering comparisons. Interviewers ask this to check whether you understand preprocessing requirements and robustness, not just model fitting.
Linear Regression
• Sensitive to outliers due to squared error loss
• Feature scaling does not change the predictions of unregularized least squares, but affects optimization speed and numerical stability
• Large-magnitude features can dominate gradient updates
Implication:
• Scaling recommended
• Outlier handling (clipping, robust loss) often required
Logistic Regression
• Sensitive to outliers in feature space
• Feature scaling improves convergence and stability
• Unscaled features distort regularization effects
Implication:
• Scaling strongly recommended
• Outliers can lead to overconfident probabilities
Support Vector Machine (SVM)
• Highly sensitive to feature scaling
• Distance and margin computations depend on scale
• Outliers near margin can dominate optimization
Implication:
• Scaling is mandatory
• Robust kernels or soft margins needed for noisy data
k-Nearest Neighbors (kNN)
• Extremely sensitive to feature scaling
• Distance metric directly defines model behavior
• Outliers distort neighborhood structure
Implication:
• Scaling is mandatory
• Outlier removal significantly improves performance
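A small example shows how an unscaled feature can take over the distance computation. The rows below are invented (age in years, income in dollars), and the helper names are chosen for this sketch.

```python
import math
from statistics import mean, pstdev

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    # Rescale each column to zero mean and unit variance.
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def nearest_index(rows, query_idx):
    # Index of the closest other row to rows[query_idx].
    query = rows[query_idx]
    others = [i for i in range(len(rows)) if i != query_idx]
    return min(others, key=lambda i: euclidean(rows[i], query))

rows = [[25, 50_000], [26, 90_000], [60, 51_000], [40, 70_000]]

# Raw units: income dominates, so the 60-year-old with a similar salary wins.
print(nearest_index(rows, 0))
# Standardized: both features count, and a different neighbor is selected.
print(nearest_index(standardize(rows), 0))
```

Nothing about the data changed between the two calls. Only the units did, which is why scaling is treated as part of the model for distance-based methods.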
Naive Bayes
• Scaling generally not required
• Outliers affect likelihood estimates depending on distribution
• Gaussian Naive Bayes sensitive to extreme values
Implication:
• Robust to scaling
• Sensitive to distributional mismatch
Decision Tree
• Insensitive to feature scaling
• Uses threshold-based splits
• Moderately robust to outliers
Implication:
• Scaling unnecessary
• Outliers may still affect split placement
Random Forest
• Same scaling behavior as decision trees
• Outliers diluted across trees
• More robust than a single tree
Implication:
• No scaling needed
• Handles outliers reasonably well
Gradient Boosting (GBM)
• Tree-based boosting is scale-invariant
• Sensitive to outliers through loss function
• Squared loss amplifies outlier influence
Implication:
• No scaling needed
• Robust losses improve stability
XGBoost (Classifier)
• Feature scaling not required
• Outliers influence gradients and Hessians
• Supports alternative loss functions
Implication:
• Robust with proper regularization
• Care needed for noisy targets
XGBRegressor
• Not sensitive to feature scaling
• Highly sensitive to target outliers
• Squared error dominates optimization
Implication:
• Consider robust losses or target transformation
LightGBM
• Scale-invariant for features
• Sensitive to outliers via loss function
• Histogram binning can dampen extreme values
Implication:
• No scaling required
• Still requires careful loss selection
AdaBoost
• Sensitive to outliers
• Misclassified outliers receive exponentially increasing weight
Implication:
• Outliers can dominate learning
• Requires clean labels or early stopping
Q8. How does the algorithm behave in high-dimensional data?
High-dimensional data refers to settings where the number of features is large relative to the number of samples, or where many features are irrelevant, redundant, or sparse. In such regimes, the geometry of the data changes, and algorithms behave very differently depending on what they rely on: distances, projections, or splits.
Linear and Logistic Regression
• Performance degrades with many irrelevant or weakly informative features
• Multicollinearity becomes more likely
• Variance of coefficient estimates increases
• Without regularization, the model overfits easily
What helps:
• L2 regularization to stabilize coefficients
• L1 regularization to perform feature selection
• Dimensionality reduction or careful feature engineering
Net effect:
• Can work well in high dimensions if regularized
• Fails when signal-to-noise ratio is low
Support Vector Machine (SVM)
• Performs surprisingly well in high dimensions when a clear margin exists
• Linear SVMs scale better than kernel SVMs
• Kernel SVMs become computationally expensive as the number of samples grows
Why:
• Margin maximization depends on a small subset of points (support vectors)
• But kernel methods must evaluate pairwise similarities, so training cost grows quickly with sample count
Net effect:
• Linear SVM is a strong choice for very high-dimensional sparse data
• Kernel SVMs are usually avoided
k-Nearest Neighbors (kNN)
• Suffers the most in high-dimensional spaces
• Distances between nearest and farthest neighbors become almost identical
• Nearest neighbors stop being meaningful
Why:
• Distance metrics lose contrast as dimensions increase
• Irrelevant features dominate similarity calculations
Net effect:
• Performance collapses rapidly
• kNN is generally unsuitable for high-dimensional data
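The loss of distance contrast can be observed directly with random data. This sketch draws points uniformly from the unit hypercube, which is an assumption made for illustration, and measures how much farther the farthest point is than the nearest one.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    # (farthest - nearest) / nearest distance from one random query point.
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 500):
    print(dim, round(relative_contrast(dim), 3))
```

In low dimensions the nearest point is many times closer than the farthest. As the dimension grows the ratio shrinks toward zero, so "nearest" carries less and less information.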
Naive Bayes
• Handles high-dimensional data extremely well
• Commonly used in text and bag-of-words representations
• Independence assumption simplifies learning
Why:
• Each feature contributes independently
• Sparsity and dimensionality do not significantly increase variance
Net effect:
• Strong baseline for high-dimensional sparse problems
• Probability calibration may be poor, but classification remains effective
Decision Trees
• Can handle high-dimensional data but become unstable
• Tend to pick dominant features early
• High variance increases with feature count
Why:
• Greedy splitting over many features amplifies noise
• Small data changes can lead to different split choices
Net effect:
• Single trees overfit easily in high dimensions
• Rarely used alone in such settings
Random Forest
• More robust than a single tree
• Feature subsampling mitigates high dimensionality
• Still affected by many irrelevant features
Why:
• Random feature selection reduces correlation between trees
• Averaging reduces variance
Net effect:
• Performs reasonably well in high dimensions
• Feature importance becomes less reliable
Gradient Boosting / XGBoost / LightGBM
• Performs very well on high-dimensional tabular data
• Learns useful feature interactions
• Sensitive to noise and requires regularization
Why:
• Sequential learning focuses on residual structure
• Tree-based learners ignore irrelevant features naturally
Net effect:
• Often state-of-the-art for high-dimensional tabular problems
• Requires careful tuning to avoid overfitting
AdaBoost
• Can handle moderate dimensionality
• Sensitive to noisy and redundant features
Why:
• Misclassified points get increasing influence
• Noise accumulates faster in high dimensions
Net effect:
• Effective when signal is strong
• Unstable when noise dominates
Q9. How interpretable is the model?
Interpretability refers to how easily humans can understand why a model makes a particular prediction. This can mean understanding the model globally (overall behavior) or locally (individual predictions). Different algorithms trade interpretability for flexibility and performance in very different ways.
Interviewers ask this question to assess whether you understand trust, debugging, and real-world deployment constraints, not just accuracy.
Two types of interpretability
Global interpretability
• Understanding the overall logic of the model
• Knowing which features matter and how they affect predictions
Local interpretability
• Explaining a single prediction
• Answering “why did the model predict this outcome for this example?”
Different models excel at different types.
Linear Regression
• Highly interpretable globally
• Each coefficient represents the marginal effect of a feature
• Sign and magnitude of coefficients are meaningful
Limitations:
• Interpretation breaks under multicollinearity
• Assumes linear, additive effects
Net effect:
• A strong default when interpretability is the priority
• Common in regulated domains
Logistic Regression
• Interpretable in terms of log-odds
• Coefficients indicate direction and strength of influence
• Easy to communicate to non-technical stakeholders
Limitations:
• Non-linear relationships are not captured
• Probabilities can be misinterpreted
Net effect:
• Strong balance between interpretability and performance
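The log-odds interpretation can be checked numerically: raising a feature by one unit multiplies the odds by exp(coefficient), regardless of the starting value. The intercept and coefficient below are hypothetical values chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

intercept, beta = -1.0, 0.7  # hypothetical fitted parameters

p_before = sigmoid(intercept + beta * 2.0)
p_after = sigmoid(intercept + beta * 3.0)  # same feature, one unit higher

odds_ratio = odds(p_after) / odds(p_before)
print(p_before, p_after)
print(odds_ratio, math.exp(beta))
```

The change in probability depends on where you start, but the odds ratio is constant. That is why coefficients are usually reported as odds ratios rather than as probability changes.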
Support Vector Machine (SVM)
• Linear SVM is interpretable via weights and margin
• Kernel SVM is largely a black box
Why:
• Kernel trick hides the feature space transformation
Net effect:
• Interpretability depends entirely on kernel choice
k-Nearest Neighbors (kNN)
• Locally interpretable
• Prediction can be explained by pointing to nearest neighbors
Limitations:
• No global explanation
• Hard to summarize overall behavior
Net effect:
• Intuitive but not scalable for explanation
Naive Bayes
• Moderately interpretable
• Feature likelihoods indicate contribution to classes
Limitations:
• Independence assumption oversimplifies reality
• Probability estimates often poorly calibrated
Net effect:
• Useful for understanding dominant signals, not precise reasoning
Decision Tree
• Highly interpretable both globally and locally
• Decisions expressed as if-then rules
• Easy to visualize and debug
Limitations:
• Large trees become hard to interpret
• Small data changes can alter structure
Net effect:
• Gold standard for rule-based interpretability
Random Forest
• Individual trees are interpretable
• Ensemble behavior is not
• Feature importance is aggregated and approximate
Limitations:
• Feature importance can be misleading with correlated features
Net effect:
• Partial interpretability, mainly at feature level
Gradient Boosting / XGBoost / LightGBM
• Low inherent interpretability
• Feature importance is heuristic
• Decision logic is distributed across many trees
Why:
• Sequential error correction obscures reasoning
Net effect:
• Requires post-hoc explainability methods (e.g., SHAP)
AdaBoost
• Weak learners are interpretable
• Ensemble behavior is opaque
• Hard to trace final prediction logic
Net effect:
• Limited interpretability beyond feature importance
Post-hoc interpretability methods
Used when models are inherently complex:
• Feature importance
• Partial dependence plots
• SHAP / LIME explanations
Important caveat:
• These explain the model’s behavior, not ground truth
• They can be misleading if misused
Q10. How does the model handle sparse features?
Sparse features are features where most values are zero or missing. This is common in text data (bag-of-words, TF-IDF), recommender systems (user–item matrices), and high-dimensional tabular data with many optional attributes.
How well a model handles sparsity depends on:
• Whether it can ignore zero-valued features efficiently
• Whether zeros carry semantic meaning
• Whether the model relies on distances, dot products, or splits
Core challenge of sparse data
• Most features contain no information for a given sample
• Signal is spread across many dimensions
• Memory and computation can become inefficient
• Distance-based similarity becomes unreliable
Different algorithms react very differently to this structure.
Linear Regression
• Handles sparse features well mathematically
• Dot-product formulation naturally ignores zeros
• Efficient with sparse matrix representations
Limitations:
• Overfitting risk with many sparse, weak features
• Coefficients can become unstable without regularization
What helps:
• L1 regularization for feature selection
• L2 regularization for coefficient stability
Net effect:
• Performs well with sparse data when regularized
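The reason linear models cope well with sparsity is visible in the prediction itself: only non-zero features need to be visited. A minimal sketch using a dictionary as a stand-in for a sparse vector (the representation and names are choices made for this example):

```python
def sparse_dot(x, w):
    # x maps feature index -> non-zero value; absent indices are implicit zeros.
    # Cost is proportional to the number of non-zeros, not the vocabulary size.
    return sum(value * w.get(index, 0.0) for index, value in x.items())

# A document with two active features out of a potentially huge vocabulary.
x = {3: 2.0, 10: 1.0}
w = {3: 0.5, 7: 9.0, 10: -0.25}

print(sparse_dot(x, w))  # feature 7 has a large weight but is never touched
```

Sparse matrix formats used by numerical libraries apply the same idea at scale, storing and multiplying only the non-zero entries.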
Logistic Regression
• Same sparsity behavior as linear regression
• Commonly used for high-dimensional sparse classification
• Works efficiently with sparse inputs
Limitations:
• Linear decision boundary limits expressiveness
• Needs regularization to suppress noise
Net effect:
• Strong baseline for sparse classification problems
Support Vector Machine (SVM)
• Linear SVM handles sparse features very well
• Kernel SVM scales poorly with sparse, high-dimensional data
Why:
• Linear SVM relies on dot products
• Kernel methods densify the representation
Net effect:
• Linear SVM is a strong choice for sparse data
• Kernel SVM is usually avoided
k-Nearest Neighbors (kNN)
• Performs poorly with sparse features
• Distance metrics break down when vectors are mostly zeros
• Similarity becomes dominated by noise
Why:
• Sparse vectors often look equally distant
• Irrelevant non-zero entries distort neighborhoods
Net effect:
• kNN is generally unsuitable for sparse data
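The "equally distant" effect can be reproduced with a small simulation. This is a sketch under assumed settings (random binary vectors, 10,000 features, 5 active features per sample); real data has more structure, so treat it as an illustration of the tendency rather than a benchmark.

```python
import random
from collections import Counter
from itertools import combinations

random.seed(0)

N_FEATURES = 10_000   # feature-space width (illustrative assumption)
N_ACTIVE = 5          # non-zero features per sample (illustrative assumption)
N_SAMPLES = 50

# Each sample is stored as the set of its active (value 1) feature indices.
samples = [frozenset(random.sample(range(N_FEATURES), N_ACTIVE))
           for _ in range(N_SAMPLES)]


def squared_distance(a, b):
    """Squared Euclidean distance between two binary vectors given as index sets."""
    return len(a ^ b)   # features active in exactly one of the two vectors


distances = Counter(squared_distance(a, b) for a, b in combinations(samples, 2))
most_common_distance, count = distances.most_common(1)[0]
share = count / sum(distances.values())
print(f"{share:.1%} of pairs are at squared distance {most_common_distance}")
```

Because two random sparse vectors rarely share an active feature, almost every pair ends up at the same distance (twice the number of active features), so "nearest" carries very little information.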
Naive Bayes
• Extremely effective with sparse features
• Well suited to high-dimensional sparse inputs
• Widely used in text classification
Why:
• Features contribute independently
• In the multinomial variant, a zero count simply adds no evidence to the class score
Net effect:
• One of the best models for sparse categorical data
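A minimal sketch of multinomial-style scoring shows why zeros are free. The function name, the toy vocabulary, and the hand-picked probabilities are assumptions for illustration; a Bernoulli variant would also score absent features, so this applies to the multinomial form only.

```python
import math


def nb_log_score(doc_counts, log_prior, log_word_prob):
    """Unnormalized log-posterior of one class for a bag-of-words document.

    doc_counts: dict word -> count, holding only words that occur.
    log_word_prob: dict word -> log P(word | class) for this class.
    """
    # A word with count 0 would contribute 0 * log P(word | class) = 0,
    # so only the words that actually occur need to be visited.
    return log_prior + sum(count * log_word_prob[word]
                           for word, count in doc_counts.items())


# Toy, hand-picked probabilities for a slice of a larger vocabulary
# (assumed to be already smoothed).
log_p_spam = {"free": math.log(0.30), "meeting": math.log(0.05), "offer": math.log(0.25)}
log_p_ham = {"free": math.log(0.05), "meeting": math.log(0.40), "offer": math.log(0.05)}

doc = {"free": 2, "offer": 1}   # "meeting" is absent and is never touched

spam_score = nb_log_score(doc, math.log(0.5), log_p_spam)
ham_score = nb_log_score(doc, math.log(0.5), log_p_ham)
predicted = "spam" if spam_score > ham_score else "ham"
```

Scoring cost depends on the number of distinct words in the document, not on the vocabulary size.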
Decision Tree
• Handles sparse features inconsistently
• Zero values may dominate early splits
• Sparse signals can be ignored if infrequent
Why:
• Trees prefer features with strong, frequent splits
• Rare but important features may be missed
Net effect:
• Single trees are unreliable with extreme sparsity
Random Forest
• More robust than single trees
• Feature subsampling helps expose sparse signals
• Still biased toward frequently active features
Net effect:
• Works moderately well
• Feature importance may be misleading
Gradient Boosting / XGBoost / LightGBM
• Very strong performance with sparse features
• Explicitly optimized for sparse inputs
• Can learn interactions among rare features
Why:
• Sparsity-aware split finding scans only the non-missing entries and learns a default direction for the rest
• Boosting focuses on residual signal
Net effect:
• Often state-of-the-art for sparse tabular data
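One reason boosted trees cope with sparse inputs is that split search can skip absent entries. The sketch below is a simplified illustration in the spirit of sparsity-aware split finding: it scans only the rows where a feature is present and tries sending all absent rows either left or right. The function name, the input layout, and the simplified gain (no constant factor, no complexity penalty) are assumptions for illustration, not XGBoost's or LightGBM's actual code.

```python
def best_split_with_default(present, g_missing, h_missing, lam=1.0):
    """Find a threshold and a default direction for absent values.

    present: list of (value, grad, hess) for rows where the feature is present.
    g_missing, h_missing: summed gradient / hessian of rows where it is absent.
    Returns (gain, threshold, default_direction).
    """
    def score(g, h):
        return g * g / (h + lam)

    present = sorted(present)
    g_total = sum(g for _, g, _ in present) + g_missing
    h_total = sum(h for _, _, h in present) + h_missing
    parent = score(g_total, h_total)

    best = (0.0, None, None)
    g_left = h_left = 0.0
    # Only present rows are scanned; absent rows enter through their sums.
    for i in range(len(present) - 1):
        value, g, h = present[i]
        g_left += g
        h_left += h
        next_value = present[i + 1][0]
        if next_value == value:
            continue  # no valid threshold between equal values
        threshold = (value + next_value) / 2
        candidates = (
            ("right", g_left, h_left),                          # absent rows go right
            ("left", g_left + g_missing, h_left + h_missing),   # absent rows go left
        )
        for direction, gl, hl in candidates:
            gain = score(gl, hl) + score(g_total - gl, h_total - hl) - parent
            if gain > best[0]:
                best = (gain, threshold, direction)
    return best


# Four present rows plus four absent rows whose gradients resemble the high-value group.
present_rows = [(1.0, -1.0, 1.0), (1.0, -1.0, 1.0), (5.0, 1.0, 1.0), (5.0, 1.0, 1.0)]
gain, threshold, direction = best_split_with_default(present_rows, g_missing=4.0, h_missing=4.0)
```

In this toy case the absent rows are routed to the right branch, because their gradients match the rows with large feature values. The work done is proportional to the number of present entries, which is what makes very sparse columns cheap to evaluate.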
XGBRegressor
• Same sparse-handling behavior as the XGBoost classifier
• Sparse features do not harm optimization
• Efficient memory usage with sparse-aware algorithms
Net effect:
• Excellent for sparse regression problems
LightGBM
• Designed with native sparse optimization
• Treats missing and zero values efficiently
• Histogram-based splitting improves performance
Net effect:
• One of the best choices for large sparse datasets
AdaBoost
• Can struggle with extreme sparsity
• Weak learners may not capture rare signals
• Sensitive to noisy sparse features
Net effect:
• Works only when sparse features are informative and clean
Q11. How does the algorithm handle correlated features?
Correlated features are features that carry overlapping or redundant information. Correlation is common in real datasets due to duplicated signals, derived features, or measurement artifacts. Algorithms differ in how they react to this redundancy depending on whether they estimate coefficients, distances, or decision rules.
Why correlated features matter
• They do not necessarily hurt predictive accuracy
• They do affect coefficient stability and interpretability
• They can bias feature importance measures
• They can reduce the effectiveness of ensembling
The impact depends on the algorithm family.
Linear Regression
• Highly sensitive to correlated features (multicollinearity)
• Coefficient estimates become unstable
• Small data changes cause large coefficient shifts
What happens:
• Predictions may remain accurate
• Individual coefficients lose meaning
Mitigation:
• L2 regularization stabilizes coefficients
• L1 regularization tends to keep one feature from a correlated group and zero out the rest, often arbitrarily
• Dimensionality reduction (PCA)
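The contrast between unstable coefficients and stable predictions can be sketched with NumPy. The synthetic data, the noise levels, the penalty `alpha=1.0`, and the helper names `fit_ols` and `fit_ridge` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)      # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * rng.normal(size=n)   # only the shared signal matters


def fit_ols(X, y):
    """Ordinary least squares without an intercept."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge without an intercept: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


ols = fit_ols(X, y)
ridge = fit_ridge(X, y)

print("OLS coefficients:  ", ols, "sum =", ols.sum())
print("Ridge coefficients:", ridge, "sum =", ridge.sum())
```

In both fits the coefficients sum to roughly 3, so predictions are fine. The difference is in how that total is divided: OLS can split it almost arbitrarily between the two columns, while the L2 penalty pushes the split toward equal shares.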
Logistic Regression
• Same multicollinearity issues as linear regression
• Inflated variance in coefficient estimates
• Interpretation of odds ratios becomes unreliable
Mitigation:
• Regularization
• Feature selection
Support Vector Machine (SVM)
• Correlated features are less problematic for prediction
• Redundant features increase computation
• Kernel methods can amplify redundancy
Net effect:
• Accuracy is often unaffected
• Feature relevance is harder to interpret
k-Nearest Neighbors (kNN)
• Correlated features distort distance metrics
• Redundant dimensions overweight certain signals
Result:
• Nearest neighbors become biased
• Model performance degrades
Mitigation:
• Feature scaling combined with decorrelation (whitening or a Mahalanobis distance)
• Dimensionality reduction
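A tiny example shows how redundant dimensions overweight a signal. The points, the query, and the helper `with_copies` are hypothetical and exist only to make the effect visible.

```python
import math


def nearest(query, points):
    """Return the label of the point closest to `query` (Euclidean distance)."""
    return min(points, key=lambda label: math.dist(query, points[label]))


# Two features: feature 0 and feature 1.
points = {"A": (1.0, 0.0), "B": (0.0, 1.2)}
query = (0.0, 0.0)
before = nearest(query, points)            # A is closer: 1.0 < 1.2


def with_copies(vector, n_copies):
    """Prepend extra copies of feature 0, mimicking perfectly correlated columns."""
    return (vector[0],) * n_copies + vector


redundant_points = {label: with_copies(v, 2) for label, v in points.items()}
redundant_query = with_copies(query, 2)
after = nearest(redundant_query, redundant_points)   # B is closer: 1.2 < sqrt(3)
```

Nothing about the data changed except that feature 0 now appears three times, yet the nearest neighbor flips from A to B because differences along feature 0 are counted three times in the distance.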
Naive Bayes
• Correlated features violate independence assumption
• Evidence is effectively double-counted
What happens:
• Probabilities become poorly calibrated
• Classification accuracy often remains reasonable
Net effect:
• Ranking may still work
• Confidence estimates are unreliable
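Double-counting is easy to see with numbers. The sketch below assumes a two-class problem with one informative binary feature and hand-picked likelihoods; the function name is illustrative.

```python
def posterior_positive(prior, likelihoods_pos, likelihoods_neg):
    """Naive Bayes posterior P(positive | features) for a two-class problem.

    Each list holds P(feature_i = observed value | class) for one class.
    """
    joint_pos = prior
    joint_neg = 1.0 - prior
    for p_pos, p_neg in zip(likelihoods_pos, likelihoods_neg):
        joint_pos *= p_pos      # features are multiplied as if independent
        joint_neg *= p_neg
    return joint_pos / (joint_pos + joint_neg)


# One informative feature: P(x | pos) = 0.8, P(x | neg) = 0.2.
single = posterior_positive(0.5, [0.8], [0.2])

# The same feature entered twice (a perfectly correlated duplicate).
doubled = posterior_positive(0.5, [0.8, 0.8], [0.2, 0.2])
```

The posterior moves from 0.80 to about 0.94 even though the duplicate adds no new information. The predicted class is unchanged, which is why accuracy and ranking can survive while the probabilities become overconfident.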
Decision Tree
• Arbitrarily selects one feature among correlated ones
• Split selection becomes unstable
Result:
• Different trees choose different correlated features
• Feature importance becomes unreliable
Random Forest
• Correlated features reduce tree diversity
• Ensemble benefit diminishes
• Feature importance is split across correlated variables, so each can look less important than it is
Net effect:
• Accuracy often remains strong
• Interpretation suffers significantly
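The loss of ensemble benefit follows from the standard variance formula for an average of identically distributed, pairwise-correlated predictors. Treating trees this way is a modeling assumption, and the link from correlated features to correlated trees is indirect, so read the sketch as intuition rather than an exact description of a forest.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the average of `n_trees` predictors, each with variance
    `sigma2` and pairwise correlation `rho`."""
    # The first term does not shrink as trees are added; only the second does.
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


for rho in (0.0, 0.5, 0.9):
    print(rho, ensemble_variance(sigma2=1.0, rho=rho, n_trees=100))
```

With uncorrelated trees, averaging 100 of them cuts the variance by a factor of 100. With a pairwise correlation of 0.5, the variance cannot drop below half of a single tree's variance no matter how many trees are added.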
Gradient Boosting / XGBoost / LightGBM
• Handles correlated features reasonably well
• Tends to repeatedly select one dominant feature
• Importance scores are skewed
Why:
• Greedy splitting favors features with early gains
Net effect:
• Performance is usually unaffected
• Feature attribution is unreliable
XGBRegressor
• Same behavior as the XGBoost classifier
• Correlated predictors are interchangeable
• Attribution instability increases
AdaBoost
• Sensitive to redundant weak learners
• May repeatedly focus on the same correlated signal
Result:
• Reduced ensemble diversity
• Faster overfitting
Q12. When should you NOT use a model?
A model should not be used when its known failure modes match the properties of your data.
1. When the model’s assumptions are clearly violated
Every model encodes assumptions. When these are badly violated, performance degrades in predictable ways.
Linear / Logistic Regression
Do not use when:
• Relationships are highly non-linear
• Feature interactions dominate outcomes
• Strong multicollinearity is present and interpretation matters
Why:
• The model underfits and gives misleading coefficients
Naive Bayes
Do not use when:
• Features are strongly dependent
• Accurate probability calibration is required
Why:
• Independence assumption is violated
• Probabilities become unreliable even if accuracy is decent
k-Nearest Neighbors (kNN)
Do not use when:
• Data is high-dimensional
• Features are sparse
• Low-latency inference is required
Why:
• Distances lose meaning
• Inference cost grows with dataset size
SVM (Kernel)
Do not use when:
• Dataset is very large
• Model must be interpretable
• Training time is constrained
Why:
• Kernel methods scale poorly
• Hard to explain decisions
2. When data size does not support model complexity
More complex models need more data to generalize.
Decision Tree
Do not use when:
• Dataset is small and noisy
• Stability is important
Why:
• Trees are high-variance models
• Small data changes produce different trees
Gradient Boosting / XGBoost / LightGBM
Do not use when:
• Dataset is extremely small
• Labels are very noisy
• You cannot tune hyperparameters carefully
Why:
• Boosting amplifies noise
• Easy to overfit without regularization
Deep Ensembles (in general)
Do not use when:
• Simpler models already perform well
• Interpretability is required
• Debuggability is critical
Why:
• Complexity adds fragility without guaranteed gains
3. When interpretability or trust is a hard requirement
Some problems prioritize explainability over raw accuracy.
Do not use complex models when:
• Decisions affect humans directly (finance, healthcare, policy)
• Regulatory compliance is required
• Stakeholders need clear reasoning
Avoid:
• XGBoost / LightGBM
• Kernel SVMs
• Large ensembles
Prefer:
• Linear models
• Decision trees
• Rule-based systems
4. When computational constraints dominate
Some models are impractical despite good accuracy.
kNN
Do not use when:
• Real-time inference is needed
• Dataset is large
Why:
• Prediction requires scanning the dataset
Kernel SVM
Do not use when:
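• Data size grows beyond tens of thousands of samples
• Memory is limited
Why:
• Training time and kernel-matrix memory grow roughly quadratically with the number of samples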
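A back-of-the-envelope calculation shows why sample count is the bottleneck for kernel methods. The sketch assumes the full n × n kernel matrix is stored as float64; practical solvers usually cache only part of it, so this is an illustration of quadratic growth rather than the footprint of any specific library.

```python
def kernel_matrix_gib(n_samples, bytes_per_value=8):
    """Memory for a dense n x n kernel matrix, in GiB (float64 by default)."""
    return n_samples ** 2 * bytes_per_value / 2 ** 30


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} samples -> {kernel_matrix_gib(n):8.2f} GiB")
```

Multiplying the number of samples by 10 multiplies the matrix size by 100, which is why a model that is comfortable at ten thousand rows becomes impractical at a few hundred thousand.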
Boosting Models
Do not use when:
• Latency budgets are extremely tight
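• Model size must be minimal
Why:
• Every prediction evaluates all trees in the ensemble, and the stored model grows with the number of trees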
5. When data properties actively harm the model
Severe class imbalance + noisy labels
Avoid:
• AdaBoost
• Aggressive boosting
Why:
• Misclassified noisy points dominate learning
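The way noisy points take over can be seen in a single round of the classic discrete AdaBoost weight update. The ten-sample setup and the function name are assumptions for illustration; the update rule itself is the textbook one.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One round of the classic (discrete) AdaBoost weight update.

    weights: current sample weights, summing to 1.
    misclassified: list of bools, True where the weak learner was wrong.
    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1.0 - err) / err)        # the learner's vote
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated]


# Ten samples with equal weight; sample 0 has a flipped (noisy) label,
# so a sensible weak learner gets it "wrong".
weights = [0.1] * 10
misclassified = [True] + [False] * 9

new_weights = adaboost_reweight(weights, misclassified)
```

After one round, the single mislabeled sample carries half of the total weight, so the next weak learner is pushed hard toward fitting it. With many noisy labels, a large share of the model's capacity goes into memorizing errors.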
Heavy-tailed targets with squared loss
Avoid:
• XGBRegressor with the default squared-error loss
Why:
• Outliers dominate optimization
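A constant prediction is enough to show how one extreme target drags a squared-loss fit. The toy targets below are made up for illustration. The underlying facts are standard: the mean minimizes squared error and the median minimizes absolute error.

```python
import statistics

targets = [10, 11, 9, 10, 12, 10, 1000]   # one heavy-tail observation (toy data)


def mean_squared_error(prediction):
    return sum((t - prediction) ** 2 for t in targets) / len(targets)


def mean_absolute_error(prediction):
    return sum(abs(t - prediction) for t in targets) / len(targets)


best_for_squared = statistics.mean(targets)      # minimizes squared error
best_for_absolute = statistics.median(targets)   # minimizes absolute error

print("squared-loss optimum: ", best_for_squared)
print("absolute-loss optimum:", best_for_absolute)
```

The squared-loss optimum lands far from every typical value, while the absolute-loss optimum stays with the bulk of the data. In practice this usually means switching to a robust objective (XGBoost offers options such as `reg:pseudohubererror`, depending on the version) or transforming the target, for example with a log.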
6. When simpler baselines already solve the problem
Do not use complex models when:
• Linear or logistic regression performs competitively
• Feature engineering explains most variance
• Gains from complexity are marginal
Why:
• Simpler models are easier to debug, maintain, and trust
Conclusion
Most machine learning interviews are not about algorithms. They are about judgment.
When interviewers ask about loss functions, missing data, imbalance, assumptions, or failure modes, they are not checking recall. They are checking whether you understand how models behave when they meet real data: noisy, incomplete, high-dimensional, and imperfect.
If you want to prepare further and go deeper into interview-focused machine learning concepts, trade-offs, and real-world reasoning, please follow Interview Prep for more resources and upcoming posts.


