How do I know if my model is overfitting?

A common sign is a large gap between training and validation/test results: the model scores high on training data but drops on unseen data. Confirm with cross‑validation and learning curves, and rule out data leakage.

Can more data fix overfitting?

Often, yes, if the added data is diverse and representative of production. More coverage reduces variance and makes it harder for the model to memorize noise, but evaluation and leakage checks still matter.

What’s the fastest way to address underfitting?

Start by increasing model capacity (or choosing a more flexible algorithm) and improving features. If you’re using strong regularization or very restrictive hyperparameters, relax them and re‑evaluate with a consistent validation setup.

Overfitting and Underfitting in Machine Learning

Updated on January 30, 2026 | 6 minutes read

Machine learning models do not fail only because of “bad algorithms”. Many real-world issues come down to generalization: how well a model performs on new, unseen data compared to the data it was trained on.

Two classic failure modes explain most generalization problems. Overfitting means the model learns the training set too well. Underfitting means it does not learn enough to be useful.

What overfitting and underfitting mean

Overfitting and underfitting describe a mismatch between model complexity, data quality, and the signal you want to learn. You usually see them in how training results compare with validation or test results: a large gap suggests overfitting, while poor results on both suggest underfitting.

Fixing the issue starts with a clear diagnosis. If you change the model before you confirm the failure mode, you can waste time tuning settings that do not address the root cause.

Overfitting

Overfitting happens when a model captures patterns that are specific to the training set, including noise and random fluctuations. It can look excellent during training, then drop sharply on new data.

Overfitting is often a high-variance problem. Small changes in the training split can cause big changes in metrics and even in the model’s predictions.

Common signs of overfitting:

Very strong performance on training data, but weaker results on validation or test data
Metrics that vary widely across folds or across different random seeds
Predictions that are unstable when inputs change slightly
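
To make the first sign concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data is synthetic, 30% of labels are flipped at random, and the "model" is a 1-nearest-neighbor classifier that simply memorizes its training points. Because it memorizes the noise as well, it scores perfectly on the data it has seen and noticeably worse on fresh samples.

```python
import random

def make_noisy_data(n, flip_prob, rng):
    """Label is 1 when x > 0.5, but each label is flipped with probability flip_prob."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < flip_prob:
            label = 1 - label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the closest training point (pure memorization)."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Fraction of points in data that the memorizer labels correctly."""
    hits = sum(predict_1nn(train, x) == label for x, label in data)
    return hits / len(data)

rng = random.Random(0)
train = make_noisy_data(200, flip_prob=0.3, rng=rng)
validation = make_noisy_data(200, flip_prob=0.3, rng=rng)

train_acc = accuracy(train, train)
val_acc = accuracy(train, validation)
print(f"train accuracy:      {train_acc:.2f}")
print(f"validation accuracy: {val_acc:.2f}")
print(f"gap:                 {train_acc - val_acc:.2f}")
```

The size of the gap depends on the noise level and the seed, so treat the printed numbers as an illustration rather than a benchmark.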

Underfitting

Underfitting happens when the model is too simple to capture relationships in the data. It performs poorly on both training and unseen data because it misses the signal.

Underfitting is typically a high-bias problem. The model makes overly simple assumptions and cannot represent the true pattern, even if you train longer.

Common signs of underfitting:

Low performance on both training and validation or test data
Learning curves that plateau early and do not meaningfully improve
Residual errors with clear structure (a hint that important patterns were missed)
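
The last sign, structured residuals, can be checked with a few lines of code. The sketch below is a toy example under stated assumptions: the data is noise-free and truly quadratic, the model is a straight line, and the helper names `fit_line` and `pearson` are made up for this illustration. It fits the line, then measures how strongly the leftover errors track a pattern the line cannot represent.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

# Toy data (assumed): the true relationship is quadratic, with no noise.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# A candidate pattern that the straight line cannot represent.
squared = [x ** 2 for x in xs]
residual_corr = pearson(residuals, squared)
print("residuals:", residuals)
print("correlation of residuals with x^2:", round(residual_corr, 3))
```

On this noise-free toy data the residuals line up perfectly with the squared term. On real data the correlation will be weaker, but a clearly non-zero value is still a hint that the model is missing a pattern.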

Why this happens: the bias-variance trade-off

Most model choices sit on a spectrum between bias and variance. If you push complexity too high, you often reduce bias but increase variance, which raises the risk of overfitting.

If you simplify too much, variance drops but bias rises. That increases the risk of underfitting, where the model cannot learn the task.

Your goal is balance. Pick the simplest approach that captures the signal reliably and remains stable across new data.
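
One way to see the trade-off is to sweep a single complexity knob and watch both errors. The sketch below is an illustration, not a recipe: it assumes synthetic noisy sine data and uses k-nearest-neighbor regression, where a small `k` gives a very flexible model and `k` equal to the training set size collapses to predicting the training mean everywhere.

```python
import math
import random

def make_data(n, rng):
    """Noisy samples of a sine wave on [0, 2*pi] (synthetic, assumed)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 2 * math.pi)
        data.append((x, math.sin(x) + rng.gauss(0, 0.3)))
    return data

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of k-NN predictions on data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(42)
train = make_data(60, rng)
validation = make_data(60, rng)

# Small k: flexible model (low bias, high variance).
# k = len(train): predicts the training mean everywhere (high bias, low variance).
results = {}
for k in (1, 5, 15, len(train)):
    results[k] = (mse(train, train, k), mse(train, validation, k))
    print(f"k={k:>2}  train MSE={results[k][0]:.3f}  validation MSE={results[k][1]:.3f}")
```

With `k=1` the training error is zero because every point is its own nearest neighbor, which is memorization rather than learning. At the other extreme the model ignores the input entirely. The useful settings usually sit somewhere in between.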

How to diagnose the problem

Before changing algorithms, confirm where the model is failing. A quick diagnosis can save hours of tuning that cannot fix a data or split issue. Start by separating training behavior from generalization behavior.

1) Compare training vs validation performance

If training scores are high but validation or test scores are much lower, overfitting is likely. If both are low, underfitting or dataset quality issues are more likely.

If both are unusually high, treat it as a potential data leakage warning. Leakage can make results look great while hiding real performance issues that appear later in production.
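
The rules above can be written down as a small triage helper. This is only a sketch: the function name `diagnose` and all three thresholds are assumptions chosen for illustration, and sensible values depend on the task and the metric.

```python
def diagnose(train_score, val_score, low=0.70, gap=0.10, suspicious=0.99):
    """Map a pair of scores in [0, 1] to a first-guess failure mode.

    The thresholds are illustrative assumptions; tune them per task and metric.
    """
    if train_score >= suspicious and val_score >= suspicious:
        return "check for data leakage"
    if train_score - val_score > gap:
        return "likely overfitting"
    if train_score < low and val_score < low:
        return "likely underfitting or data quality issue"
    return "no obvious generalization problem"

for scores in [(0.98, 0.74), (0.61, 0.59), (0.999, 0.998), (0.88, 0.86)]:
    print(scores, "->", diagnose(*scores))
```

The output is a starting hypothesis to investigate, not a verdict. The learning curves and dataset checks in the next steps are what confirm it.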

2) Plot learning curves

Learning curves show how performance changes with more data or more training. They help you decide whether to collect more data, increase model capacity, or add regularization.

Typical patterns:

Overfitting: training improves, validation stalls or worsens
Underfitting: both curves are poor and close together
Data limitation: both improve with more data, but validation lags training
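
A learning curve over training set size can be computed by refitting the same model on growing slices of the training data. The sketch below assumes synthetic linear data with Gaussian noise and a straight-line model, so it shows the healthy case: training error rises from near zero toward the noise level while validation error falls toward it. The helper names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def line_mse(slope, intercept, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(7)

def sample(n):
    """Synthetic data (assumed): y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys

train_x, train_y = sample(200)
val_x, val_y = sample(200)

curve = []
for size in (2, 5, 20, 50, 200):
    # Refit on the first `size` training points, always score on the same validation set.
    slope, intercept = fit_line(train_x[:size], train_y[:size])
    train_err = line_mse(slope, intercept, train_x[:size], train_y[:size])
    val_err = line_mse(slope, intercept, val_x, val_y)
    curve.append((size, train_err, val_err))
    print(f"n={size:>3}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```

Plotting `curve` with your preferred charting tool gives the picture described above. If the two errors stayed far apart even at the largest size, that would point toward overfitting. If both stayed high and close together, it would point toward underfitting.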

3) Check for “silent” dataset problems

Some issues look like overfitting or underfitting, but are caused by the dataset. They often come from how data was split, labeled, or processed. Catching them early is one of the highest-impact steps you can take.

Watch for:

Leakage from the target into features (directly or through proxies)
Splits that are not representative (especially in time series or grouped data)
Label noise, inconsistent annotation, or heavy class imbalance
Different preprocessing between training and inference pipelines
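
For grouped data, one common safeguard is to split by group rather than by row, so the same user, patient, or device never appears on both sides. The following is a minimal sketch: the function `group_split`, the `user_id` field, and the records themselves are all hypothetical.

```python
import random

def group_split(records, group_key, val_fraction=0.25, seed=0):
    """Split so that every group lands entirely in train or entirely in validation."""
    groups = sorted({record[group_key] for record in records})
    random.Random(seed).shuffle(groups)
    n_val = max(1, round(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])
    train = [r for r in records if r[group_key] not in val_groups]
    validation = [r for r in records if r[group_key] in val_groups]
    return train, validation

# Hypothetical records: several rows per user, so a row-level random split
# could put the same user on both sides.
records = [{"user_id": f"u{i % 8}", "row": i} for i in range(40)]

train, validation = group_split(records, "user_id")
train_users = {r["user_id"] for r in train}
val_users = {r["user_id"] for r in validation}
print("train users:     ", sorted(train_users))
print("validation users:", sorted(val_users))
print("overlap:         ", train_users & val_users)
```

The same idea applies to time series, where the split is usually made on a cutoff date instead of a group identifier.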

How to reduce overfitting

Overfitting rarely improves with a single change. It usually gets better when you combine stronger evaluation, simpler modeling, and appropriate regularization.

The goal is to reduce variance without destroying the useful signal.

Improve evaluation and data hygiene

Use a proper split and keep a true test set untouched until the end
Use cross-validation, especially for smaller datasets or high-variance models
Prevent leakage by auditing features, timestamps, IDs, and preprocessing
Validate on data that resembles what you will see after launch
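
As a rough sketch of the cross-validation step, the code below splits a small synthetic dataset into five folds, fits a straight line on four of them, and scores on the fifth. The data, the model, and the helper names are assumptions for illustration. The point is the pattern: every row is used for validation exactly once, and you report both the mean and the spread.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each index is validated exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data (assumed): y = 3x plus Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [3 * x + rng.gauss(0, 2) for x in xs]

fold_errors = []
for train_idx, val_idx in k_fold_indices(len(xs), k=5):
    slope, intercept = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    err = sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in val_idx) / len(val_idx)
    fold_errors.append(err)

print("per-fold MSE:", [round(e, 2) for e in fold_errors])
print("mean:        ", round(statistics.mean(fold_errors), 2))
print("std dev:     ", round(statistics.stdev(fold_errors), 2))
```

A large standard deviation relative to the mean is itself a warning sign of a high-variance model. Note that a plain shuffled split like this one is not appropriate for time series or grouped data.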

Reduce effective model complexity

Feature selection: keep only features that add a measurable signal
Dimensionality reduction: methods like PCA can help with correlated features
Simplify the model when data is limited
Constrain tree models (depth, minimum samples per leaf, and related settings)

Add regularization and training controls

L1 or L2 regularization: penalize overly complex solutions
Dropout (neural networks): encourages more robust representations
Early stopping: stop when validation performance stops improving
Data augmentation: for vision, audio, or text, create safe input variations
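
Of these controls, early stopping is the easiest to show in isolation. The sketch below applies one common patience-based rule to a list of per-epoch validation losses. The function name, the `patience` and `min_delta` parameters, and the loss values are assumptions made up for this example, not the behavior of any particular framework.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has gone `patience` epochs without
    improving on the best value by more than min_delta.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Never triggered: training ran to the final epoch.
    return best_epoch, len(val_losses) - 1

# Made-up validation losses: improvement, then a slow rise as the model overfits.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"restore weights from epoch {best_epoch}, stopped at epoch {stop_epoch}")
```

In a real training loop you would also save a checkpoint whenever the best loss improves, so the weights from `best_epoch` can be restored after stopping.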

When “more data” is the best fix

If your dataset is small or narrow, complexity becomes risky. Adding diverse, representative data often improves generalization more than hyperparameter tuning alone.

Even modest increases in coverage can reduce variance. Focus on variety that matches real-world conditions, not just volume.

How to reduce underfitting

Underfitting is a sign that the model cannot express what the task requires. Fixes usually involve increasing capacity, relaxing constraints, or improving features.

Aim to add a usable signal without introducing leakage.

Increase capacity or flexibility

Use a more expressive model (often non-linear instead of linear)
Increase model size carefully (more trees, deeper networks, additional interactions)
Train longer if optimization has not converged
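
The last point is easy to overlook: a model with enough capacity can still look underfit if training stops before the optimizer converges. Here is a toy sketch, assuming a one-parameter model `y = w * x`, noise-free data, and plain gradient descent with a hand-picked learning rate.

```python
def train(xs, ys, steps, lr=0.01):
    """Gradient descent on mean squared error for the model y = w * x."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        grad = 2 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, loss

# Toy data (assumed): the true weight is 3.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

for steps in (5, 50, 500):
    w, loss = train(xs, ys, steps)
    print(f"steps={steps:>3}  w={w:.4f}  loss={loss:.6f}")
```

If the training loss is still falling when you stop, more steps (or a better learning rate) is a cheaper fix than a bigger model.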

Reduce constraints that are too strong

Lower regularization strength if it is suppressing a useful signal
Relax restrictive hyperparameters (for example, a max depth set too low)
Revisit aggressive feature reduction that removed important information
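
To illustrate the first item, the sketch below fits a one-parameter model `y = w * x` with an L2 penalty, minimizing the sum of squared errors plus `lam * w**2`. The data is a noise-free toy set with a true slope of 2, and the function names are hypothetical. As the penalty grows, the fitted slope is pulled toward zero and the training error climbs.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form L2-regularized slope for y = w * x (no intercept).

    Minimizes sum((w * x - y)^2) + lam * w^2.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def train_mse(xs, ys, w):
    """Mean squared error of y = w * x on the training data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (assumed): the true slope is 2, with no noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for lam in (0.0, 30.0, 270.0):
    w = ridge_slope(xs, ys, lam)
    print(f"lambda={lam:>5}  slope={w:.2f}  train MSE={train_mse(xs, ys, w):.2f}")
```

When a sweep like this shows training error rising steadily with the penalty, the regularization is probably stronger than the data warrants.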

Improve the inputs

Add better features (domain signals often matter more than algorithms)
Handle missing values and outliers consistently
Verify labels match the decision you want the model to learn

A quick checklist before you ship

This checklist helps catch common causes of poor generalization. It also makes it easier to communicate model readiness to your team. Run it before you declare a model “done”.

Metrics are measured on a split that matches production reality
Strong baselines are included (simple models, majority class, or rule-based checks)
Variance is measured (multiple folds or multiple random seeds)
Leakage risks are reviewed (IDs, timestamps, target proxies)
Final evaluation is done once on a locked test set
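
The baseline item is quick to automate. As a sketch, the snippet below computes the accuracy of always predicting the most common training label. The labels are hypothetical and imbalanced on purpose, and `model_acc` is a placeholder standing in for a real model's score.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in test_labels)
    return majority_label, hits / len(test_labels)

# Hypothetical imbalanced labels: 90% negative in both splits.
train_labels = [0] * 90 + [1] * 10
test_labels = [0] * 45 + [1] * 5

label, baseline_acc = majority_baseline_accuracy(train_labels, test_labels)
model_acc = 0.91  # placeholder for a real model's test accuracy
print(f"majority label: {label}")
print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model lift over baseline: {model_acc - baseline_acc:.2f}")
```

On data this imbalanced, a model at 91% accuracy is barely ahead of doing nothing, which is why the baseline belongs next to every reported metric.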

Learn machine learning with Code Labs Academy

If you want guided practice with these concepts, explore Code Labs Academy’s Data Science and AI program. It is designed to strengthen fundamentals through structured learning and hands-on projects.
