Deep Learning for Flood Risk Mapping with Sentinel‑2 in Python
Updated on February 21, 2026 16 minutes read
Floods aren’t just “water on a map.” They’re a fast-moving interaction between rainfall, terrain, rivers, infrastructure, and people, where errors can translate into delayed evacuations, misrouted aid, and underestimated impacts on livelihoods.
The technical ingredients for high-resolution flood mapping are now widely accessible. Sentinel‑2 imagery is open, geospatial tooling has become cloud-native, and U‑Net-style segmentation models are practical for environmental teams who already work in Python.
This deep dive is for environmental data scientists, geospatial engineers, and ML practitioners who want to build an end‑to‑end pipeline for flood extent segmentation from Sentinel‑2 and then translate those outputs into risk-oriented mapping products used in emergency management and climate adaptation.
By the end, you’ll understand how to choose Sentinel‑2 bands for water detection, how to train a U‑Net‑style segmentation model, how to evaluate it using rare-event metrics, and how to package outputs with uncertainty so they support real decision workflows.
Background and prerequisites for Sentinel‑2 flood mapping

This tutorial assumes you’re comfortable writing Python, working with NumPy-like arrays, and reading structured datasets from disk. You should also know the basics of training a model in PyTorch, including a training loop and validation.
Some geospatial familiarity helps, but you don’t need to be a GIS specialist. You mainly need to understand that satellite imagery is raster data on a grid, and that alignment, resolution, and nodata handling can make or break model performance.
On the domain side, it helps to distinguish flood extent mapping from flood risk mapping. Extent is a hazard observation, while risk adds exposure and vulnerability, and it must be communicated with uncertainty and operational constraints.
Flood risk mapping: from hazard to action
A common framing in disaster risk management is that risk depends on hazard, exposure, and vulnerability. A compact way to write that idea is:
$$\text{Risk} = \text{Hazard} \times \text{Exposure} \times \text{Vulnerability}$$
A segmentation model gives you hazard extent: where floodwater is likely present at a given time. Risk mapping becomes possible when you intersect that hazard layer with exposure data such as population grids, building footprints, roads, croplands, and critical infrastructure.
This interdisciplinary jump matters because stakeholders do not make decisions from raw probability rasters. They make decisions from summaries, thresholds, “unknown due to cloud” flags, and prioritized areas under real constraints like limited field teams and limited time.
Sentinel‑2 essentials for flood detection
Sentinel‑2’s MSI instrument provides 13 spectral bands sampled at 10 m, 20 m, and 60 m resolutions. Those bands span visible, near‑infrared (NIR), and short‑wave infrared (SWIR), which are particularly informative for separating water from vegetation and soil.
In many applications, Sentinel‑2 Level‑2A (bottom‑of‑atmosphere reflectance) is preferred because it includes atmospheric correction and companion products used for scene interpretation. Sentinel Hub’s documentation describes Sentinel‑2 L2A as BOA reflectance derived from L1C using Sen2Cor and notes related classification and probability layers.
Optical imagery has a hard limitation: it cannot see through thick clouds. Flood events often coincide with cloud cover, so “no flood detected” can easily be misread as “no flood occurred” if you don’t carry observability through the pipeline.
Core intuition: why deep learning helps and where it fails
Flood mapping often starts with spectral indices like NDWI or MNDWI. These are fast, interpretable, and surprisingly effective in clear conditions for many water bodies.
Index thresholding, however, is mostly pixelwise. It struggles when the scene contains confusing dark regions like shadows, certain urban materials, burned land, or cloud shadows, and it has difficulty incorporating spatial context like river connectivity and floodplain morphology.
Deep learning segmentation models help because they learn patterns plus context. A dark patch connected to a river system and shaped by terrain often looks different from a dark rooftop shadow, and a convolutional model can learn those differences from data.
MNDWI as a strong baseline (and what it’s doing)
ESA training material describes MNDWI as a ratio between Sentinel‑2 band 3 (green) and band 11 (SWIR), often used for flood mapping with thresholding workflows.
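A common definition is:
$$\text{MNDWI} = \frac{\rho_{\text{green}} - \rho_{\text{SWIR}}}{\rho_{\text{green}} + \rho_{\text{SWIR}}} = \frac{B3 - B11}{B3 + B11}$$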
MNDWI works because water tends to be dark in SWIR relative to many land surfaces, while green reflectance behaves differently across water, vegetation, and soil. The limitation is that “dark in SWIR” is not unique to water, especially in urban scenes and shadowed regions.
The cloud problem is not optional
In the WorldFloods work (Portalés‑Julià et al., Scientific Reports 2023), the authors report that around half of the pixels are marked cloudy in the first available Sentinel‑2 revisit, occurring on average about 1.3 days after the flood date.
That statistic is a practical argument for treating flood mapping as a multi-output segmentation problem. If you can’t say “unobservable due to cloud,” your hazard product can quietly turn uncertainty into false certainty.
Why U‑Net-style segmentation fits this problem
U‑Net is an encoder–decoder model with skip connections. The encoder captures broader context via downsampling, while the decoder reconstructs full-resolution predictions, and skip connections preserve boundary information.
This structure fits flood mapping because flood extent is both contextual and geometric. You need broad context to avoid confusing shadows with water, and you need crisp boundaries to support downstream GIS intersections like “flooded road length” or “flooded building count.”
Losses and metrics that match rare-event hazards
Flood pixels are often a minority. Naïve accuracy can look high even if the model predicts “no flood” nearly everywhere, so you need losses and metrics that penalize missing positives.
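A quick sketch makes the imbalance problem concrete. The tile size and the 2% flood fraction are made-up numbers, and `binary_scores` is a helper introduced here for illustration, not something from ml4floods.

```python
import numpy as np

def binary_scores(pred, target):
    """Accuracy, recall, and IoU for two boolean arrays of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.sum(pred & target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    tn = np.sum(~pred & ~target)
    accuracy = (tp + tn) / pred.size
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0.0
    return float(accuracy), float(recall), float(iou)

# Synthetic 100x100 tile in which 2% of pixels are flooded (illustrative only).
target = np.zeros((100, 100), dtype=bool)
target[:2, :] = True

# A "model" that never predicts flood anywhere.
all_dry = np.zeros_like(target)

acc, rec, iou = binary_scores(all_dry, target)
print(f"accuracy={acc:.2f} recall={rec:.2f} iou={iou:.2f}")
```

The all-dry prediction scores 98% accuracy while finding none of the flooded pixels, which is why recall and IoU are the numbers worth watching.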
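Weighted binary cross-entropy is one practical tool:
$$\mathcal{L}_{\text{wBCE}} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, w_{+}\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big]$$
Here $y_i \in \{0, 1\}$ is the label of pixel $i$, $p_i$ is the predicted probability, $N$ is the number of valid pixels, and $w_{+} > 1$ up-weights the rare positive class.
Dice overlap is another tool that aligns better with “map overlap” thinking:
$$\text{Dice} = \frac{2\sum_{i} p_i\, y_i}{\sum_{i} p_i + \sum_{i} y_i}, \qquad \mathcal{L}_{\text{Dice}} = 1 - \text{Dice}$$
For evaluation, Intersection-over-Union (IoU) is a common segmentation metric:
$$\text{IoU} = \frac{|P \cap G|}{|P \cup G|} = \frac{TP}{TP + FP + FN}$$
where $P$ is the set of predicted positive pixels and $G$ is the set of ground-truth positive pixels.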
In emergency response, the cost of false negatives can be higher than false positives, which pushes you toward thresholds and loss weightings that protect recall. In damage assessment and reporting, precision becomes more important to avoid overstating impacts.
Hands-on implementation: train a U‑Net-style model on WorldFloods v2
We’ll use the WorldFloods v2 dataset because it provides Sentinel‑2 imagery paired with flood segmentation masks and explicitly represents clouds, which is essential for optical flood mapping.
The ML4Floods documentation describes WorldFloods as containing 509 pairs of Sentinel‑2 images and flood segmentation masks, and it explains that the v2 ground truth is organized into two mask channels: clear/cloud and land/water.
Environment setup
A practical stack for this tutorial uses PyTorch for modeling, rasterio for raster IO, and ml4floods to avoid re-implementing dataset handling.
pip install torch torchvision rasterio numpy pandas tqdm matplotlib
pip install ml4floods
Download WorldFloods v2 from Hugging Face
The ML4Floods docs show a huggingface-cli command to download the dataset to a local directory.
huggingface-cli download \
    --cache-dir /path/to/cachedir \
    --local-dir /path/to/localdir/WorldFloodsv2 \
    --repo-type dataset isp-uv-es/WorldFloodsv2
The dataset is large, so plan storage accordingly. The official documentation also notes that the dataset is released under a Creative Commons non-commercial license, which matters if your work is commercial.
Load the dataset with ml4floods
The tutorial workflow is: build a split JSON from the metadata CSV, load a default config, and create dataloaders.
import os
import json

import pandas as pd
import ml4floods
from ml4floods.models.config_setup import get_default_config
from ml4floods.models.dataset_setup import get_dataset

DATASET_PATH = "/path/to/localdir/WorldFloodsv2"
CONFIG_PATH = os.path.join(
    os.path.dirname(ml4floods.__file__),
    "models/configurations/worldfloods_template_v2.json"
)
CSV_PATH = os.path.join(DATASET_PATH, "dataset_metadata.csv")
JSON_SPLITS_PATH = os.path.join(DATASET_PATH, "train_test_split_from_csv.json")


def convert_metadata_csv_to_json():
    # Build {split: {modality: [paths]}} from the metadata CSV.
    out = {}
    modalities = ["S2", "gt"]
    df = pd.read_csv(CSV_PATH)
    for split in df["split"].unique():
        out[split] = {}
        event_ids = df[df["split"] == split]["event id"].tolist()
        for mod in modalities:
            out[split][mod] = [
                os.path.join(DATASET_PATH, split, mod, f"{eid}.tif")
                for eid in event_ids
            ]
    with open(JSON_SPLITS_PATH, "w") as f:
        json.dump(out, f, indent=2)


convert_metadata_csv_to_json()

# Point the default config at the local copy of the dataset.
config = get_default_config(CONFIG_PATH)
config.data_params.loader_type = "local"
config.data_params.bucket_id = None
config.data_params.path_to_splits = DATASET_PATH
config.data_params.train_test_split_file = JSON_SPLITS_PATH

dm = get_dataset(config.data_params)
dm.prepare_data()
train_dl = dm.train_dataloader()
val_dl = dm.val_dataloader()

# Sanity check: inspect one batch before training.
batch = next(iter(train_dl))
print(batch["image"].shape, batch["mask"].shape)
The dataset tutorial describes batches as a dict with a normalized 13-band image tensor and a 2-channel mask tensor. It also defines the mask encoding explicitly, with invalid pixels labeled as 0.
Convert masks into training targets and ignore invalid pixels
A clean approach is to train two binary heads: one for water probability and one for cloud probability. You compute loss and metrics only where labels are valid.
import torch


def unpack_targets(mask):
    """
    mask: [B, 2, H, W]
    channel 0: cloud (0 invalid, 1 clear, 2 cloud)
    channel 1: water (0 invalid, 1 land, 2 water)
    """
    cloud_raw = mask[:, 0]
    water_raw = mask[:, 1]

    # Pixels labeled 0 carry no supervision and are excluded from loss and metrics.
    cloud_valid = cloud_raw != 0
    water_valid = water_raw != 0

    # Binary targets: 1.0 for the positive class (cloud / water), 0.0 otherwise.
    cloud_target = (cloud_raw == 2).float()
    water_target = (water_raw == 2).float()
    return cloud_target, water_target, cloud_valid, water_valid
This design makes “unobservable” explicit. It reduces the temptation to force predictions where the dataset provides no supervision, and it aligns better with how flood products should be used in real decisions.
Define a compact U‑Net in PyTorch
This is a readable U‑Net implementation that follows the core encoder–decoder pattern with skip connections. It outputs two logits per pixel: one for cloud and one for water.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class UNet(nn.Module):
    def __init__(self, in_channels=13, base=32, out_channels=2):
        super().__init__()
        self.pool = nn.MaxPool2d(2)

        # Encoder: channels double at each level.
        self.enc1 = ConvBlock(in_channels, base)
        self.enc2 = ConvBlock(base, base * 2)
        self.enc3 = ConvBlock(base * 2, base * 4)
        self.enc4 = ConvBlock(base * 4, base * 8)
        self.bottleneck = ConvBlock(base * 8, base * 16)

        # Decoder: each ConvBlock sees upsampled features plus the skip tensor.
        self.up4 = nn.ConvTranspose2d(base * 16, base * 8, 2, 2)
        self.dec4 = ConvBlock(base * 16, base * 8)
        self.up3 = nn.ConvTranspose2d(base * 8, base * 4, 2, 2)
        self.dec3 = ConvBlock(base * 8, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, 2)
        self.dec2 = ConvBlock(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, 2)
        self.dec1 = ConvBlock(base * 2, base)

        # Two logits per pixel: [cloud_logit, water_logit]
        self.head = nn.Conv2d(base, out_channels, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))
        b = self.bottleneck(self.pool(e4))

        d4 = self.up4(b)
        d4 = self.dec4(torch.cat([d4, e4], dim=1))
        d3 = self.up3(d4)
        d3 = self.dec3(torch.cat([d3, e3], dim=1))
        d2 = self.up2(d3)
        d2 = self.dec2(torch.cat([d2, e2], dim=1))
        d1 = self.up1(d2)
        d1 = self.dec1(torch.cat([d1, e1], dim=1))
        return self.head(d1)
This structure maps well to flood segmentation because it preserves boundary detail while learning broader context needed to disambiguate water from visually similar artifacts like shadows and some dark surfaces.
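One practical consequence of the four pooling stages above is that input height and width should be multiples of 16. Otherwise the upsampled decoder tensors no longer match the skip tensors and `torch.cat` fails. The following is a minimal NumPy sketch of padding a scene to a compatible size and cropping the prediction back afterward. `pad_to_multiple` and `crop_to_original` are hypothetical helper names, and reflect padding is just one reasonable choice.

```python
import numpy as np

def pad_to_multiple(image, multiple=16):
    """Pad a [C, H, W] array on the bottom/right so H and W divide by `multiple`.

    Returns the padded array and the original (H, W) for later cropping.
    """
    _, h, w = image.shape
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    padded = np.pad(image, ((0, 0), (0, pad_h), (0, pad_w)), mode="reflect")
    return padded, (h, w)

def crop_to_original(pred, original_hw):
    """Undo pad_to_multiple on a [..., H, W] prediction array."""
    h, w = original_hw
    return pred[..., :h, :w]

# Example: a 13-band scene whose size is not a multiple of 16.
scene = np.random.rand(13, 100, 130).astype(np.float32)
padded, hw = pad_to_multiple(scene)
print(padded.shape)  # height and width are now multiples of 16

# A stand-in for model output with two channels on the padded grid.
fake_logits = np.zeros((2,) + padded.shape[1:], dtype=np.float32)
print(crop_to_original(fake_logits, hw).shape)
```

Fixed-size training patches such as 256×256 already satisfy this constraint, so the padding mainly matters at inference time on arbitrary AOIs.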
Loss functions: weighted BCE plus Dice, masked by validity
We compute loss only over valid pixels and combine weighted BCE with Dice loss to handle imbalance and overlap.
import torch
import torch.nn.functional as F


def masked_bce_with_logits(logits, targets, valid_mask, pos_weight=None):
    # Boolean indexing keeps only pixels that carry a label.
    logits = logits[valid_mask]
    targets = targets[valid_mask]
    if pos_weight is None:
        return F.binary_cross_entropy_with_logits(logits, targets)
    return F.binary_cross_entropy_with_logits(logits, targets, pos_weight=pos_weight)


def masked_dice_loss(logits, targets, valid_mask, eps=1e-6):
    probs = torch.sigmoid(logits)
    probs = probs[valid_mask]
    targets = targets[valid_mask]
    intersection = (probs * targets).sum()
    denom = probs.sum() + targets.sum() + eps
    dice = (2 * intersection + eps) / denom
    return 1 - dice
In flood mapping, weighting the water task more heavily than clouds can be justified by the downstream goal. The water mask becomes the hazard layer you intersect with exposure, while cloud prediction mainly supports uncertainty handling and observability.
Training loop with IoU for water and cloud
This loop trains for a few epochs and logs IoU metrics for both tasks, masking invalid pixels in metric computation.
import torch
from torch.optim import AdamW
from tqdm import tqdm


def iou_from_logits(logits, targets, valid_mask, threshold=0.5, eps=1e-6):
    probs = torch.sigmoid(logits)
    preds = (probs > threshold).float()
    preds = preds[valid_mask]
    targets = targets[valid_mask]
    intersection = (preds * targets).sum()
    union = preds.sum() + targets.sum() - intersection
    return (intersection + eps) / (union + eps)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = UNet(in_channels=13, base=32, out_channels=2).to(device)
opt = AdamW(model.parameters(), lr=2e-4, weight_decay=1e-4)

# Relative importance of the two tasks in the combined loss.
water_weight = 0.8
cloud_weight = 0.2

for epoch in range(1, 6):
    model.train()
    train_loss = 0.0
    for batch in tqdm(train_dl, desc=f"Epoch {epoch} [train]"):
        x = batch["image"].to(device)
        mask = batch["mask"].to(device)
        cloud_t, water_t, cloud_valid, water_valid = unpack_targets(mask)

        logits = model(x)
        cloud_logit = logits[:, 0]
        water_logit = logits[:, 1]

        # Simple per-batch pos_weight estimates (clamped for stability)
        with torch.no_grad():
            water_pos = water_t[water_valid].sum().clamp_min(1.0)
            water_neg = water_valid.sum() - water_pos
            w_pos = (water_neg / water_pos).clamp(1.0, 50.0)
            cloud_pos = cloud_t[cloud_valid].sum().clamp_min(1.0)
            cloud_neg = cloud_valid.sum() - cloud_pos
            c_pos = (cloud_neg / cloud_pos).clamp(1.0, 50.0)

        loss_water = (
            masked_bce_with_logits(water_logit, water_t, water_valid, pos_weight=w_pos) +
            masked_dice_loss(water_logit, water_t, water_valid)
        )
        loss_cloud = (
            masked_bce_with_logits(cloud_logit, cloud_t, cloud_valid, pos_weight=c_pos) +
            masked_dice_loss(cloud_logit, cloud_t, cloud_valid)
        )
        loss = water_weight * loss_water + cloud_weight * loss_cloud

        opt.zero_grad()
        loss.backward()
        opt.step()
        train_loss += loss.item()

    # Validation: IoU is averaged over batches, with invalid pixels masked out.
    model.eval()
    val_iou_water = 0.0
    val_iou_cloud = 0.0
    n = 0
    with torch.no_grad():
        for batch in tqdm(val_dl, desc=f"Epoch {epoch} [val]"):
            x = batch["image"].to(device)
            mask = batch["mask"].to(device)
            cloud_t, water_t, cloud_valid, water_valid = unpack_targets(mask)
            logits = model(x)
            val_iou_water += iou_from_logits(logits[:, 1], water_t, water_valid).item()
            val_iou_cloud += iou_from_logits(logits[:, 0], cloud_t, cloud_valid).item()
            n += 1

    print(
        f"Epoch {epoch} | "
        f"train_loss={train_loss/max(1,len(train_dl)):.4f} | "
        f"val_iou_water={val_iou_water/max(1,n):.3f} | "
        f"val_iou_cloud={val_iou_cloud/max(1,n):.3f}"
    )
This gets you to a working training pipeline. The next improvements tend to be practical rather than architectural, such as better normalization checks, band subsets, smarter sampling of flood-heavy tiles, and robust post-processing with cloud-aware rules.
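As one example of those practical improvements, here is a rough sketch of sampling flood-heavy tiles more often. It assumes you have already computed the fraction of water pixels per training tile. `tile_sampling_weights` and the `floor` value are illustrative choices, not part of ml4floods.

```python
import numpy as np

def tile_sampling_weights(water_fractions, floor=0.05):
    """Turn per-tile water fractions into sampling probabilities.

    `floor` keeps dry tiles in the mix so the model still sees negatives.
    """
    w = np.asarray(water_fractions, dtype=np.float64) + floor
    return w / w.sum()

# Hypothetical water fractions for three tiles: dry, partly flooded, mostly flooded.
fractions = [0.0, 0.15, 0.75]
weights = tile_sampling_weights(fractions)
print(weights)

# Draw tile indices for one epoch according to those weights.
rng = np.random.default_rng(seed=0)
sampled = rng.choice(len(fractions), size=10, replace=True, p=weights)
print(sampled)
```

In a PyTorch pipeline, weights like these could be handed to `torch.utils.data.WeightedRandomSampler`. Whether oversampling helps depends on how skewed your tiles are, so compare validation IoU with and without it.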
Turning probabilities into usable flood and cloud masks
A model produces probabilities. Operational products require thresholds and uncertainty logic, and these thresholds are domain choices tied to the cost of errors.
import torch


@torch.no_grad()
def predict_masks(model, x, water_thr=0.5, cloud_thr=0.5):
    logits = model(x)
    # Channel order matches training: 0 = cloud, 1 = water.
    cloud_p = torch.sigmoid(logits[:, 0])
    water_p = torch.sigmoid(logits[:, 1])
    cloud_mask = cloud_p > cloud_thr
    water_mask = water_p > water_thr
    return cloud_p, water_p, cloud_mask, water_mask
In response settings, you often keep probability rasters and only threshold at the end. That lets different users choose different operating points, such as a high-recall threshold for response teams and a higher-precision threshold for reporting.
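To see how operating points trade off, a small threshold sweep is usually enough. The sketch below uses toy probabilities and labels invented for illustration. `precision_recall_at` is a hypothetical helper, and it only scores pixels marked valid.

```python
import numpy as np

def precision_recall_at(prob, target, valid, thresholds):
    """Return (threshold, precision, recall) tuples computed over valid pixels."""
    p = np.asarray(prob, dtype=np.float64)[valid]
    t = np.asarray(target, dtype=bool)[valid]
    rows = []
    for thr in thresholds:
        pred = p > thr
        tp = np.sum(pred & t)
        fp = np.sum(pred & ~t)
        fn = np.sum(~pred & t)
        # Convention: with no predicted (or no actual) positives, report 1.0.
        precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
        rows.append((float(thr), float(precision), float(recall)))
    return rows

# Toy example: six pixels, the last one has no label and is ignored.
prob = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.1])
target = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
valid = np.array([1, 1, 1, 1, 1, 0], dtype=bool)

curve = precision_recall_at(prob, target, valid, thresholds=[0.3, 0.5, 0.8])
for thr, prec, rec in curve:
    print(f"thr={thr:.1f} precision={prec:.2f} recall={rec:.2f}")
```

On this toy data the lowest threshold recovers every flooded pixel at the cost of one false alarm, while the highest threshold has no false alarms but misses two of the three flooded pixels. Response and reporting users would pick different rows of the same table.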
Keep yourself honest with an MNDWI baseline
Deep learning is not a replacement for remote sensing fundamentals. MNDWI is a strong baseline, and it’s often the quickest way to detect pipeline errors like incorrect band ordering or broken normalization.
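import torch

def mndwi_baseline(x, idx_green, idx_swir1, thr=0.0, eps=1e-6):
    # x: [B, C, H, W] reflectance (undo any dataset normalization first).
    # Confirm idx_green (B03) and idx_swir1 (B11) for your tensor's band ordering.
    green, swir1 = x[:, idx_green], x[:, idx_swir1]
    mndwi = (green - swir1) / (green + swir1 + eps)
    return (mndwi > thr).float()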
If your U‑Net performs worse than MNDWI in clear scenes, treat it as a signal that something is wrong in the pipeline before you assume the architecture is insufficient.
From flood extent segmentation to flood risk mapping outputs
A flood mask is a hazard observation. Flood risk mapping becomes possible when you intersect that hazard layer with exposure layers and communicate uncertainty so decisions remain grounded.
A practical operational output is a paired product: a flood probability raster and an observability raster (cloud probability or cloud mask). That pairing prevents “no observation due to cloud” from being misread as “no flood.”
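One way to make that pairing hard to misread is to also publish a single categorical raster with an explicit “unobserved” class. This is a minimal sketch. The class codes and the rule that cloud overrides water are assumptions for illustration, not a standard.

```python
import numpy as np

# Hypothetical class codes for the published raster.
NO_FLOOD, FLOOD, UNOBSERVED = 0, 1, 255

def observability_aware_map(water_p, cloud_p, water_thr=0.5, cloud_thr=0.5):
    """Combine water and cloud probabilities into one categorical uint8 map."""
    water_p = np.asarray(water_p)
    cloud_p = np.asarray(cloud_p)
    out = np.full(water_p.shape, NO_FLOOD, dtype=np.uint8)
    out[water_p > water_thr] = FLOOD
    # Applied last so that likely-cloudy pixels never show up as "no flood".
    out[cloud_p > cloud_thr] = UNOBSERVED
    return out

water_p = np.array([[0.9, 0.1],
                    [0.8, 0.2]])
cloud_p = np.array([[0.1, 0.1],
                    [0.9, 0.7]])
print(observability_aware_map(water_p, cloud_p))
```

The bottom row comes out as unobserved even though one of its pixels has a high water probability. That is the conservative behavior you want when clouds dominate a scene.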
In downstream workflows, you can compute flooded building counts, flooded road length, or flooded cropland area by intersecting predicted extent with exposure layers. These derived indicators are often what decision-makers actually consume, because they map directly to planning questions.
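When the exposure layer is itself a raster on the same grid, the intersection is just masked arithmetic. The sketch below assumes a population-count raster that is already aligned to the flood mask and a 10 m pixel size. `exposure_summary` is a hypothetical helper, and vector layers such as roads or building footprints would need a GIS library instead.

```python
import numpy as np

def exposure_summary(flood_mask, unobserved, population, pixel_size_m=10.0):
    """Summarize flooded area and exposed population on an aligned grid."""
    flood_mask = np.asarray(flood_mask, dtype=bool)
    unobserved = np.asarray(unobserved, dtype=bool)
    population = np.asarray(population, dtype=np.float64)

    # Only count pixels that were actually observed as flooded.
    flooded = flood_mask & ~unobserved
    pixel_area_km2 = (pixel_size_m ** 2) / 1e6
    return {
        "flooded_area_km2": float(flooded.sum() * pixel_area_km2),
        "exposed_population": float(population[flooded].sum()),
        "unobserved_fraction": float(unobserved.mean()),
    }

# Toy 2x2 grid with made-up values.
flood = np.array([[True, True], [False, False]])
cloudy = np.array([[False, True], [False, False]])
pop = np.array([[5.0, 7.0], [1.0, 2.0]])

summary = exposure_summary(flood, cloudy, pop)
print(summary)
```

Reporting the unobserved fraction next to each indicator tells the reader how much of the area the numbers actually cover.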
Portalés‑Julià et al. describe producing flood extent products and using them for impact-related workflows, emphasizing the importance of cloud handling and reliable mapping under imperfect observation.
Systems and production considerations for scalable flood mapping

Flood mapping at scale is as much data engineering as modeling. The main bottlenecks are usually data access, tiling, and orchestration rather than raw model complexity.
Querying Sentinel‑2 data via STAC for event-driven pipelines
A modern approach to accessing imagery is STAC, where you query by geometry and time rather than manually downloading scenes. The odc-stac notebook shows querying the Planetary Computer STAC API and searching the sentinel-2-l2a collection.
This matters because you can build workflows where a flood alert triggers a query, downloads only required assets for the AOI, and runs tiled inference immediately.
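As a rough illustration, an alert handler could turn an AOI and event time into a STAC item-search request body. The sketch below only builds the payload and makes no network call. `build_stac_search`, the ten-day window, and the cloud-cover cutoff are assumptions for illustration, and whether a given endpoint supports the `query` filter should be checked against its documentation.

```python
import json
from datetime import datetime, timedelta, timezone

def build_stac_search(bbox, event_time, days_after=10, max_cloud=80, limit=50):
    """Build a STAC item-search body for Sentinel-2 L2A scenes after an event.

    bbox is (min_lon, min_lat, max_lon, max_lat); event_time must be timezone-aware.
    """
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    start = event_time.astimezone(timezone.utc)
    end = start + timedelta(days=days_after)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "collections": ["sentinel-2-l2a"],
        "bbox": list(bbox),
        "datetime": f"{start.strftime(fmt)}/{end.strftime(fmt)}",
        # Scene-level cloud filter; kept loose because floods coincide with clouds.
        "query": {"eo:cloud_cover": {"lt": max_cloud}},
        "limit": limit,
    }

# Hypothetical AOI and alert time.
payload = build_stac_search(
    bbox=(10.0, 45.0, 11.0, 46.0),
    event_time=datetime(2024, 5, 1, tzinfo=timezone.utc),
)
print(json.dumps(payload, indent=2))
```

The same fields map onto the search arguments of STAC clients such as pystac-client, so the payload builder can be tested offline and the network call kept in a thin wrapper.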
Handling multi-resolution bands and alignment correctly
Sentinel‑2 bands arrive at different resolutions. If you stack them without resampling, your channels won’t line up spatially, and the model can learn nonsense correlations.
Many ML pipelines resample all bands to a single working resolution, often 10 m or 20 m. The right choice depends on your downstream needs, because 10 m detail matters for roads and settlements, while 20 m may be sufficient for regional flood extents and cheaper to compute.
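The idea can be sketched with plain NumPy for the simplest case: bands that cover exactly the same footprint, so a 20 m band has half as many pixels per side as a 10 m band. The helper names are hypothetical, and a real pipeline would normally use a georeferencing-aware resampler (for example in rasterio or GDAL) rather than this nearest-neighbour shortcut.

```python
import numpy as np

# Native resolution in metres of the Sentinel-2 MSI bands.
# (B10 is the cirrus band and is not distributed in L2A products.)
BAND_RES_M = {
    "B01": 60, "B02": 10, "B03": 10, "B04": 10, "B05": 20, "B06": 20,
    "B07": 20, "B08": 10, "B8A": 20, "B09": 60, "B10": 60, "B11": 20, "B12": 20,
}

def upsample_nearest(band, factor):
    """Repeat each pixel `factor` times along both axes."""
    return np.repeat(np.repeat(band, factor, axis=0), factor, axis=1)

def stack_at_10m(bands):
    """Stack a dict of {band_name: 2D array} onto a common 10 m grid."""
    layers = []
    for name, arr in bands.items():
        factor = BAND_RES_M[name] // 10
        layers.append(upsample_nearest(arr, factor))
    shapes = {layer.shape for layer in layers}
    if len(shapes) != 1:
        raise ValueError(f"bands do not share a footprint: {sorted(shapes)}")
    return np.stack(layers, axis=0)

# Toy example: a 4x4 green band (10 m) and a 2x2 SWIR band (20 m).
bands = {
    "B03": np.arange(16.0).reshape(4, 4),
    "B11": np.array([[1.0, 2.0], [3.0, 4.0]]),
}
stack = stack_at_10m(bands)
print(stack.shape)
print(stack[1])
```

The shape check is the useful part. A mismatch here is far cheaper to catch than a model quietly trained on misaligned channels.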
Tiling strategy and throughput constraints
Inference typically runs patchwise, often using 256×256 or 512×512 tiles, and then stitches predictions back into a georeferenced raster. Overlap tiling reduces boundary artifacts, and blending can smooth seams.
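A minimal version of overlap tiling with uniform averaging might look like the sketch below. `predict_fn` stands in for whatever wraps your model, the function names are hypothetical, and uniform averaging is the simplest blending choice. Distance-weighted windows are a common refinement.

```python
import numpy as np

def tile_starts(size, tile, stride):
    """Start offsets covering [0, size), with the last tile flush to the edge."""
    if size <= tile:
        return [0]
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] != size - tile:
        starts.append(size - tile)
    return starts

def tiled_predict(image, predict_fn, tile=256, overlap=32):
    """Run predict_fn on overlapping tiles of a [C, H, W] image and average.

    predict_fn maps a [C, h, w] patch to an [h, w] probability array.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    _, h, w = image.shape
    stride = tile - overlap
    acc = np.zeros((h, w), dtype=np.float64)
    cnt = np.zeros((h, w), dtype=np.float64)
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            patch = image[:, y:y + tile, x:x + tile]
            acc[y:y + tile, x:x + tile] += predict_fn(patch)
            cnt[y:y + tile, x:x + tile] += 1.0
    # Every pixel is covered at least once, so cnt is never zero.
    return acc / cnt

# Toy check with a pixelwise "model": stitching must reproduce the direct result.
rng = np.random.default_rng(0)
image = rng.random((3, 100, 130))
stitched = tiled_predict(image, lambda p: p.mean(axis=0), tile=64, overlap=16)
print(np.allclose(stitched, image.mean(axis=0)))
```

Testing the stitcher with a pixelwise function like this separates tiling bugs from model behavior, because any difference from the direct result has to come from the stitching logic.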
In production, you also care about latency. A slightly worse model that runs quickly may be more useful than a stronger model that produces results after the response window has passed.
Monitoring environmental ML under seasonality and geography shifts
Environmental data shifts naturally with seasonality, land cover changes, and regional differences. Monitoring should include per-band distribution checks, cloud frequency statistics, and output stability across revisits.
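A per-band check can start out very simple. The sketch below compares a scene's band means against reference statistics, for instance ones computed on the training set, and flags large deviations. `band_drift_report` and the z-score limit are illustrative assumptions, and a production monitor would likely compare full distributions rather than means.

```python
import numpy as np

def band_drift_report(ref_mean, ref_std, scene, valid=None, z_limit=3.0):
    """Flag bands whose scene mean is far from the reference mean.

    scene is [C, H, W]; valid is an optional [H, W] boolean mask of usable pixels.
    """
    c = scene.shape[0]
    flat = scene.reshape(c, -1)
    if valid is not None:
        flat = flat[:, np.asarray(valid, dtype=bool).reshape(-1)]
    scene_mean = flat.mean(axis=1)
    # Crude z-score of the scene mean against the reference pixel spread.
    z = (scene_mean - np.asarray(ref_mean)) / (np.asarray(ref_std) + 1e-12)
    return {"z": z, "drifted": np.abs(z) > z_limit}

# Toy two-band scene: band 0 matches the reference, band 1 is much brighter.
ref_mean = np.array([0.1, 0.2])
ref_std = np.array([0.05, 0.05])
scene = np.stack([np.full((4, 4), 0.1), np.full((4, 4), 0.5)])

report = band_drift_report(ref_mean, ref_std, scene)
print(report["drifted"])
```

A flagged band is a prompt to look at the scene, not proof of a problem. Snow, smoke, or a different processing baseline can all shift band statistics legitimately.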
Operational monitoring also includes observability metrics. If cloud cover spikes, a responsible system should surface that as a limitation, not silently produce confident-looking “no flood” maps.
Risk, ethics, safety, and governance for flood risk mapping

Flood maps are persuasive. Their visual clarity can create overconfidence, especially when uncertainty is not communicated explicitly.
The cloud limitation makes this a concrete safety risk. When large fractions of pixels are cloudy in early post-event scenes, a system that doesn’t carry an “unobservable” layer can inadvertently encourage incorrect conclusions.
Geographic bias is another practical risk. A global dataset improves coverage, but performance can still vary across land cover types like dense urban areas, wetlands, forest canopy, or snow/ice conditions. The mitigation is careful evaluation by strata and targeted fine-tuning where the model will actually be used.
Licensing and data governance also matter. WorldFloods v2 is described as non-commercial in the official documentation, and that should shape how teams plan any deployed or revenue-generating use.
Finally, privacy and security still matter even when the imagery is public. Risk products can become sensitive when joined with settlement, building, or population layers, and access control and auditing should be treated as basic operational hygiene.
Domain scenario: an end-to-end flood risk mapping workflow for a river basin
Consider a regional disaster management team responsible for a river basin with seasonal flooding and mixed urban and agricultural exposure. They need timely flood extent maps that help them prioritize field validation and allocate resources.
They run a STAC query after each Sentinel‑2 pass, download relevant bands for the basin, tile the AOI into patches, and run cloud-aware U‑Net inference. They stitch outputs into georeferenced probability rasters and publish both flood probability and cloud probability.
For response mode, they choose a water threshold that favors recall and rely on the cloud layer to flag uncertain regions where human validation is required. For damage assessment, they may raise the threshold to reduce false positives and focus on conservative impact estimates.
They then intersect flood extent with roads and settlements and compute summary indicators per administrative unit. Those summaries feed operational dashboards that integrate meteorological forecasts and local reports.
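If the administrative units are rasterized onto the same grid as the flood map, the per-unit summaries reduce to a few `np.bincount` calls. This sketch assumes integer unit IDs from 0 to `n_units - 1`. `summarize_by_unit` is a hypothetical helper, and the toy arrays are made up.

```python
import numpy as np

def summarize_by_unit(unit_ids, flood_mask, unobserved, n_units):
    """Per-unit flooded and unobserved pixel fractions on an aligned grid."""
    ids = np.asarray(unit_ids).ravel()
    flood_mask = np.asarray(flood_mask, dtype=bool)
    unobserved = np.asarray(unobserved, dtype=bool)

    # Flood counts only where the pixel was actually observed.
    observed_flood = (flood_mask & ~unobserved).ravel().astype(np.float64)
    unobs = unobserved.ravel().astype(np.float64)

    total = np.bincount(ids, minlength=n_units)
    flooded = np.bincount(ids, weights=observed_flood, minlength=n_units)
    hidden = np.bincount(ids, weights=unobs, minlength=n_units)

    safe_total = np.maximum(total, 1)  # avoid division by zero for empty units
    return {
        "flooded_fraction": flooded / safe_total,
        "unobserved_fraction": hidden / safe_total,
    }

# Toy grid with two units (0 on the top row, 1 on the bottom row).
unit_ids = np.array([[0, 0], [1, 1]])
flood = np.array([[True, False], [True, True]])
cloudy = np.array([[False, False], [False, True]])

stats = summarize_by_unit(unit_ids, flood, cloudy, n_units=2)
print(stats)
```

Publishing the unobserved fraction per unit lets a dashboard distinguish “little flooding observed” from “little of this unit was visible.”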
In this pipeline, the model is not the decision-maker. It is a measurement tool that becomes useful only when combined with domain-aware thresholds, exposure intersections, and uncertainty communication.
Skills mapping and learning path for environmental data science careers
This project develops practical Python and deep learning skills in a way that matches real work. You practice handling multi-band imagery, writing robust training loops, and debugging data issues that are common in geospatial pipelines.
You also learn to tie ML metrics to domain costs. Choosing thresholds, weighting losses, and reporting uncertainty are not just technical details; they determine whether a flood map supports safe decision-making.
On the systems side, working with STAC and tiled inference maps directly to scalable geospatial operations. These are the skills that often separate a notebook demo from a pipeline that can run on new events reliably.
If you want to extend this into a capstone-level project, multi-sensor fusion is a natural next step. Optical flood mapping is powerful, but SAR (like Sentinel‑1) is critical when clouds dominate, and combining sensors can substantially improve effective coverage.
Conclusion
Sentinel‑2 provides multi-spectral signals that are very useful for water detection, but optical flood mapping is constrained by clouds, especially right after major events. A responsible system treats observability as part of the output rather than a hidden assumption.
U‑Net-style segmentation models add value by learning spatial context and boundary detail that index thresholding can miss. That value becomes practical when you pair flood probability with cloud probability and carry both through post-processing and reporting.
Flood extent segmentation becomes flood risk mapping only after you intersect hazard outputs with exposure layers and communicate uncertainty clearly enough for real operational use. The interdisciplinary work is in the handoff between modeling, geospatial analysis, and decision constraints.
If you build one artifact from this tutorial, make it a paired, GIS-ready output: a flood probability raster and a cloud/observability raster for a real AOI, plus a simple summary of impacted roads and settlements with uncertainty flags.