Intersection over Union

What Is Intersection over Union?

Picture two rectangles of transparent colored film laid on a table -- one red (your prediction) and one blue (the ground truth). Where they overlap, you see purple. IoU asks: what fraction of the total colored area is purple? If the rectangles are perfectly aligned, the answer is 1.0 (100% overlap). If they do not touch at all, the answer is 0.0. This single number captures how well a predicted bounding box matches the true object location.

Technically, Intersection over Union (also called the Jaccard index for sets) is defined as:

$IoU(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$

where $A$ and $B$ are two bounding box regions. IoU ranges from 0 (no overlap) to 1 (perfect overlap). It is symmetric: $IoU (A, B) = IoU (B, A)$ .
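
To make the set definition concrete, here is a minimal sketch that treats each box as a set of integer pixel coordinates and applies the Jaccard index directly. The helper names `jaccard` and `box_pixels` are illustrative, and returning 0.0 for two empty sets is a convention chosen for this example.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |a & b| / |a | b| for two finite sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def box_pixels(x1: int, y1: int, x2: int, y2: int) -> set:
    """Set of integer pixel coordinates covered by the half-open box [x1, x2) x [y1, y2)."""
    return {(x, y) for x in range(x1, x2) for y in range(y1, y2)}


red = box_pixels(0, 0, 4, 4)    # prediction: 16 pixels
blue = box_pixels(2, 2, 6, 6)   # ground truth: 16 pixels

# The "purple" overlap is 4 pixels and the union is 16 + 16 - 4 = 28 pixels.
overlap_score = jaccard(red, blue)
print(overlap_score)
```

Swapping the arguments gives the same value, which is the symmetry property stated above.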

How It Works

Computing IoU for Axis-Aligned Boxes

Given two boxes defined by their corners:

Box A: $(x_{1}^{A}, y_{1}^{A}, x_{2}^{A}, y_{2}^{A})$
Box B: $(x_{1}^{B}, y_{1}^{B}, x_{2}^{B}, y_{2}^{B})$

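Step 1: Compute intersection coordinates: $x_{1}^{I} = \max(x_{1}^{A}, x_{1}^{B})$, $y_{1}^{I} = \max(y_{1}^{A}, y_{1}^{B})$, $x_{2}^{I} = \min(x_{2}^{A}, x_{2}^{B})$, $y_{2}^{I} = \min(y_{2}^{A}, y_{2}^{B})$

Step 2: Compute intersection area: $Area_{I} = \max(0, x_{2}^{I} - x_{1}^{I}) \times \max(0, y_{2}^{I} - y_{1}^{I})$. The clamp at 0 makes the area zero when the boxes do not overlap.

Step 3: Compute union area: $Area_{U} = Area_{A} + Area_{B} - Area_{I}$, where $Area_{A} = (x_{2}^{A} - x_{1}^{A})(y_{2}^{A} - y_{1}^{A})$ and likewise for $Area_{B}$. Subtracting $Area_{I}$ avoids counting the overlap twice.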

Step 4: Compute IoU: $IoU = \frac{Area _{I}}{Area _{U}}$
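
The four steps can be sketched as a short Python function. This is an illustrative version, assuming boxes are `(x1, y1, x2, y2)` tuples with `x2 >= x1` and `y2 >= y1`; returning 0.0 when the union is empty is a choice made for this example.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Step 1: corners of the intersection rectangle.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)

    # Step 2: intersection area, clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Step 3: union area via inclusion-exclusion.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Step 4: the ratio (0.0 for two degenerate boxes, by assumption).
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes offset by 5 in each direction:
# intersection 25, union 100 + 100 - 25 = 175.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```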

IoU Thresholds in Evaluation

Threshold	Name	Use Case
0.5	AP50	Standard PASCAL VOC metric, lenient
0.75	AP75	Strict localization quality
0.5:0.95	AP (COCO primary)	Average over 10 thresholds: 0.50, 0.55, ..., 0.95

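A detection is a true positive if its IoU with a matched ground-truth box meets the threshold and the class is correct; otherwise, it is a false positive. Each ground-truth box can be matched by at most one detection, so duplicate detections of the same object count as false positives.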
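
The sketch below shows one simplified way to label detections at a given threshold. It assumes a single class, processes detections in descending score order, and greedily assigns each one to the best still-unmatched ground-truth box. The function name `label_detections` is hypothetical, and real benchmark toolkits differ in details such as strict versus non-strict comparison and the handling of "difficult" or "crowd" annotations.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter = (max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
             * max(0.0, min(a[3], b[3]) - max(a[1], b[1])))
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, gt_boxes, threshold=0.5):
    """Return True (TP) or False (FP) per detection, in descending score order.

    detections: list of (box, score) pairs for one class in one image.
    """
    matched = set()  # indices of ground-truth boxes already claimed
    labels = []
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(gt_boxes):
            if idx in matched:
                continue  # each ground-truth box can be matched only once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            labels.append(True)
        else:
            labels.append(False)
    return labels


gt_boxes = [(0, 0, 10, 10)]
detections = [((1, 1, 11, 11), 0.9),     # IoU 81/119 with the ground truth
              ((0, 0, 10, 9), 0.8),      # IoU 0.9, but scored lower
              ((50, 50, 60, 60), 0.7)]   # no overlap

print(label_detections(detections, gt_boxes, threshold=0.5))
print(label_detections(detections, gt_boxes, threshold=0.75))
```

At threshold 0.5 the highest-scoring box claims the ground truth and the tighter second box becomes a duplicate; at 0.75 the first box no longer qualifies, so the second one is matched instead.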

IoU as a Loss Function

Standard IoU loss for bounding box regression: $L_{IoU} = 1 - IoU (B_{pred}, B_{gt})$

This has a critical flaw: when boxes do not overlap ( $IoU = 0$ ), the gradient is zero, providing no learning signal.
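
A small numerical check illustrates the plateau. The sketch estimates the derivative of the loss with respect to a horizontal shift of the predicted box using a central finite difference; `finite_diff_dx` is an illustrative helper, not part of any library.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter = (max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
             * max(0.0, min(a[3], b[3]) - max(a[1], b[1])))
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, gt):
    return 1.0 - iou(pred, gt)


def finite_diff_dx(pred, gt, eps=1e-3):
    """Central-difference estimate of d(loss)/d(horizontal shift of pred)."""
    x1, y1, x2, y2 = pred
    plus = iou_loss((x1 + eps, y1, x2 + eps, y2), gt)
    minus = iou_loss((x1 - eps, y1, x2 - eps, y2), gt)
    return (plus - minus) / (2 * eps)


gt = (0.0, 0.0, 10.0, 10.0)
near_miss = (11.0, 0.0, 21.0, 10.0)    # 1 unit away from the ground truth
far_miss = (100.0, 0.0, 110.0, 10.0)   # 90 units away
overlapping = (5.0, 0.0, 15.0, 10.0)

# Both misses have the same loss, so the loss cannot tell them apart.
print(iou_loss(near_miss, gt), iou_loss(far_miss, gt))
# The slope is zero for the non-overlapping box and non-zero once boxes overlap.
print(finite_diff_dx(far_miss, gt), finite_diff_dx(overlapping, gt))
```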

Generalized IoU (GIoU, 2019)

Rezatofighi et al. addressed the zero-gradient problem:

$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$

where $C$ is the smallest enclosing box of $A$ and $B$ . GIoU ranges from $- 1$ to $1$, providing a gradient even when boxes do not overlap.
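
The formula can be sketched directly for axis-aligned boxes. This version assumes both boxes have positive area, so the enclosing box $C$ and the union are non-empty.

```python
def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    inter = (max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
             * max(0.0, min(a[3], b[3]) - max(a[1], b[1])))
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter

    # Smallest axis-aligned box C enclosing both inputs.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)

    # |C \ (A u B)| equals area_c - union because C contains both boxes.
    return inter / union - (area_c - union) / area_c


gt = (0.0, 0.0, 10.0, 10.0)
near_miss = (11.0, 0.0, 21.0, 10.0)
far_miss = (100.0, 0.0, 110.0, 10.0)

# Plain IoU is 0 for both misses; GIoU ranks the nearer box higher.
print(giou(near_miss, gt))   # -10/210
print(giou(far_miss, gt))    # -900/1100
```

Because the penalty term grows with the empty space inside $C$, moving a non-overlapping prediction toward the target increases GIoU even though IoU stays at zero.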

Distance-IoU (DIoU) and Complete-IoU (CIoU, 2020)

$DIoU = IoU - \frac{\rho^{2}(b, b^{gt})}{c^{2}}$

where $\rho(b, b^{gt})$ is the Euclidean distance between the centers $b$ and $b^{gt}$ of the predicted and ground-truth boxes, and $c$ is the diagonal length of the smallest box enclosing both.

CIoU adds an aspect ratio consistency term: $CIoU = IoU - \frac{\rho^{2}(b, b^{gt})}{c^{2}} - \alpha v$

where $v$ measures aspect ratio consistency and $\alpha$ is a balancing parameter.
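
A minimal sketch of both quantities follows. The text above does not spell out $v$ and $\alpha$; the code assumes the commonly cited definitions $v = \frac{4}{\pi^{2}} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^{2}$ and $\alpha = \frac{v}{(1 - IoU) + v}$. The function name `diou_ciou` and the small `eps` guard against division by zero are additions made for this example, and both boxes are assumed to have positive width and height.

```python
import math


def diou_ciou(pred, gt, eps=1e-9):
    """Return (DIoU, CIoU) for two (x1, y1, x2, y2) boxes with positive size."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    inter = (max(0.0, min(px2, gx2) - max(px1, gx1))
             * max(0.0, min(py2, gy2) - max(py1, gy1)))
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union

    # rho^2: squared distance between the two box centers.
    rho2 = (((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2
            + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2)

    # c^2: squared diagonal of the smallest enclosing box.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2

    diou = iou - rho2 / c2

    # Aspect ratio term v and its weight alpha (assumed definitions, see lead-in).
    v = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return diou, diou - alpha * v


# Same aspect ratio, shifted centers: the aspect term vanishes, so CIoU equals DIoU.
print(diou_ciou((5, 5, 15, 15), (0, 0, 10, 10)))
# Different aspect ratio: CIoU is lower than DIoU.
print(diou_ciou((0, 0, 20, 5), (0, 0, 10, 10)))
```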

Why It Matters

IoU is the standard localization metric used in every major detection benchmark (PASCAL VOC, COCO, Open Images, LVIS).
COCO's primary metric (AP averaged over IoU 0.5:0.95) incentivizes precise localization, not just approximate overlap.
IoU-based losses (GIoU, DIoU, CIoU) are reported to outperform $L_{1}$ and $L_{2}$ box regression losses by roughly 1-3% AP, in part because they optimize a quantity much closer to the evaluation metric than per-coordinate errors are.
IoU thresholds define what counts as a detection, making them among the most consequential hyperparameters in the entire detection pipeline.

Key Technical Details

Computation cost: IoU between two boxes requires ~10 arithmetic operations. Pairwise IoU for $n$ boxes is $O (n^{2})$ .
Scale invariance: IoU is invariant to box scale -- a 50% overlap at $32 \times 32$ scores the same as at $512 \times 512$.
GIoU loss improves Faster R-CNN by ~1% AP and YOLOv3 by ~2-3% AP compared to smooth $L_{1}$ loss.
CIoU loss further improves over GIoU by ~0.5-1% AP by incorporating center distance and aspect ratio.
PASCAL VOC uses AP50 (IoU $\geq 0.5$ ); COCO uses AP (averaged over 0.5:0.05:0.95), which is much stricter.
IoU 0.5 vs. 0.75: A detector scoring 50% AP50 might score only 30% AP75, revealing coarse localization.
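
Two of the points above, the quadratic cost of pairwise IoU and scale invariance, can be checked with a short pure-Python sketch. The helpers `pairwise_iou`, `scaled`, and `l1_error` are illustrative names; a practical implementation would typically vectorize the pairwise computation.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter = (max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
             * max(0.0, min(a[3], b[3]) - max(a[1], b[1])))
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pairwise_iou(boxes_a, boxes_b):
    """len(boxes_a) x len(boxes_b) matrix of IoU values; O(n^2) when both have n boxes."""
    return [[iou(a, b) for b in boxes_b] for a in boxes_a]


def scaled(box, s):
    """Multiply every coordinate of a box by s."""
    return tuple(s * v for v in box)


def l1_error(a, b):
    """Sum of absolute coordinate differences, for contrast with IoU."""
    return sum(abs(p - q) for p, q in zip(a, b))


gt = (0, 0, 32, 32)
pred = (16, 0, 48, 32)   # shifted right by half a box width

# Scaling both boxes by 16 (32x32 -> 512x512) leaves IoU unchanged,
# while the L1 coordinate error grows by the same factor of 16.
iou_small = iou(pred, gt)
iou_large = iou(scaled(pred, 16), scaled(gt, 16))
l1_small = l1_error(pred, gt)
l1_large = l1_error(scaled(pred, 16), scaled(gt, 16))
print(iou_small, iou_large, l1_small, l1_large)

boxes = [gt, pred, (100, 100, 132, 132)]
matrix = pairwise_iou(boxes, boxes)   # 3 x 3, symmetric, ones on the diagonal
```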

Common Misconceptions

"IoU 0.5 means the prediction is 50% correct." IoU 0.5 means 50% of the union area is shared, but the prediction may include significant background or miss part of the object. Visually, IoU 0.5 boxes can look quite misaligned.
"IoU is always the best matching metric." For very small objects (e.g., $10 \times 10$ pixels), a shift of a few pixels causes a large IoU drop, even though the detection is essentially correct. Some benchmarks use pixel distance for very small objects.
"L1 or L2 loss on box coordinates is equivalent to IoU." These losses treat each coordinate independently and are not scale-invariant. A 10-pixel error matters much more for a $30 \times 30$ box than a $300 \times 300$ box; IoU captures this naturally.

Connections to Other Concepts

Non-Maximum Suppression: Uses IoU to determine which overlapping boxes to suppress.
R-CNN: IoU thresholds determine positive/negative assignment during training (e.g., IoU $\geq 0.5$ for positives, IoU $< 0.3$ for negatives).
DETR (Detection Transformer): Uses Generalized IoU in its matching cost and training loss.
Focal Loss: In RetinaNet, the detector that introduced focal loss, training sample assignment relies on IoU between anchors and ground-truth boxes.
Sliding Window and Region Proposals: Proposal recall is evaluated at specific IoU thresholds.
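
As an illustration of the first connection, here is a minimal sketch of greedy Non-Maximum Suppression built on IoU. It assumes a single class and uses an illustrative default threshold of 0.5; the function name `nms` and its signature are chosen for this example and do not mirror any particular library.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter = (max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
             * max(0.0, min(a[3], b[3]) - max(a[1], b[1])))
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if no already-kept (higher-scoring) box overlaps it too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]

# Boxes 0 and 1 overlap with IoU 81/119, so the lower-scoring box 1 is suppressed.
print(nms(boxes, scores))
```

Raising `iou_threshold` keeps more near-duplicate boxes, while lowering it suppresses more aggressively, which can remove true detections of adjacent objects.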