CVPR 2019 · Paper Walkthrough

"Find Me This, But Different":
A Deep Dive into Composed Image Retrieval

How TIRG, short for Text Image Residual Gating, handles multimodal retrieval by treating the image as the anchor and the text as the edit.

Paper: Nam Vo et al., Google Research · CVPR 2019 · Read time: 16 min · Topics: Computer Vision · Multimodal AI · Metric Learning

Imagine showing a system a photo of a red dress and saying, "find the same dress but in blue."

A standard image retrieval system sees only the query image and cannot take an instruction, while a standard text retrieval system sees only the words and ignores the visual reference. This paper asks: can we retrieve images using both a reference image and a textual modification?

This is the problem studied by Nam Vo et al. in their CVPR 2019 paper, Composing Text and Image for Image Retrieval: An Empirical Study. Despite the title, the paper is not just a comparison study. It introduces TIRG, a simple way to combine image and text features, and shows strong results on three very different benchmarks.

The Problem: Retrieval from an Image Plus a Text Edit

[Figure: composed retrieval demo] **The composed retrieval task.** Given a reference photo and the instruction "no people and switch to night-time," the system retrieves matching night images using both the image and the text together.

The query has two components playing asymmetric roles:

The image is the anchor. Most of what you want is already in it.
The text is the modifier. It only tells the system what should change.

This asymmetry is everything. A method that treats image and text as symmetric inputs, blending them with equal weight, misses the point entirely. You're not trying to find something that "looks like the image and sounds like the text." You're trying to find something that looks like the image except where the text says otherwise.

The text doesn't describe the target. It describes what's different from the source.

Older methods do not handle this especially well. Concatenating features misses useful cross-modal interactions. Element-wise addition or multiplication also treats both inputs the same way, even though the task is really "keep the dress, change the color."

TIRG: Text Image Residual Gating

[Figure: full TIRG architecture] **The full TIRG pipeline.** The reference image goes through a CNN; the modification text through an LSTM. The composition module produces a gated feature (what to keep) plus a residual feature (what to change), summed into a query vector. A triplet loss trains everything end-to-end.

The key insight is borrowed from residual networks: instead of generating the entire target representation from scratch, start from the image representation and modify it based on the text.

f(img, txt) = w_g ⊙ φ(img)  +  w_r ⊙ g(φ(img), φ(txt))

where:
  φ(img), φ(txt) = image and text embeddings
  w_g = σ( W_g · [φ(img); φ(txt)] + b_g )   ← "keep" gate  ∈ (0,1)ᵈ
  w_r = σ( W_r · [φ(img); φ(txt)] + b_r )   ← "add"  gate  ∈ (0,1)ᵈ
  g(·) = W₂ · ReLU( W₁ · [φ(img); φ(txt)] )   ← 2-layer MLP
  ⊙    = element-wise multiply

Step-by-step through the composition

Encoding each modality. The reference image goes through a pretrained ResNet that is fine-tuned end-to-end, giving φ(img) ∈ ℝᵈ. The text is tokenized, embedded with GloVe, and passed through a single-layer LSTM; the final hidden state gives φ(txt) ∈ ℝᵈ. Both end up in the same d-dimensional space, which is necessary for element-wise gating.

Computing the gates from both inputs. The concatenation [φ(img); φ(txt)] ∈ ℝ²ᵈ goes through two separate linear layers followed by sigmoid:

w_g = σ( W_g · [φ(img) ; φ(txt)] + b_g )  ← keep gate
w_r = σ( W_r · [φ(img) ; φ(txt)] + b_r )  ← add  gate

Because both gates are conditioned on the concatenation of image and text, the model doesn't learn a generic "color gate." It learns: given this specific text applied to this specific image, which dimensions should move?

The residual transform g(·). This is a 2-layer MLP that models the delta. It does not try to rebuild the full target representation. It only learns what should change. This follows the same basic idea as ResNet: learning the change can be easier than learning the whole mapping from scratch.

g(φ(img), φ(txt)) = W₂ · ReLU( W₁ · [φ(img) ; φ(txt)] )

Gated addition.

f(img,txt) = w_g ⊙ φ(img)  +  w_r ⊙ g(φ(img), φ(txt))
              ↑ keep what stays    ↑ add what changes

At the extremes: w_g ≈ 1 and w_r ≈ 0 → output is just the image. w_g ≈ 0 and w_r ≈ 1 → output is purely the residual g(·). In practice the gates land between these extremes and vary per dimension: some image dimensions pass through almost untouched, while others are suppressed and replaced by the residual.
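To make the steps above concrete, here is a minimal NumPy sketch of the composition. It assumes random weights, a toy dimension d = 8, and a hidden layer of width d inside the MLP; none of those values come from the paper, and the names `tirg_compose` and `init_params` are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tirg_compose(phi_img, phi_txt, params):
    """Gated residual composition, following the formulas in this section."""
    z = np.concatenate([phi_img, phi_txt])            # [phi(img); phi(txt)], shape (2d,)
    w_g = sigmoid(params["W_g"] @ z + params["b_g"])  # keep gate, shape (d,)
    w_r = sigmoid(params["W_r"] @ z + params["b_r"])  # add gate, shape (d,)
    residual = params["W_2"] @ np.maximum(0.0, params["W_1"] @ z)  # 2-layer MLP g(.)
    return w_g * phi_img + w_r * residual

def init_params(d, rng):
    """Small random weights: a hypothetical initialisation for this demo only."""
    return {
        "W_g": rng.normal(0, 0.1, (d, 2 * d)), "b_g": np.zeros(d),
        "W_r": rng.normal(0, 0.1, (d, 2 * d)), "b_r": np.zeros(d),
        "W_1": rng.normal(0, 0.1, (d, 2 * d)),
        "W_2": rng.normal(0, 0.1, (d, d)),
    }

rng = np.random.default_rng(0)
d = 8
params = init_params(d, rng)
phi_img, phi_txt = rng.normal(size=d), rng.normal(size=d)

# Ordinary case: a d-dimensional query that mixes image and residual.
query = tirg_compose(phi_img, phi_txt, params)

# Extreme case: large biases push the keep gate to ~1 and the add gate to ~0,
# so the output should collapse to the image embedding.
extreme = dict(params, b_g=np.full(d, 50.0), b_r=np.full(d, -50.0))
image_like = tirg_compose(phi_img, phi_txt, extreme)
```

With the biases forced to +50 and -50, `image_like` is numerically equal to `phi_img`, which is the "output is just the image" extreme. A trained model would sit somewhere in between, with different gate values per dimension.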

L2 normalization. Both the composed query and the target image embedding are L2-normalized to the unit hypersphere before distance computation:

‖f(img,txt)‖₂ = 1  and  ‖φ(x*)‖₂ = 1

For unit vectors a and b separated by angle θ:
‖a − b‖₂²  =  2(1 − cos θ)
→ minimizing Euclidean distance = maximizing cosine similarity
→ the model cannot change distances just by rescaling an embedding's magnitude
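The identity above is easy to check numerically. The sketch below uses random vectors (an assumption for illustration), normalizes them, and compares squared Euclidean distance with cosine similarity. The helper name `l2_normalize` is hypothetical.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale v to unit length; eps guards against division by zero."""
    return v / max(np.linalg.norm(v), eps)

rng = np.random.default_rng(1)
a = l2_normalize(rng.normal(size=16))
# The raw vector for b is 100x larger; normalization removes that scale.
b = l2_normalize(rng.normal(size=16) * 100.0)

sq_dist = float(np.sum((a - b) ** 2))   # squared Euclidean distance
cos_sim = float(a @ b)                  # cosine similarity of unit vectors
```

Here `sq_dist` and `2 * (1 - cos_sim)` agree up to floating-point error, and the factor of 100 in the raw vector has no effect on either quantity.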

TIRG vs FiLM: the key difference

FiLM also modulates the image features with a scale and a shift, but both are computed from text only: γ(t) ⊙ φ(img) + β(t). So the instruction "make it blue" produces the same scale and shift whether the source is a dress or a car. TIRG conditions its gates on both image and text, so the same instruction can behave differently depending on what it is editing.
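A small sketch can show the difference in conditioning. It assumes random weight matrices and random stand-in embeddings for a "dress" and a "car"; `film_scale` and `tirg_gate` are illustrative names, not functions from either paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
d = 6
W_film = rng.normal(size=(d, d))        # hypothetical FiLM weights: text -> gamma
W_tirg = rng.normal(size=(d, 2 * d))    # hypothetical TIRG weights: [img; txt] -> gate

txt = rng.normal(size=d)                # one instruction, e.g. "make it blue"
img_dress, img_car = rng.normal(size=d), rng.normal(size=d)

def film_scale(img, txt):
    # gamma(t): the image argument is accepted but never used.
    return W_film @ txt

def tirg_gate(img, txt):
    # The gate sees the image and the text together.
    return sigmoid(W_tirg @ np.concatenate([img, txt]))

same_film = np.allclose(film_scale(img_dress, txt), film_scale(img_car, txt))
same_tirg = np.allclose(tirg_gate(img_dress, txt), tirg_gate(img_car, txt))
```

`same_film` comes out true and `same_tirg` false: the text-only modulation cannot tell the two sources apart, while the jointly conditioned gate can.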

Why not just add or multiply?

Method	Formula	What goes wrong
Sum	φ(img) + φ(txt)	Text features bleed into all image dimensions uniformly
Hadamard product	φ(img) ⊙ φ(txt)	Dense text embedding suppresses most image features
Concat + linear	W·[φ(img); φ(txt)]	Linear, so it cannot express selective nonlinear edits well
FiLM	γ(t)⊙φ(img)+β(t)	Gates depend only on text, so the gating is blind to the source image
TIRG	w_g(img,txt)⊙φ(img) + w_r(img,txt)⊙g(img,txt)	Both gates conditioned on both inputs simultaneously
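The "Concat + linear" row can be illustrated directly. Because W·[φ(img); φ(txt)] splits into one matrix acting on the image and one acting on the text, the text always moves the output by the same vector, whatever the image is. The sketch below assumes random weights and embeddings; `concat_linear` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
W = rng.normal(size=(d, 2 * d))         # hypothetical weights of the concat + linear baseline
W_x, W_t = W[:, :d], W[:, d:]           # column blocks acting on image and on text

def concat_linear(img, txt):
    return W @ np.concatenate([img, txt])

txt = rng.normal(size=d)
no_edit = np.zeros(d)                   # stand-in for "no text"
img_a, img_b = rng.normal(size=d), rng.normal(size=d)

# Shift caused by the text = output with the text minus output with a zero text vector.
shift_a = concat_linear(img_a, txt) - concat_linear(img_a, no_edit)
shift_b = concat_linear(img_b, txt) - concat_linear(img_b, no_edit)
```

Both shifts equal `W_t @ txt`, so the edit is blind to the source image. A nonlinear, jointly conditioned gate is one way to break that limitation.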

The Loss Function: Batch-Hard Triplet Loss

The architecture produces a composed query vector f(img, txt). Training needs a signal that says: this vector should be close to the correct target and far from everything else. That's the job of the triplet loss.

Triplet loss from first principles

A triplet is (anchor, positive, negative). Here:

Anchor = the composed query f(r, t), where r is the reference image and t is the modification text
Positive = the correct target embedding φ(x*)
Negative = any other image φ(x⁻) that is not the correct target

L = max( 0,  d(anchor, positive) − d(anchor, negative) + α )

d(a,b) = ‖a − b‖₂   (after L2 normalization)
α      = margin hyperparameter (typically 0.1 to 0.5)

The loss fires only when the negative is dangerously close:
  d(anchor, negative) < d(anchor, positive) + α

Intuition for the margin α

The loss is zero when the positive is already closer than the negative by at least α. Without that margin, the model could get away with tiny separations that do not generalize very well. The margin forces a clearer gap.
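A few hand-picked distances show how the margin changes the outcome. The numbers below are illustrative only and are not taken from the paper.

```python
def triplet_loss(d_pos, d_neg, margin):
    """Hinge-style triplet loss on precomputed distances."""
    return max(0.0, d_pos - d_neg + margin)

# Illustrative anchor-to-positive distance.
d_pos = 0.40

# Negative is far away: the hinge is inactive.
easy = triplet_loss(d_pos, d_neg=1.20, margin=0.2)
# Ordering is correct, but the gap (0.1) is smaller than the margin (0.2).
close = triplet_loss(d_pos, d_neg=0.50, margin=0.2)
# Negative is closer than the positive: a clear violation.
violating = triplet_loss(d_pos, d_neg=0.30, margin=0.2)
# Same "close" case with no margin: the loss sees nothing to fix.
no_margin = triplet_loss(d_pos, d_neg=0.50, margin=0.0)
```

The `close` case is the interesting one: it is penalized with a margin of 0.2 but contributes zero loss when the margin is 0, which is the "tiny separation" situation described above.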

Why batch-hard mining?

Randomly sampled negatives are usually too easy. For a query like "red dress to blue," a photo of a hammer teaches the model nothing. Batch-hard mining instead picks the hardest triplet inside each mini-batch:

For each anchor i in the batch:

  hardest positive = argmax_{j: y_j=y_i}  d(f_i, φ(x_j))
                     ← correct target FURTHEST from the query

  hardest negative = argmin_{k: y_k≠y_i}  d(f_i, φ(x_k))
                     ← wrong image CLOSEST to the query

  loss_i = max( 0, d(f_i, hardest_pos) − d(f_i, hardest_neg) + α )
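The mining rule above can be sketched in a few lines of NumPy. This is a toy version under stated assumptions: embeddings are given as plain arrays, labels mark which targets count as positives for each query, and the function name `batch_hard_triplet_loss` is made up for this walkthrough.

```python
import numpy as np

def batch_hard_triplet_loss(queries, targets, labels, margin=0.2):
    """queries: (B, d) composed query vectors f_i
    targets: (B, d) image embeddings phi(x_j)
    labels:  (B,)   y_i; targets sharing a query's label count as positives
    """
    labels = np.asarray(labels)
    # Pairwise Euclidean distances d(f_i, phi(x_j)), shape (B, B).
    diff = queries[:, None, :] - targets[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))

    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)   # furthest positive
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)   # closest negative

    per_anchor = np.maximum(0.0, hardest_pos - hardest_neg + margin)
    return per_anchor.mean(), per_anchor

# Toy batch of three queries in 2-D, one target each.
queries = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
targets = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])
labels = [0, 1, 2]

loss, per_anchor = batch_hard_triplet_loss(queries, targets, labels, margin=0.2)
```

In this toy batch the second query is already well separated and contributes zero loss. The first is ranked correctly but inside the margin, and the third has a negative closer than its own target, so it dominates the batch loss.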

Why this matters especially here

In multimodal retrieval, the embedding space is full of images that are almost right. For "black dress, make it blue," the hardest negatives are often other blue dresses with the wrong style. Those examples force the model to preserve the garment shape as well as apply the edit. Easy negatives do not help much.

Full training objective

L = (1/|B|) · Σᵢ max( 0, ‖f(rᵢ,tᵢ) − φ(xᵢ*)‖₂ − ‖f(rᵢ,tᵢ) − φ(xᵢ⁻)‖₂ + α )

where xᵢ⁻ is the hardest in-batch negative for query i

Gradients flow through everything jointly:
  ∂L/∂W_g, ∂L/∂W_r   ← keep and add gates
  ∂L/∂W_MLP           ← residual transform
  ∂L/∂W_CNN           ← image encoder (fine-tuned)
  ∂L/∂W_LSTM          ← text encoder (fine-tuned)

Everything trains jointly from one signal: does the composed query end up near the right image?

Three Datasets, Three Flavors of the Problem

CSS

Synthetic colored 3D objects. Queries look like "find something shaped like this but larger and green." There are two settings: 3D-to-3D, where both query and target are 3D renders, and 2D-to-3D, where the query is flat but the target is 3D. It is a clean setup for testing localized edits.

Fashion200k

200,000 real fashion product images. Most edits change one attribute at a time, such as color, style, material, or cut. This dataset is much noisier than CSS because many garments look very similar.

MIT-States

Natural images from 245 object categories with 115 adjective modifiers: "change state to melted," "change state to ancient," or "change state to broken." These edits are more abstract, which makes MIT-States the hardest benchmark here.

A Quick Note on Recall@K

Retrieval is evaluated with Recall@K (R@K): the fraction of test queries whose correct target appears among the top K retrieved results. With one annotated target per query, each query is simply a hit or a miss, and R@K is the hit rate averaged over all test queries.

R@1: strict. The correct result has to be first.
R@10: more forgiving, and closer to what a person might actually browse.
R@50: a broader check for whether the system is at least in the right region.

A good system gains across all three: it doesn't just put the answer somewhere in the ranking, it puts it near the top.
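As a rough sketch of the metric, assuming one annotated target per query and rankings given as lists of ids (the ids and the function name `recall_at_k` are made up for illustration):

```python
def recall_at_k(ranked_ids, target_ids, k):
    """ranked_ids: one ranked list of candidate ids per query (best first).
    target_ids: the single annotated target id for each query."""
    hits = sum(target in ranking[:k] for ranking, target in zip(ranked_ids, target_ids))
    return hits / len(target_ids)

# Toy rankings for four queries.
rankings = [
    ["a", "b", "c", "d"],   # target "a" at rank 1
    ["b", "a", "c", "d"],   # target "a" at rank 2
    ["d", "c", "b", "a"],   # target "a" at rank 4
    ["c", "d", "b", "e"],   # target "a" not retrieved at all
]
targets = ["a", "a", "a", "a"]

r1 = recall_at_k(rankings, targets, 1)
r2 = recall_at_k(rankings, targets, 2)
r4 = recall_at_k(rankings, targets, 4)
```

On this toy data the recall rises from 0.25 at K = 1 to 0.5 at K = 2 and 0.75 at K = 4. The fourth query never counts as a hit because its target is missing from the ranking.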

The Results: TIRG Wins, But the Gap Tells the Story

Fashion200k: Table 1

Table 1: Fashion200k retrieval performance (R@K)

Method	R@1	R@10	R@50
Han et al. [prior work]	6.3	19.9	38.3
Image only	3.5	22.7	43.7
Text only	1.0	12.3	21.8
Concatenation	11.9 ± 1.0	39.7 ± 1.0	62.6 ± 0.7
Show and Tell	12.3 ± 1.1	40.2 ± 1.7	61.8 ± 0.9
Parameter Hashing	12.2 ± 1.1	40.0 ± 1.1	61.7 ± 0.8
Relationship	13.0 ± 0.4	40.5 ± 0.7	62.4 ± 0.6
MRN	13.4 ± 0.4	40.0 ± 0.8	61.9 ± 0.6
FiLM	12.9 ± 0.7	39.5 ± 2.1	61.9 ± 1.9
TIRG (ours)	14.1 ± 0.6	42.5 ± 0.7	63.8 ± 0.8

TIRG leads on all three metrics, including a relative gain of about 18% over concatenation at R@1 (14.1 vs. 11.9).

On Fashion200k, TIRG reaches 14.1 R@1 and 63.8 R@50, beating every baseline. It also outperforms "Show and Tell," which tries to describe the target in language first. That suggests it is better to model the edit directly than to generate a verbal description of the final image.

MIT-States: Table 2

Table 2: MIT-States retrieval performance (R@K)

Method	R@1	R@5	R@10
Image only	3.3 ± 0.1	12.8 ± 0.2	20.9 ± 0.1
Text only	7.4 ± 0.4	21.5 ± 0.9	32.7 ± 0.8
Concatenation	11.8 ± 0.2	30.8 ± 0.5	42.1 ± 0.3
Show and Tell	11.9 ± 0.1	31.0 ± 0.5	42.0 ± 0.8
Att. as Operator	8.1 ± 0.5	27.3 ± 0.3	39.1 ± 0.3
Relationship	12.3 ± 0.5	31.9 ± 0.7	42.9 ± 0.9
MRN	11.9 ± 0.6	30.5 ± 0.3	41.0 ± 0.2
FiLM	10.1 ± 0.3	27.7 ± 0.7	38.3 ± 0.7
TIRG (ours)	12.2 ± 0.4	31.9 ± 0.3	43.1 ± 0.5

TIRG is effectively tied with Relationship: 12.2 vs. 12.3 at R@1, 31.9 each at R@5, and 43.1 vs. 42.9 at R@10, with every gap inside the reported ± ranges.

On MIT-States, TIRG gets 12.2 R@1 and 43.1 R@10. It has the highest R@10 and shares the top R@5 with Relationship (31.9 each), while Relationship is marginally ahead at top-1 (12.3 vs. 12.2). That fits the dataset: MIT-States is less about simple feature swaps and more about broader transformations like "rusted" or "cooked." When the text describes a bigger semantic change, the advantage of residual gating gets smaller.
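One quick way to read Table 2 is to compare the TIRG and Relationship rows against their reported spread. The sketch below assumes the ± values are standard deviations across runs (the table does not say so here) and uses a rough overlap heuristic, not a statistical test. `within_noise` is a hypothetical helper.

```python
# (mean, spread) pairs copied from Table 2 above.
relationship = {"R@1": (12.3, 0.5), "R@5": (31.9, 0.7), "R@10": (42.9, 0.9)}
tirg = {"R@1": (12.2, 0.4), "R@5": (31.9, 0.3), "R@10": (43.1, 0.5)}

def within_noise(a, b):
    """True if the gap between two means is no larger than the sum of their spreads.
    A rough heuristic chosen for this sketch, not a significance test."""
    (mean_a, spread_a), (mean_b, spread_b) = a, b
    return abs(mean_a - mean_b) <= spread_a + spread_b

overlap = {k: within_noise(tirg[k], relationship[k]) for k in tirg}
```

Under this heuristic all three metrics overlap, which is consistent with treating the two methods as roughly level on MIT-States.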

TIRG works best when the change is local, like changing color or sleeve style. When the edit is broader and more semantic, the gap to the baselines gets smaller.

CSS Dataset: Table 3

Table 3: CSS dataset retrieval performance (R@1)

Method	3D-to-3D	2D-to-3D
Image only	6.3	6.3
Text only	0.1	0.1
Concatenation	60.6 ± 0.8	27.3
Show and Tell	33.0 ± 3.2	6.0
Parameter Hashing	60.5 ± 1.9	31.4
Relationship	62.1 ± 1.2	30.6
MRN	60.1 ± 2.7	26.8
FiLM	65.6 ± 0.5	43.7
TIRG (ours)	73.7 ± 1.0	46.6

On 3D-to-3D, TIRG beats FiLM by about 8 points (73.7 vs. 65.6). The 2D-to-3D setting is harder for every method because the query and target come from different visual domains.

TIRG scores 73.7 R@1 on 3D-to-3D, about 8 points above FiLM (65.6); on 2D-to-3D the margin narrows to about 3 points (46.6 vs. 43.7). The gap is biggest here because CSS is built around very targeted edits like "make it green" or "make it larger." That is exactly the kind of setting where gated composition helps most.
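Putting the three tables side by side makes the pattern easier to see. The snippet below only does arithmetic on R@1 values copied from Tables 1 to 3, comparing TIRG with the strongest baseline in each setting; the dictionary layout is an arbitrary choice for this sketch.

```python
# R@1 values copied from Tables 1-3 above: TIRG vs. the strongest baseline.
r_at_1 = {
    "Fashion200k":  {"tirg": 14.1, "baseline": ("MRN", 13.4)},
    "MIT-States":   {"tirg": 12.2, "baseline": ("Relationship", 12.3)},
    "CSS 3D-to-3D": {"tirg": 73.7, "baseline": ("FiLM", 65.6)},
    "CSS 2D-to-3D": {"tirg": 46.6, "baseline": ("FiLM", 43.7)},
}

# Absolute gap in R@1 points, rounded to one decimal like the tables.
gaps = {name: round(row["tirg"] - row["baseline"][1], 1) for name, row in r_at_1.items()}
largest = max(gaps, key=gaps.get)
```

The gaps come out as 0.7, -0.1, 8.1, and 2.9 points respectively, with CSS 3D-to-3D the largest. That ordering lines up with the reading that gating helps most when edits are narrow and well defined.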

Seeing It Work: Qualitative Results

[Figure: CSS qualitative comparison] **CSS qualitative results.** Each row shows the query, the text edit, the true target, and the top prediction from Concat, FiLM, and TIRG. TIRG usually applies the right change while keeping the rest of the object intact.

[Figure: Fashion200k retrieval results] **Fashion200k retrieval.** Given a source garment and a short modification, TIRG returns relevant matches. Green boxes mark ground-truth targets. TIRG often places them earlier in the ranked list, especially for subtle edits like length or silhouette.

[Figure: MIT-States results] **MIT-States, the hardest benchmark.** "Change state to ancient" on a modern clock returns ornate historical timepieces. "Change state to broken" on an umbrella returns damaged objects. The model is clearly picking up something real about broader transformations.

Limitations and Failure Modes

The gates are not interpretable. ResNet features are distributed and entangled. If color and texture co-vary in training data, which often happens in fashion, the model may not separate them cleanly. Changing color can accidentally shift the perceived fabric type too.

Multi-attribute modifications are hard. "Make it longer and beige" requires two independent modifications. A single residual vector g(·) must encode both simultaneously, and performance degrades as concurrent changes multiply.

Evaluation is binary. R@K counts only the annotated ground truth as correct. A different blue dress that matches the request perfectly still scores zero if it is not the labeled target. So the benchmark can understate real-world usefulness.

Triplet loss instability. Batch-hard mining can trigger collapsed embeddings in early training if negatives are too hard too soon. Warm-up strategies help but add fragility to the training recipe.

Scalability is untested. CSS, Fashion200k, and MIT-States are benchmarks, not production workloads. Real e-commerce has millions of items and users who don't phrase edits cleanly. Whether gating survives that messiness is an open question the paper leaves for future work.

Conclusion: The Gating Trick That Worked

The paper is more focused than its title makes it sound. It asks a clean question: how should image and text be combined when they play different roles? Its answer is residual gating conditioned on both inputs.

TIRG does well across all three datasets, with the biggest gains on tasks where the edits are localized, like color swaps or attribute changes. The idea is simple: start from the image, then apply the text as a modification instead of trying to rebuild everything from scratch.

The core idea still feels relevant. A lot of later multimodal systems run into the same basic tension: what should stay, and what should change? TIRG is an early and very clear example of building that distinction directly into the model.

If the query has a reference and an edit, the model should treat them that way. They are not two interchangeable inputs. That asymmetry is the task.

Written by

Amit Pandey · Sakshi Pandey · Sharath H N

Thank you for reading. We hope this walkthrough made TIRG a little more intuitive.
