"Find Me This, But Different":
A Deep Dive into Composed Image Retrieval
How TIRG, short for Text Image Residual Gating, handles multimodal retrieval by treating the image as the anchor and the text as the edit.
A standard image retrieval system sees only the query image, and a standard text retrieval system has no visual reference at all. This paper asks: can we retrieve images using both a reference image and a textual modification?
This is the problem studied by Nam Vo et al. in their CVPR 2019 paper, Composing Text and Image for Image Retrieval: An Empirical Study. Despite the title, the paper is not just a comparison study. It introduces TIRG, a simple way to combine image and text features, and shows strong results on three very different benchmarks.
The Problem: Retrieval from an Image Plus a Text Edit
The query has two components playing asymmetric roles:
- The image is the anchor. Most of what you want is already in it.
- The text is the modifier. It only tells the system what should change.
This asymmetry is everything. A method that treats image and text as symmetric inputs, blending them with equal weight, misses the point entirely. You're not trying to find something that "looks like the image and sounds like the text." You're trying to find something that looks like the image except where the text says otherwise.
The text doesn't describe the target. It describes what's different from the source.
Older methods do not handle this especially well. Concatenating features misses useful cross-modal interactions. Element-wise addition or multiplication also treats both inputs the same way, even though the task is really "keep the dress, change the color."
TIRG: Text Image Residual Gating
The key insight is borrowed from residual networks: instead of generating the entire target representation from scratch, start from the image representation and modify it based on the text.
f(img, txt) = w_g ⊙ φ(img) + w_r ⊙ g(φ(img), φ(txt))

where:
- φ(img), φ(txt) = image and text embeddings
- w_g = σ( W_g · [φ(img); φ(txt)] + b_g ) ← "keep" gate ∈ (0,1)ᵈ
- w_r = σ( W_r · [φ(img); φ(txt)] + b_r ) ← "add" gate ∈ (0,1)ᵈ
- g(·) = W₂ · ReLU( W₁ · [φ(img); φ(txt)] ) ← 2-layer MLP
- ⊙ = element-wise multiply
Step-by-step through the composition
Encoding each modality. The reference image goes through a pretrained ResNet that is fine-tuned end-to-end, giving φ(img) ∈ ℝᵈ. The text is tokenized, embedded with GloVe, and passed through a single-layer LSTM; the final hidden state gives φ(txt) ∈ ℝᵈ. Both end up in the same d-dimensional space, which is necessary for element-wise gating.
Computing the gates from both inputs. The concatenation [φ(img); φ(txt)] ∈ ℝ²ᵈ goes through two separate linear layers followed by sigmoid:
w_g = σ( W_g · [φ(img) ; φ(txt)] + b_g ) ← keep gate

w_r = σ( W_r · [φ(img) ; φ(txt)] + b_r ) ← add gate
Because both gates are conditioned on the concatenation of image and text, the model doesn't learn a generic "color gate." It learns: given this specific text applied to this specific image, which dimensions should move?
The residual transform g(·). This is a 2-layer MLP that models the delta. It does not try to rebuild the full target representation. It only learns what should change. This follows the same basic idea as ResNet: learning the change can be easier than learning the whole mapping from scratch.
g(φ(img), φ(txt)) = W₂ · ReLU( W₁ · [φ(img) ; φ(txt)] )
Gated addition. The two gated terms are summed:

f(img, txt) = w_g ⊙ φ(img) + w_r ⊙ g(φ(img), φ(txt))

- w_g ⊙ φ(img): keep what stays
- w_r ⊙ g(φ(img), φ(txt)): add what changes
At the extremes: w_g ≈ 1 and w_r ≈ 0 → output is just the image. w_g ≈ 0 and w_r ≈ 1 → output is purely the text-driven residual. The model learns a dimension-wise middle ground: some dimensions fully suppressed, others fully preserved.
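The composition steps above can be sketched in a few lines of plain Python. This is a toy illustration of the formula as written in this article, not the authors' implementation: the dimension d = 4, the random weights from the hypothetical `init_params` helper, and the example embeddings are all assumptions made for the sketch. In a real model the weights are learned and the embeddings come from the image and text encoders.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply a list-of-rows matrix W by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def init_params(d, seed=0):
    """Random toy parameters (hypothetical; real ones are learned)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_g": mat(d, 2 * d), "b_g": [0.0] * d,  # keep gate
        "W_r": mat(d, 2 * d), "b_r": [0.0] * d,  # add gate
        "W_1": mat(d, 2 * d),                    # MLP layer 1
        "W_2": mat(d, d),                        # MLP layer 2
    }

def tirg_compose(phi_img, phi_txt, p):
    """f(img, txt) = w_g * phi_img + w_r * g([phi_img; phi_txt]), element-wise."""
    x = list(phi_img) + list(phi_txt)  # concatenation, length 2d
    w_g = [sigmoid(z + b) for z, b in zip(matvec(p["W_g"], x), p["b_g"])]
    w_r = [sigmoid(z + b) for z, b in zip(matvec(p["W_r"], x), p["b_r"])]
    hidden = [max(0.0, z) for z in matvec(p["W_1"], x)]  # ReLU
    g = matvec(p["W_2"], hidden)                         # residual (the delta)
    return [wg * a + wr * r for wg, a, wr, r in zip(w_g, phi_img, w_r, g)]

params = init_params(d=4)
phi_img = [0.9, -0.2, 0.4, 0.1]   # made-up image embedding
phi_txt = [0.0, 0.7, -0.3, 0.5]   # made-up text embedding
query = tirg_compose(phi_img, phi_txt, params)
```

Pushing `b_g` strongly positive and `b_r` strongly negative drives the output toward `phi_img`, which is the "output is just the image" extreme described above.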
L2 normalization. Both the composed query and the target image embedding are L2-normalized to the unit hypersphere before distance computation:
‖f(img,txt)‖₂ = 1 and ‖φ(x*)‖₂ = 1
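For unit vectors a and b separated by angle θ: ‖a − b‖₂² = 2(1 − cos θ)
→ minimizing Euclidean distance = maximizing cosine similarity
→ prevents magnitude cheating: the model cannot lower the loss just by rescaling embedding norms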
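The distance identity for unit vectors is easy to check numerically. The snippet below is a small sanity check in plain Python; the two example vectors are arbitrary and the helper names are chosen only for this illustration.

```python
import math

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

lhs = squared_distance(a, b)        # squared Euclidean distance
rhs = 2.0 * (1.0 - cosine(a, b))    # 2(1 - cos θ)

# Rescaling the raw vector changes nothing after normalization.
a_scaled = l2_normalize([30.0, 40.0, 0.0])
```

Here `lhs` and `rhs` agree up to floating-point error, and `a_scaled` matches `a`, which is why embedding magnitude carries no information once both sides are normalized.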
Why not just add or multiply?
| Method | Formula | What goes wrong (or, for TIRG, what goes right) |
|---|---|---|
| Sum | φ(img) + φ(txt) | Text features bleed into all image dimensions uniformly |
| Hadamard product | φ(img) ⊙ φ(txt) | Dense text embedding suppresses most image features |
| Concat + linear | W·[φ(img); φ(txt)] | Linear, so it cannot express selective nonlinear edits well |
| FiLM | γ(t)⊙φ(img)+β(t) | Gates depend only on text, so the gating is blind to the source image |
| TIRG | w_g(img,txt)⊙φ(img) + w_r(img,txt)⊙g(img,txt) | Both gates conditioned on both inputs simultaneously |
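To make the contrast in the table concrete, here is a minimal sketch of three of the baselines in plain Python. The matrices `W_gamma` and `W_beta` are hypothetical stand-ins for FiLM's learned projections, and the two-dimensional vectors are made up for illustration. The point to notice is that `film_modulation` never receives the image, so the same text always yields the same scale and shift.

```python
def matvec(W, x):
    """Multiply a list-of-rows matrix W by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def compose_sum(phi_img, phi_txt):
    """Sum baseline: text is added to every image dimension."""
    return [a + t for a, t in zip(phi_img, phi_txt)]

def compose_hadamard(phi_img, phi_txt):
    """Hadamard baseline: text rescales every image dimension."""
    return [a * t for a, t in zip(phi_img, phi_txt)]

def film_modulation(phi_txt, W_gamma, W_beta):
    """FiLM scale and shift, computed from the text alone."""
    return matvec(W_gamma, phi_txt), matvec(W_beta, phi_txt)

def compose_film(phi_img, phi_txt, W_gamma, W_beta):
    gamma, beta = film_modulation(phi_txt, W_gamma, W_beta)
    return [g * a + b for g, a, b in zip(gamma, phi_img, beta)]

# Hypothetical toy parameters and embeddings.
W_gamma = [[1.0, 0.0], [0.0, 0.5]]
W_beta = [[0.0, 0.2], [0.1, 0.0]]
txt = [1.0, 2.0]
img_a = [0.5, -1.0]
img_b = [2.0, 3.0]

gamma, beta = film_modulation(txt, W_gamma, W_beta)   # same for img_a and img_b
film_a = compose_film(img_a, txt, W_gamma, W_beta)
film_b = compose_film(img_b, txt, W_gamma, W_beta)
```

Both `film_a` and `film_b` are produced with an identical `gamma` and `beta`, whereas the TIRG gates in the table take the concatenated image and text as input and can therefore differ per image.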
The Loss Function: Batch-Hard Triplet Loss
The architecture produces a composed query vector f(img, txt). Training needs a signal that says: this vector should be close to the correct target and far from everything else. That's the job of the triplet loss.
Triplet loss from first principles
A triplet is (anchor, positive, negative). Here:
- Anchor = the composed query f(r, t)
- Positive = the correct target embedding φ(x*)
- Negative = any other image φ(x⁻) that is not the correct target
L = max( 0, d(anchor, positive) − d(anchor, negative) + α )

- d(a,b) = ‖a − b‖₂ (after L2 normalization)
- α = margin hyperparameter (typically 0.1 to 0.5)

The loss fires only when the negative is dangerously close: d(anchor, negative) < d(anchor, positive) + α
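As a quick illustration of when the loss fires, here is the single-triplet formula in plain Python. The margin of 0.2 and the two-dimensional unit vectors are arbitrary choices for this sketch.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [1.0, 0.0]           # composed query (unit length)
positive = [0.8, 0.6]         # correct target
easy_negative = [0.0, 1.0]    # far from the query
hard_negative = [0.96, 0.28]  # closer to the query than the positive is

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
```

The easy negative already sits outside the margin, so `easy_loss` is zero and contributes no gradient. The hard negative is closer to the anchor than the positive, so `hard_loss` is positive.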
Why batch-hard mining?
Randomly sampled negatives are usually too easy. For a query like "red dress to blue," a photo of a hammer teaches the model nothing. Batch-hard mining instead picks the hardest triplet inside each mini-batch:
For each anchor i in the batch:

- hardest positive = argmax_{j: y_j=y_i} d(f_i, φ(x_j)) ← correct target FURTHEST from the query
- hardest negative = argmin_{k: y_k≠y_i} d(f_i, φ(x_k)) ← wrong image CLOSEST to the query
- loss_i = max( 0, d(f_i, hardest_pos) − d(f_i, hardest_neg) + α )
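The mining rule above can be sketched directly as a loop over the batch. This is a plain-Python illustration of the procedure as described in this article, with made-up embeddings and labels; a practical version would compute the full distance matrix with a tensor library. Anchors with no negative in the batch are skipped, which is a choice made for this sketch.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet_loss(queries, targets, labels, margin=0.2):
    """Mean over anchors of max(0, hardest_pos - hardest_neg + margin)."""
    n = len(queries)
    losses = []
    for i in range(n):
        # Distances from query i to targets sharing its label (positives)
        pos = [euclidean(queries[i], targets[j]) for j in range(n) if labels[j] == labels[i]]
        # Distances from query i to all other targets (negatives)
        neg = [euclidean(queries[i], targets[k]) for k in range(n) if labels[k] != labels[i]]
        if not neg:
            continue  # no negative available for this anchor
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0

# Toy batch: three composed queries, their targets, and one label per pair.
queries = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
targets = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
labels = [0, 1, 2]

loss_small_margin = batch_hard_triplet_loss(queries, targets, labels, margin=0.2)
loss_large_margin = batch_hard_triplet_loss(queries, targets, labels, margin=0.5)
```

With the smaller margin every anchor is already separated from its hardest negative and the loss is zero. Raising the margin makes two of the three anchors violate it, so the loss becomes positive.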
Full training objective
L = (1/|B|) · Σᵢ max( 0, ‖f(rᵢ,tᵢ) − φ(xᵢ*)‖ − ‖f(rᵢ,tᵢ) − φ(xᵢ⁻)‖ + α )

where xᵢ⁻ is the hardest negative for query i in the batch.

Gradients flow through everything jointly:

- ∂L/∂W_g, ∂L/∂W_r ← keep and add gates
- ∂L/∂W_MLP ← residual transform
- ∂L/∂W_CNN ← image encoder (fine-tuned)
- ∂L/∂W_LSTM ← text encoder (fine-tuned)
Everything trains jointly from one signal: does the composed query end up near the right image?
Three Datasets, Three Flavors of the Problem
CSS. Synthetic colored 3D objects. Queries look like "find something shaped like this but larger and green." There are two settings: 3D-to-3D, where both query and target are 3D renders, and 2D-to-3D, where the query is flat but the target is 3D. It is a clean setup for testing localized edits.
Fashion200k. 200,000 real fashion product images. Most edits change one attribute at a time, such as color, style, material, or cut. This dataset is much noisier than CSS because many garments look very similar.
MIT-States. Natural images from 245 object categories with 115 adjective modifiers: "change state to melted," "change state to ancient," or "change state to broken." These edits are more abstract, which makes MIT-States the hardest benchmark here.
A Quick Note on Recall@K
Retrieval is evaluated with Recall@K (R@K): what fraction of correct answers appear in the top-K results? Since composed retrieval has one correct target per query, R@K simplifies to "did the target land in the top K?", averaged across all test queries.
- R@1: strict. The correct result has to be first.
- R@10: more forgiving, and closer to what a person might actually browse.
- R@50: a broader check for whether the system is at least in the right region.
A good system gains across all three: it doesn't just put the answer somewhere in the ranking, it puts it near the top.
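For the single-target case described above, R@K reduces to a hit count. Here is a minimal sketch in plain Python, using made-up image IDs and rankings.

```python
def recall_at_k(ranked_ids_per_query, target_ids, k):
    """Fraction of queries whose target appears in the top-k ranked results."""
    hits = sum(
        1
        for ranked, target in zip(ranked_ids_per_query, target_ids)
        if target in ranked[:k]
    )
    return hits / len(target_ids)

# Three toy queries; each ranking lists image IDs from best to worst match.
rankings = [
    ["a", "b", "c"],  # target "a" is ranked 1st
    ["c", "a", "b"],  # target "a" is ranked 2nd
    ["b", "c", "a"],  # target "a" is ranked 3rd
]
targets = ["a", "a", "a"]

r_at_1 = recall_at_k(rankings, targets, k=1)
r_at_2 = recall_at_k(rankings, targets, k=2)
r_at_3 = recall_at_k(rankings, targets, k=3)
```

In this toy case the score climbs from one third to two thirds to one as K grows, which is the pattern the three cutoffs are meant to expose.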
The Results: TIRG Wins, But the Gap Tells the Story
Fashion200k: Table 1
| Method | R@1 | R@10 | R@50 |
|---|---|---|---|
| Han et al. [prior work] | 6.3 | 19.9 | 38.3 |
| Image only | 3.5 | 22.7 | 43.7 |
| Text only | 1.0 | 12.3 | 21.8 |
| Concatenation | 11.9 ±1.0 | 39.7 ±1.0 | 62.6 ±0.7 |
| Show and Tell | 12.3 ±1.1 | 40.2 ±1.7 | 61.8 ±0.9 |
| Parameter hashing | 12.2 ±1.1 | 40.0 ±1.1 | 61.7 ±0.8 |
| Relationship | 13.0 ±0.4 | 40.5 ±0.7 | 62.4 ±0.6 |
| MRN | 13.4 ±0.4 | 40.0 ±0.8 | 61.9 ±0.6 |
| FiLM | 12.9 ±0.7 | 39.5 ±2.1 | 61.9 ±1.9 |
| TIRG (ours) | 14.1 ±0.6 | 42.5 ±0.7 | 63.8 ±0.8 |
TIRG leads on all three metrics, including an 18% relative gain over concatenation at R@1.
On Fashion200k, TIRG reaches 14.1 R@1 and 63.8 R@50, beating every baseline. It also outperforms "Show and Tell," which tries to describe the target in language first. That suggests it is better to model the edit directly than to generate a verbal description of the final image.
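The relative gain quoted for R@1 follows directly from the table values. As a quick check in Python (the helper name is chosen only for this illustration):

```python
def relative_gain(new, baseline):
    """Improvement of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline

# R@1 on Fashion200k: TIRG 14.1 vs. Concatenation 11.9
gain = relative_gain(14.1, 11.9)
gain_percent = round(gain * 100)
```

The absolute difference is 2.2 points, which is roughly 18% of the concatenation score.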
MIT-States: Table 2
| Method | R@1 | R@5 | R@10 |
|---|---|---|---|
| Image only | 3.3 ±0.1 | 12.8 ±0.2 | 20.9 ±0.1 |
| Text only | 7.4 ±0.4 | 21.5 ±0.9 | 32.7 ±0.8 |
| Concatenation | 11.8 ±0.2 | 30.8 ±0.5 | 42.1 ±0.3 |
| Show and Tell | 11.9 ±0.1 | 31.0 ±0.5 | 42.0 ±0.8 |
| Att. as Operator | 8.1 ±0.5 | 27.3 ±0.3 | 39.1 ±0.3 |
| Relationship | 12.3 ±0.5 | 31.9 ±0.7 | 42.9 ±0.9 |
| MRN | 11.9 ±0.6 | 30.5 ±0.3 | 41.0 ±0.2 |
| FiLM | 10.1 ±0.3 | 27.7 ±0.7 | 38.3 ±0.7 |
| TIRG (ours) | 12.2 ±0.4 | 31.9 ±0.3 | 43.1 ±0.5 |
TIRG is roughly tied with Relationship across the board: 12.2 vs. 12.3 at R@1, 31.9 each at R@5, and 43.1 vs. 42.9 at R@10.
On MIT-States, TIRG gets 12.2 R@1 and 43.1 R@10. It has the best R@10 and ties Relationship at R@5, while Relationship is marginally ahead at top-1. All of these differences fall within the reported standard deviations. That fits the dataset: MIT-States is less about simple feature swaps and more about broader transformations like "rusted" or "cooked." When the text describes a bigger semantic change, the advantage of residual gating gets smaller.
TIRG works best when the change is local, like changing color or sleeve style. When the edit is broader and more semantic, the gap to the baselines gets smaller.
CSS Dataset: Table 3
| Method | 3D-to-3D | 2D-to-3D |
|---|---|---|
| Image only | 6.3 | 6.3 |
| Text only | 0.1 | 0.1 |
| Concatenation | 60.6 ±0.8 | 27.3 |
| Show and Tell | 33.0 ±3.2 | 6.0 |
| Parameter hashing | 60.5 ±1.9 | 31.4 |
| Relationship | 62.1 ±1.2 | 30.6 |
| MRN | 60.1 ±2.7 | 26.8 |
| FiLM | 65.6 ±0.5 | 43.7 |
| TIRG (ours) | 73.7 ±1.0 | 46.6 |
On 3D-to-3D, TIRG beats FiLM by about 8 points (73.7 vs. 65.6). The 2D-to-3D setting is harder for every method because the query and target come from different visual forms.
TIRG scores 73.7 R@1 on 3D-to-3D, which is 8 points above FiLM. The gap is biggest here because CSS is built around very targeted edits like "make it green" or "make it larger." That is exactly the kind of setting where gated composition helps most.
Limitations and Failure Modes
The gates are not interpretable. ResNet features are distributed and entangled. If color and texture co-vary in training data, which often happens in fashion, the model may not separate them cleanly. Changing color can accidentally shift the perceived fabric type too.
Multi-attribute modifications are hard. "Make it longer and beige" requires two independent modifications. A single residual vector g(·) must encode both simultaneously, and performance degrades as concurrent changes multiply.
Evaluation is binary. R@K counts only the annotated ground truth as correct. A different blue dress that matches the request perfectly still scores zero if it is not the labeled target. So the benchmark can understate real-world usefulness.
Triplet loss instability. Batch-hard mining can trigger collapsed embeddings in early training if negatives are too hard too soon. Warm-up strategies help but add fragility to the training recipe.
Scalability is untested. CSS, Fashion200k, and MIT-States are benchmarks, not production workloads. Real e-commerce has millions of items and users who don't phrase edits cleanly. Whether gating survives that messiness is an open question the paper leaves for future work.
Conclusion: The Gating Trick That Worked
The paper is more focused than its title makes it sound. It asks a clean question: how should image and text be combined when they play different roles? Its answer is residual gating conditioned on both inputs.
TIRG does well across all three datasets, with the biggest gains on tasks where the edits are localized, like color swaps or attribute changes. The idea is simple: start from the image, then apply the text as a modification instead of trying to rebuild everything from scratch.
The core idea still feels relevant. A lot of later multimodal systems run into the same basic tension: what should stay, and what should change? TIRG is an early and very clear example of building that distinction directly into the model.
If the query has a reference and an edit, the model should treat them that way. They are not two interchangeable inputs. That asymmetry is the task.