MLOps · Semantic Search · CVPR 2025

PhotoPrism Semantic Search with Continuous Learning

A self-hosted AI photo search engine built on Chameleon Cloud Kubernetes — combining CLIP retrieval, Qdrant ANN indexing, and a Qwen2-VL-2B reranker that fine-tunes itself from implicit user feedback using POLAR (CVPR 2025) LoRA adapters.

CLIP ViT-B/32 · Qwen2-VL-2B + LoRA · POLAR (CVPR 2025) · Qdrant HNSW · Kubernetes · Chameleon Cloud · MLflow · Flickr30K-CFQ

512-d CLIP embedding space

31,783 training images

0.2% trainable params (LoRA)

100 clicks → auto-retrain

Data Flow

Upload & Ingest Pipeline

Upload path is decoupled from vector indexing — PhotoPrism responds immediately while an async feature-worker daemon handles CLIP embedding and Qdrant upserts in the background.

Step 1–2

Photo Staging & Import

Browser POSTs a multipart photo. PhotoPrism validates and stages it, then moves it to Originals when the import PUT is triggered. An HMAC-SHA256-signed webhook then fires to ingest-api.
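A minimal sketch of how a webhook like this could be signed and verified, assuming the signature is the hex HMAC-SHA256 digest of the raw request body. The secret, the payload fields, and the function names are illustrative and not taken from PhotoPrism or ingest-api.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_hex: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, received_hex)


# Hypothetical shared secret and payload, for illustration only.
SECRET = b"example-shared-secret"
BODY = b'{"event": "photo.imported", "uid": "example-uid"}'

signature = sign_payload(SECRET, BODY)
```

The receiver should verify against the exact bytes it received, before any JSON parsing, since re-serializing the payload can change the digest.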

Step 3–4

S3 Upload & Job Enqueue

ingest-api downloads the original and uploads it to Chameleon Swift under originals/{uid}. It then inserts rows into image_metadata and feature_jobs in a single Postgres transaction, so both rows are written or neither is.
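The all-or-nothing insert can be sketched as follows. This uses the standard-library sqlite3 module as a stand-in for Postgres so the example runs anywhere; the column names and the 'pending' status value are assumptions, not the project's actual schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image_metadata (uid TEXT PRIMARY KEY, s3_key TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE feature_jobs (uid TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def register_upload(conn: sqlite3.Connection, uid: str, s3_key: str) -> None:
    """Insert the metadata row and the job row in one transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so a failure on the second insert also undoes the first.
    with conn:
        conn.execute(
            "INSERT INTO image_metadata (uid, s3_key) VALUES (?, ?)",
            (uid, s3_key),
        )
        conn.execute(
            "INSERT INTO feature_jobs (uid, status) VALUES (?, 'pending')",
            (uid,),
        )
```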

Step 5–6

Worker Pickup

feature-worker polls every 5s using FOR UPDATE SKIP LOCKED — supports horizontal scaling of replicas without a distributed lock manager.
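A sketch of the claim-and-poll pattern, assuming a feature_jobs table with id, uid, and status columns (the column names and status values are hypothetical). The connection is any DB-API 2.0 connection to Postgres, for example one from psycopg; no driver is imported here.

```python
import time

POLL_INTERVAL_S = 5  # polling interval stated in the text

# Rows locked by another worker are skipped instead of waited on,
# so two replicas never claim the same job.
CLAIM_SQL = """
UPDATE feature_jobs
SET status = 'running'
WHERE id = (
    SELECT id
    FROM feature_jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, uid;
"""


def claim_next_job(conn):
    """Claim one pending job and return its row, or None if none is free."""
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        conn.commit()
        return row
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()


def run_worker(conn, handle_job, max_polls=None, sleep=time.sleep):
    """Poll for jobs; sleep only when the queue is empty."""
    polls = 0
    while max_polls is None or polls < max_polls:
        job = claim_next_job(conn)
        if job is None:
            sleep(POLL_INTERVAL_S)
        else:
            handle_job(job)
        polls += 1
```

Because the claim is a single statement, adding replicas only adds more pollers; Postgres row locks do the coordination.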

Step 7–8

CLIP Embed + Qdrant

CLIP ViT-B/32 produces a 512-d L2-normalized unit vector. Upserted into Qdrant HNSW collection with cosine distance. Job marked done.

Inference Pipeline

Semantic Search & Reranking

A sem: prefix routes queries through a 5-stage pipeline: CLIP text embedding → ANN retrieval → multimodal reranking → analytics logging → result rendering.
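The prefix routing can be sketched as below. The route names "semantic" and "keyword" and the case-insensitive match are assumptions for illustration; the text only specifies the sem: prefix itself.

```python
SEM_PREFIX = "sem:"


def route_query(raw: str) -> tuple[str, str]:
    """Return (route, query_text) for a raw search string."""
    query = raw.strip()
    if query.lower().startswith(SEM_PREFIX):
        # Strip the prefix and pass the remainder to the semantic pipeline.
        return "semantic", query[len(SEM_PREFIX):].strip()
    # Everything else goes to the regular (non-semantic) search path.
    return "keyword", query
```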

search-api — live trace

$ GET /search?q=sem:water+at+sunset
────────────────────────────
stage 1  text embed → POST /embed/text
         dim=512, latency=4.2ms
stage 2  ANN search → qdrant.search(limit=10)
         top score=0.31, latency=1.8ms
stage 3  rerank     → POST /rerank (10 candidates)
         qwen2-vl bfloat16, latency=340ms
stage 4  analytics  → INSERT search_queries + results
stage 5  response   → {query_id, hits[10]}
✓ query_id=a3f7c21b total=346ms

①

Text Embedding

CLIP text transformer + projection head produces 512-d unit vector in the joint image-text space.

②

ANN Retrieval

Qdrant HNSW cosine search: approximate nearest-neighbour lookup in roughly O(log N) time. Hard cap of 10 candidates for the reranker.

③

Multimodal Reranking

Qwen2-VL-2B scores each candidate image+query pair. Softmax across all 10 → probability distribution.

④

Analytics Logging

Every query + ranked results written to Postgres. clicked=0 initialised; updated to 1 on interaction.

⑤

Fallback Guarantee

If reranker-api is unreachable, search-api falls back to ANN order. Increments fallback_error counter.
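The fallback rule can be sketched like this. The rerank callable and the choice of exceptions that count as "unreachable" are assumptions, and a plain Counter stands in for whatever metrics client the service uses.

```python
from collections import Counter

metrics = Counter()  # stand-in for a real metrics client


def rank_with_fallback(query, ann_hits, rerank):
    """Return reranked hits, or the ANN order if the reranker is unreachable."""
    try:
        return list(rerank(query, ann_hits))
    except (ConnectionError, TimeoutError):
        # Reranker down or too slow: keep the ANN order and record the event.
        metrics["fallback_error"] += 1
        return list(ann_hits)
```

Catching only connection-level errors keeps genuine bugs in the reranker visible instead of silently masking them as fallbacks.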

Continuous Learning Loop

Model Architecture

Qwen2-VL-2B + LoRA Reranker

A 2B-parameter vision-language model rescores ANN candidates using both the query text and the actual image. LoRA adapters (r=8) target all attention projections — only 0.2% of parameters train, enabling A100 fine-tuning in minutes.

r = 8 LoRA rank

α = 16 LoRA alpha (2r)

4 modules q/k/v/o_proj targeted

bfloat16 inference precision
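Why so few parameters train: LoRA freezes a weight matrix of shape d_out × d_in and learns two low-rank factors, B (d_out × r) and A (r × d_in), with the update scaled by α / r. The sketch below counts the added parameters for a single projection. The width of 1536 is a hypothetical value chosen for illustration, and the result is the share for that one matrix, not the whole-model 0.2% figure, which is lower because unadapted weights also count in the denominator.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


RANK = 8
ALPHA = 16
SCALING = ALPHA / RANK  # multiplier applied to the low-rank update B @ A

D = 1536  # hypothetical square projection width, for illustration only

added = lora_added_params(D, D, RANK)  # 8 * (1536 + 1536) = 24576
frozen = D * D                         # 2359296 frozen weights in the base matrix
ratio = added / frozen                 # about 1.04% of this one matrix
```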

Scoring Mechanism

Each candidate image is fetched via a presigned S3 URL (5-min window). The model receives the query text and image, produces logits at the last token position, and a softmax is applied across all 10 candidates to yield a probability distribution.

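import torch

# Qwen2-VL prompt template (query_text is filled in per request)
prompt = f"Does this image match: '{query_text}'? Answer with score 0–1."

# Extract last-token logit: reduce the final position's logits to one scalar
score = out.logits[:, -1, :].mean()

# Softmax across candidates (scores: 1-D tensor, one score per candidate)
probs = torch.softmax(scores, dim=0)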
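For reference, the softmax step across candidates can be written without any dependencies. This is a generic, numerically stable version, not the project's code; the scores in the usage note are made up.

```python
import math


def softmax(scores):
    """Map raw scores to probabilities that sum to 1."""
    peak = max(scores)
    # Subtracting the maximum avoids overflow in exp() for large scores.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Calling softmax([2.0, 1.0, 0.1]) keeps the input order, so the candidate with the highest raw score also gets the highest probability.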

Auto-Redeploy Pipeline

After retraining: LoRA weights saved → Makefile bumps patch tag → Docker build+push → running container stopped+restarted with new image. INTERNAL_TOKEN preserved by inspecting live container env before stop.

Training Strategy

Flickr30K-CFQ + POLAR Architecture

Models trained on verbose caption benchmarks fail on real short queries. Flickr30K-CFQ addresses this with 5 query types per image — from raw sentences to bare keyword tags — reflecting actual search behavior.

Query Type Diversity

31,783 Flickr30K images × 5 query types ≈ 159K training triples. Generated using LLaVA (tags), Llama 3.1 8B (paraphrasing), and Flickr30K Entities annotations.

Raw sentence "Two men in hard hats look at a blueprint"

Similar "A pair of workers review construction plans"

Fragment "hard hat blueprint"

Phrase "men looking at plans"

Tag "construction plan worker safety"

Loss Function

Binary cross-entropy with logits over (query, image, label=1.0/0.0) triples. Negatives are results that were shown to the user but not clicked. This introduces position bias, since lower-ranked results are clicked less often regardless of relevance (known limitation).

ℒ = BCE(ŷ_logit, y)
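A dependency-free sketch of the loss and of turning a click log into training triples. The stable form max(x, 0) − x·y + log(1 + e^(−|x|)) is the standard way to compute BCE on logits; the function names and the (image_id, clicked) log shape are assumptions.

```python
import math


def bce_with_logits(logit: float, label: float) -> float:
    """Binary cross-entropy on a raw logit, computed without overflow."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def triples_from_log(query: str, shown):
    """Turn (image_id, clicked) pairs into (query, image_id, label) triples."""
    return [(query, image_id, 1.0 if clicked else 0.0) for image_id, clicked in shown]


def mean_loss(logits, labels) -> float:
    """Average the per-example loss over a batch."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, labels)]
    return sum(losses) / len(losses)
```

A logit of 0 means the model is undecided, so the loss is log 2 for either label; confident correct predictions drive the loss toward 0.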

Training Configuration

# feedback_train.py hyperparameters
model       Qwen2-VL-2B-Instruct
lora_rank   r=8, α=16
dropout     0.05
targets     q/k/v/o_proj
optimizer   AdamW lr=2e-4
precision   bfloat16 (AMP)
epochs      1 (online)
threshold   100 untrained rows

# post-training metrics
eval        Recall@1/5/10
            nDCG@1/5/10
            score separation
tracker     MLflow artifacts
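The two ranking metrics can be sketched as below, assuming binary relevance (an image is either relevant to the query or not). These are the textbook definitions, not the project's evaluation code; "score separation" is left out because its definition is not given here.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    # A hit at zero-based position i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked_ids[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Recall@k only asks whether the relevant image made the cut, while nDCG@k also rewards placing it nearer the top.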

POLAR (CVPR 2025)

This project carries the POLAR paradigm over from personalizing CLIP to the reranking stage. CLIP ViT-B/32 (frozen) handles first-stage retrieval; Qwen2-VL-2B with LoRA rescores using implicit click history, so the system adapts to each user's search patterns over time.

Performance Engineering

Inference Optimizations

Seven targeted optimizations across CLIP, Qdrant, and the reranker keep end-to-end semantic search under 400ms even with GPU reranking.

CLIP

L2 Normalization at Embed Time

Unit vectors from both CLIP encoders mean cosine similarity in Qdrant reduces to a dot product — no re-normalization at query time.
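A small check of the identity behind this optimization: once both vectors have unit length, the cosine denominator is 1, so cosine similarity equals the plain dot product. The 2-d vectors in the usage note are toy values, not CLIP embeddings.

```python
import math


def dot(a, b) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b) -> float:
    """Cosine similarity with explicit normalization in the denominator."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

For a = [3.0, 4.0] and b = [1.0, 0.0], cosine(a, b) and dot(l2_normalize(a), l2_normalize(b)) agree, which is why normalizing once at embed time is enough.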

CLIP

No-Grad Inference

torch.no_grad() disables autograd tracking, so no computation graph or intermediate activations are kept for a backward pass. This cuts memory allocation by roughly 50% vs. training mode.

Qdrant

Hard Cap on Reranker Candidates

limit=min(top_k, 10) regardless of UI request size (up to 156). Bounds reranker latency since each candidate needs a GPU forward pass.
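The cap and its effect on latency can be sketched as follows. The per-candidate cost of 34 ms comes from dividing the 340 ms rerank time in the trace by its 10 candidates; treating latency as linear in the candidate count is an assumption made only to show the order of magnitude.

```python
MAX_RERANK_CANDIDATES = 10


def rerank_limit(top_k: int) -> int:
    """Number of ANN hits to request, capped for the reranker."""
    return min(top_k, MAX_RERANK_CANDIDATES)


def estimated_rerank_ms(candidates: int, per_candidate_ms: float = 34.0) -> float:
    """Rough latency estimate, assuming cost grows linearly with candidates."""
    return candidates * per_candidate_ms


capped = estimated_rerank_ms(rerank_limit(156))  # 10 candidates -> 340 ms
uncapped = estimated_rerank_ms(156)              # 156 candidates -> 5304 ms
```

Under that assumption, an uncapped request for 156 results would spend over five seconds in the reranker alone.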

Reranker

bfloat16 Autocast

All reranker passes use torch.amp.autocast("cuda", dtype=torch.bfloat16), giving ~50% VRAM reduction vs. float32 with negligible precision loss.