MLOps · Semantic Search · CVPR 2025

PhotoPrism Semantic Search with Continuous Learning

A self-hosted AI photo search engine running on Kubernetes on Chameleon Cloud. It combines CLIP retrieval, Qdrant ANN indexing, and a Qwen2-VL-2B reranker that is fine-tuned from implicit user feedback with LoRA adapters, following the POLAR (CVPR 2025) approach.

CLIP ViT-B/32 · Qwen2-VL-2B + LoRA · POLAR (CVPR 2025) · Qdrant HNSW · Kubernetes · Chameleon Cloud · MLflow · Flickr30K-CFQ
512-d CLIP embedding space
31,783 training images
0.2% trainable params (LoRA)
100 clicks → auto-retrain

Infrastructure Topology

Two compute environments on Chameleon Cloud: a 3-node Kubernetes cluster at KVM@TACC running the main ML platform, and a dedicated A100 GPU VM at CHI@UC for reranker serving and online retraining.

Upload & Ingest Pipeline

Upload path is decoupled from vector indexing — PhotoPrism responds immediately while an async feature-worker daemon handles CLIP embedding and Qdrant upserts in the background.

Step 1–2
Photo Staging & Import
Browser POSTs multipart photo. PhotoPrism validates, stages, then moves to Originals on PUT trigger. HMAC-SHA256 webhook fires to ingest-api.
Step 3–4
S3 Upload & Job Enqueue
ingest-api downloads the original and uploads it to Chameleon Swift under originals/{uid}, then atomically inserts rows into image_metadata and feature_jobs in Postgres.
Step 5–6
Worker Pickup
feature-worker polls every 5s using FOR UPDATE SKIP LOCKED — supports horizontal scaling of replicas without a distributed lock manager.
Step 7–8
CLIP Embed + Qdrant
CLIP ViT-B/32 produces a 512-d L2-normalized unit vector. Upserted into Qdrant HNSW collection with cosine distance. Job marked done.
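
The webhook check and the worker pickup in the steps above can be sketched as follows. This is a minimal illustration, not the project's code: the hex-digest signature format, the function names, and the feature_jobs columns (id, status, created_at) are assumptions. Only HMAC-SHA256 and FOR UPDATE SKIP LOCKED come from the description.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check an HMAC-SHA256 signature over the raw request body.

    Assumes the sender transmits the digest as a hex string.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, signature_hex)


# Hypothetical schema: feature_jobs(id, status, created_at).
CLAIM_JOB_SQL = """
UPDATE feature_jobs
SET status = 'running'
WHERE id = (
    SELECT id FROM feature_jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""


def claim_next_job(cursor):
    """Claim one pending job using any DB-API cursor connected to Postgres.

    Rows locked by another worker are skipped rather than waited on, so
    replicas do not claim the same job. Returns the job id, or None when
    no unlocked pending job exists.
    """
    cursor.execute(CLAIM_JOB_SQL)
    row = cursor.fetchone()
    return row[0] if row else None
```

Because the row lock is taken inside the claiming statement itself, adding worker replicas needs no coordination beyond the database.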

Qwen2-VL-2B + LoRA Reranker

A 2B-parameter vision-language model rescores ANN candidates using both the query text and the actual image. LoRA adapters (r=8) target all attention projections — only 0.2% of parameters train, enabling A100 fine-tuning in minutes.

r = 8 LoRA rank
α = 16 LoRA alpha (2r)
4 projections q/k/v/o_proj targeted
bfloat16 inference precision
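
To make the parameter savings concrete, the sketch below counts the trainable weights LoRA adds to one projection and computes the scaling factor of the standard LoRA update W + (α/r)·B·A. The 1024×1024 layer size is illustrative only and is not a Qwen2-VL dimension; r=8 and α=16 are the values listed above.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights added by LoRA: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return alpha / r


# Illustrative layer size (an assumption, not a Qwen2-VL dimension).
d_in = d_out = 1024
full = d_in * d_out                      # frozen weights in the projection
added = lora_param_count(d_in, d_out, r=8)
fraction = added / full
scale = lora_scaling(alpha=16, r=8)
print(f"{added} trainable vs {full} frozen weights ({fraction:.2%}), scale={scale}")
```

The whole-model fraction is lower than this per-projection figure because the MLP and embedding weights carry no adapters.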

Scoring Mechanism

Each candidate image is fetched via a presigned S3 URL (5-min window). The model receives the query text and image, produces logits at the last token position, and a softmax is applied across all 10 candidates to yield a probability distribution.

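import torch

# Qwen2-VL prompt template; fill with prompt.format(query_text=...)
prompt = (
    "Does this image match: "
    "'{query_text}'? "
    "Answer with score 0–1."
)
# Per-candidate score: mean of the last-token logits over the vocabulary
score = out.logits[:, -1, :].mean()
# Softmax across the stacked per-candidate scores
probs = torch.softmax(torch.stack(scores), dim=0)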
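
The final normalization step can also be illustrated without any model dependency. The snippet below is a minimal sketch of a numerically stable softmax over per-candidate scores; the score values are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-candidate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw scores for three candidates (the service scores up to 10).
raw_scores = [2.0, 0.5, -1.0]
probs = softmax(raw_scores)
# Rank candidate indices by descending probability.
ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
```

Softmax is monotonic, so the ranking matches the ranking by raw score. The probabilities are only meaningful relative to the candidates in the same request.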

Auto-Redeploy Pipeline

After retraining: LoRA weights saved → Makefile bumps patch tag → Docker build+push → running container stopped+restarted with new image. INTERNAL_TOKEN preserved by inspecting live container env before stop.
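
One way to carry INTERNAL_TOKEN across the restart is to read the live container's environment before stopping it. The sketch below parses the JSON list printed by `docker inspect --format '{{json .Config.Env}}' <container>` and bumps a patch tag. The helper names, the tag format, and the sample values are hypothetical, and the project's Makefile may do this differently.

```python
import json


def parse_container_env(inspect_json: str) -> dict:
    """Turn a JSON list of "KEY=VALUE" strings into a dict."""
    env = {}
    for entry in json.loads(inspect_json):
        key, _, value = entry.partition("=")  # split on the first "=" only
        env[key] = value
    return env


def bump_patch(tag: str) -> str:
    """Increment the last numeric component of a tag like "v1.4.2" (assumed format)."""
    head, _, patch = tag.rpartition(".")
    return f"{head}.{int(patch) + 1}"


# Sample inspect output with placeholder values.
sample = '["PATH=/usr/bin", "INTERNAL_TOKEN=abc=123"]'
token = parse_container_env(sample)["INTERNAL_TOKEN"]
next_tag = bump_patch("v1.4.2")
```

Splitting on the first "=" only matters because token values may themselves contain "=" characters.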

Flickr30K-CFQ + POLAR Architecture

Models trained on verbose caption benchmarks fail on real short queries. Flickr30K-CFQ addresses this with 5 query types per image — from raw sentences to bare keyword tags — reflecting actual search behavior.

Query Type Diversity

31,783 Flickr30K images × 5 query types ≈ 159K training triples. Generated using LLaVA (tags), Llama 3.1 8B (paraphrasing), and Flickr30K Entities annotations.

Raw sentence: "Two men in hard hats look at a blueprint"
Similar: "A pair of workers review construction plans"
Fragment: "hard hat blueprint"
Phrase: "men looking at plans"
Tag: "construction plan worker safety"

Loss Function

Binary cross-entropy with logits over (query, image, label=1.0/0.0) triples. Negatives are results that were shown to the user but not clicked. Because lower-ranked results are less likely to be seen at all, this introduces position bias (a known limitation).

ℒ = BCE(ŷ_logit, y)
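
For reference, the per-example loss can be written in plain Python. This is a sketch of the standard overflow-safe form of binary cross-entropy on a raw logit, not the training script itself; the logits in the example are made up.

```python
import math


def bce_with_logits(logit: float, label: float) -> float:
    """Binary cross-entropy on a raw logit, in the overflow-safe form
    max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def batch_loss(pairs):
    """Mean loss over (logit, label) pairs; label is 1.0 for clicked, 0.0 otherwise."""
    return sum(bce_with_logits(x, y) for x, y in pairs) / len(pairs)


# Made-up logits: one clicked result and two shown-but-not-clicked results.
loss = batch_loss([(2.0, 1.0), (-1.0, 0.0), (0.5, 0.0)])
```

A confident correct prediction contributes almost nothing, while a positive logit on a non-clicked result (the 0.5 above) is penalized the most of the three.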

Training Configuration

# feedback_train.py hyperparameters
model       Qwen2-VL-2B-Instruct
lora_rank   r=8, α=16
dropout     0.05
targets     q/k/v/o_proj
optimizer   AdamW lr=2e-4
precision   bfloat16 (AMP)
epochs      1 (online)
threshold   100 untrained rows
# post-training metrics
eval        Recall@1/5/10, nDCG@1/5/10, score separation
tracker     MLflow artifacts
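
The two ranking metrics named above can be sketched as follows, assuming binary relevance (an image is either a clicked match or not). The image ids are made up, and the project's evaluation code may differ in detail.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top k."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked_ids[:k])
        if item in relevant_ids
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant_ids))))
    return dcg / ideal


ranked = ["img_b", "img_a", "img_c"]  # made-up reranker output
relevant = {"img_a"}                   # made-up ground truth
```

Recall@k only asks whether the match made the cut; nDCG@k also rewards placing it nearer the top.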

POLAR (CVPR 2025)

This project carries the POLAR paradigm over from CLIP personalization to reranking. CLIP ViT-B/32 (frozen) handles first-stage retrieval; Qwen2-VL-2B with LoRA rescores the candidates using implicit click history, so results adapt to each user's search patterns over time.

Inference Optimizations

Seven targeted optimizations across CLIP, Qdrant, and the reranker keep end-to-end semantic search under 400ms even with GPU reranking.

CLIP

L2 Normalization at Embed Time

Because both CLIP encoders emit unit vectors, cosine similarity in Qdrant reduces to a dot product, with no re-normalization at query time.
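
The identity behind this optimization is easy to check numerically. The sketch below uses toy 3-d vectors standing in for the 512-d CLIP embeddings; the values are made up.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed the long way, with both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for an image embedding and a text embedding.
image_vec = [3.0, 4.0, 0.0]
text_vec = [1.0, 2.0, 2.0]

full_cosine = cosine(image_vec, text_vec)
unit_dot = dot(l2_normalize(image_vec), l2_normalize(text_vec))
```

Both values agree, so normalizing once at embed time removes the two norm computations from every comparison.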

CLIP

No-Grad Inference

torch.no_grad() disables autograd tracking, cutting memory allocation ~50% vs. training mode. No computation graph is built and no activations are saved for a backward pass.

Qdrant

Hard Cap on Reranker Candidates

limit=min(top_k, 10) regardless of UI request size (up to 156). Bounds reranker latency since each candidate needs a GPU forward pass.
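
The cap amounts to a one-line clamp. In the sketch below the constant and function names are hypothetical; only the min(top_k, 10) rule comes from the description above.

```python
RERANK_CAP = 10  # hard cap from the description above


def rerank_limit(top_k: int) -> int:
    """Number of ANN candidates to pass to the reranker."""
    return min(top_k, RERANK_CAP)


def worst_case_gpu_passes(requested_sizes):
    """One forward pass per candidate, so the cap bounds GPU work per query."""
    return max(rerank_limit(k) for k in requested_sizes)
```

For example, a UI request for 156 results still sends only 10 candidates to the GPU.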

Reranker

bfloat16 Autocast

All reranker passes run under torch.amp.autocast("cuda", dtype=torch.bfloat16), for ~50% VRAM reduction vs. float32 with negligible precision loss.

Reranker

Pixel Budget Optimization

Inference caps each image at 128–256 × 28² pixels (min–max), vs. 256–512 × 28² during training. Fewer visual tokens per image means lower forward-pass latency at serving time.
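
The budgets translate into concrete pixel counts as sketched below. The sketch assumes one visual token per 28×28 pixel block, which is how the Qwen2-VL processor's min_pixels / max_pixels arguments are commonly described; the helper name is hypothetical.

```python
PATCH_AREA = 28 * 28  # pixels covered by one visual token (assumed, see above)


def pixel_budget(min_tokens: int, max_tokens: int):
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PATCH_AREA, max_tokens * PATCH_AREA


serve_min, serve_max = pixel_budget(128, 256)  # inference
train_min, train_max = pixel_budget(256, 512)  # training
```

Under this assumption the serving ceiling equals the training floor, so an image at serving time uses at most half the visual tokens of the largest training image.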

System

Single-Flight Retrain Guard

asyncio.Lock ensures only one retrain fires per threshold breach — even if 100 clicks arrive simultaneously. Others return immediately.
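
A single-flight guard of this kind can be sketched with the standard library alone. The class and function names are hypothetical, and the retrain coroutine is a stand-in for the real training job.

```python
import asyncio

RETRAIN_THRESHOLD = 100  # untrained feedback rows, from the description above


class RetrainGuard:
    def __init__(self, retrain):
        self._lock = asyncio.Lock()
        self._retrain = retrain  # hypothetical coroutine that runs the training job

    async def maybe_retrain(self, untrained_rows: int) -> bool:
        """Start a retrain if the threshold is met and none is in flight."""
        if untrained_rows < RETRAIN_THRESHOLD or self._lock.locked():
            return False  # below threshold, or another caller is already retraining
        # No await sits between the check and the acquire, so on a single
        # event loop no other task can slip in between them.
        async with self._lock:
            await self._retrain()
        return True


async def demo():
    runs = 0

    async def fake_retrain():
        nonlocal runs
        runs += 1
        await asyncio.sleep(0.01)  # stand-in for the real training job

    guard = RetrainGuard(fake_retrain)
    results = await asyncio.gather(*(guard.maybe_retrain(100) for _ in range(100)))
    return runs, sum(results)
```

Running `asyncio.run(demo())` fires 100 concurrent triggers; only the first one starts a retrain and the rest return immediately.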