A self-hosted AI photo search engine built on Chameleon Cloud Kubernetes — combining CLIP retrieval, Qdrant ANN indexing, and a Qwen2-VL-2B reranker that fine-tunes itself from implicit user feedback using POLAR (CVPR 2025) LoRA adapters.
Two compute environments on Chameleon Cloud: a 3-node Kubernetes cluster at KVM@TACC running the main ML platform, and a dedicated A100 GPU VM at CHI@UC for reranker serving and online retraining.
Upload path is decoupled from vector indexing — PhotoPrism responds immediately while an async feature-worker daemon handles CLIP embedding and Qdrant upserts in the background.
ingest-api downloads the original and uploads it to Chameleon Swift under originals/{uid}, then atomically inserts rows into image_metadata and feature_jobs in Postgres.
feature-worker polls every 5s using FOR UPDATE SKIP LOCKED, so replicas can scale horizontally without a distributed lock manager.
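The job-claiming pattern behind FOR UPDATE SKIP LOCKED can be sketched as follows. This is a minimal illustration, not the project's actual worker: only the feature_jobs table name and the locking clause come from the text, while the column names (id, image_uid, status, created_at) and the status values are assumptions.

```python
# Hypothetical schema: feature_jobs(id, image_uid, status, created_at).
# Only the table name and the SKIP LOCKED clause come from the description.
CLAIM_JOB_SQL = """
UPDATE feature_jobs
SET status = 'running'
WHERE id = (
    SELECT id FROM feature_jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, image_uid
"""


def claim_next_job(cursor):
    """Claim one pending job, or return None when nothing is claimable.

    `cursor` is any DB-API 2.0 cursor connected to Postgres. Rows locked by
    another worker's open transaction are skipped instead of waited on, so
    two replicas never claim the same job.
    """
    cursor.execute(CLAIM_JOB_SQL)
    return cursor.fetchone()
```

A worker loop would call claim_next_job inside a transaction, sleep for the polling interval when it returns None, and commit once the embedding has been upserted.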
A sem: prefix routes queries through a 5-stage pipeline: CLIP text embedding → ANN retrieval → multimodal reranking → analytics logging → result rendering.
CLIP text transformer + projection head produces 512-d unit vector in the joint image-text space.
Qdrant HNSW cosine search — O(log N) approximate nearest neighbour. Hard cap at 10 candidates for reranker.
Qwen2-VL-2B scores each candidate image+query pair. Softmax across all 10 → probability distribution.
Every query + ranked results written to Postgres. clicked=0 initialised; updated to 1 on interaction.
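As a rough sketch of this logging step, the snippet below records each ranked result with clicked=0 and flips it to 1 on interaction. It uses sqlite3 purely as a stand-in so the example runs anywhere; the table name search_events and its columns are hypothetical, and the real system writes to Postgres.

```python
import sqlite3

# Hypothetical table; the source only states that clicked starts at 0.
SCHEMA = """
CREATE TABLE search_events (
    query     TEXT NOT NULL,
    image_uid TEXT NOT NULL,
    position  INTEGER NOT NULL,
    clicked   INTEGER NOT NULL DEFAULT 0
)
"""


def log_results(conn, query, ranked_ids):
    """Insert one row per ranked result, all initialised with clicked=0."""
    conn.executemany(
        "INSERT INTO search_events (query, image_uid, position, clicked) "
        "VALUES (?, ?, ?, 0)",
        [(query, uid, pos) for pos, uid in enumerate(ranked_ids, start=1)],
    )


def mark_clicked(conn, query, image_uid):
    """Set clicked=1 for the result the user interacted with."""
    conn.execute(
        "UPDATE search_events SET clicked = 1 "
        "WHERE query = ? AND image_uid = ?",
        (query, image_uid),
    )
```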
If reranker-api is unreachable, search-api falls back to ANN order. Increments fallback_error counter.
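The fallback behaviour might look like the sketch below. The function name, the rerank callable, and the Counter used as a metrics stand-in are all illustrative assumptions; only the fallback to ANN order and the fallback_error counter name come from the description.

```python
from collections import Counter

# Stand-in for a real metrics client (assumption).
metrics = Counter()


def rerank_or_fallback(query, ann_ids, rerank):
    """Return reranked ids, or the original ANN order if the reranker fails.

    `rerank` is any callable (query, ids) -> ids that raises ConnectionError
    or TimeoutError when the reranker service cannot be reached.
    """
    try:
        return list(rerank(query, ann_ids))
    except (ConnectionError, TimeoutError):
        metrics["fallback_error"] += 1
        return list(ann_ids)
```

Catching only connection-level errors keeps genuine bugs in the reranking path visible instead of silently degrading to ANN order.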
A 2B-parameter vision-language model rescores ANN candidates using both the query text and the actual image. LoRA adapters (r=8) target all attention projections — only 0.2% of parameters train, enabling A100 fine-tuning in minutes.
Each candidate image is fetched via a presigned S3 URL (5-min window). The model receives the query text and image, produces logits at the last token position, and a softmax is applied across all 10 candidates to yield a probability distribution.
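The softmax-across-candidates step can be illustrated in plain Python. This sketch assumes one scalar logit per candidate has already been extracted from the model; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-candidate logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_by_probability(candidate_ids, logits):
    """Pair each candidate with its probability, highest first."""
    probs = softmax(logits)
    return sorted(zip(candidate_ids, probs), key=lambda p: p[1], reverse=True)
```

Because softmax is monotonic, the ordering matches sorting by raw logit; the normalisation matters only when the scores are consumed as a probability distribution.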
After retraining: LoRA weights saved → Makefile bumps patch tag → Docker build+push → running container stopped+restarted with new image. INTERNAL_TOKEN preserved by inspecting live container env before stop.
Models trained on verbose caption benchmarks fail on real short queries. Flickr30K-CFQ addresses this with 5 query types per image — from raw sentences to bare keyword tags — reflecting actual search behavior.
31,783 Flickr30K images × 5 query types ≈ 159K training triples. Generated using LLaVA (tags), Llama 3.1 8B (paraphrasing), and Flickr30K Entities annotations.
Binary cross-entropy with logits over (query, image, label=1.0/0.0) triples. Negatives are non-clicked results shown to the user — introducing position bias (known limitation).
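For reference, binary cross-entropy with logits for a single (query, image, label) triple can be written in a numerically stable form. This is a plain-Python sketch of the standard formula, not the project's training code, which presumably uses a framework implementation.

```python
import math


def bce_with_logits(logit, label):
    """Stable BCE: max(x, 0) - x*y + log(1 + exp(-|x|)).

    `label` is 1.0 for a clicked result and 0.0 for a shown-but-not-clicked one.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mean_bce(logits, labels):
    """Average the per-triple loss over a batch."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, labels)]
    return sum(losses) / len(losses)
```

At logit 0 the loss is log 2 for either label, which is the expected value for an uninformed score.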
This project adapts the POLAR paradigm from personalizing CLIP to reranking. CLIP ViT-B/32 (frozen) handles first-stage retrieval; Qwen2-VL-2B with LoRA rescores using implicit click history — making the system personalize to each user's search patterns over time.
Seven targeted optimizations across CLIP, Qdrant, and the reranker keep end-to-end semantic search under 400ms even with GPU reranking.
Unit vectors from both CLIP encoders mean cosine similarity in Qdrant reduces to a dot product — no re-normalization at query time.
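A small numeric check of this identity, using toy 2-d vectors in place of 512-d CLIP embeddings (the helper names are illustrative):

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity for arbitrary (non-normalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
# For unit vectors both norms are 1, so cosine collapses to the dot product.
same = abs(cosine(a, b) - dot(l2_normalize(a), l2_normalize(b))) < 1e-12
```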
torch.no_grad() disables autograd tracking, so no computation graph is built on the forward pass. This cuts memory allocation ~50% vs. training mode.
limit=min(top_k, 10) regardless of UI request size (up to 156). Bounds reranker latency since each candidate needs a GPU forward pass.
All reranker passes use torch.amp.autocast("cuda", dtype=torch.bfloat16): ~50% VRAM reduction vs. float32 with negligible precision loss.
Inference uses 128–256×28² pixels vs. 256–512×28² at training. Fewer tokens per image → faster forward-pass latency at serving time.
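The pixel budgets work out as follows. In Qwen2-VL each 28×28 pixel block corresponds to roughly one visual token, so the multipliers double as approximate token counts. The constant names here are illustrative; how the project passes these values (for example as the processor's min_pixels and max_pixels settings) is an assumption.

```python
PATCH_AREA = 28 * 28  # pixels per visual token block (784)


def pixel_budget(min_tokens, max_tokens):
    """Convert a visual-token range into a (min_pixels, max_pixels) pair."""
    return min_tokens * PATCH_AREA, max_tokens * PATCH_AREA


INFERENCE_PIXELS = pixel_budget(128, 256)  # (100352, 200704)
TRAINING_PIXELS = pixel_budget(256, 512)   # (200704, 401408)
```

The inference ceiling equals the training floor, so serving uses at most half as many visual tokens per image as training does.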
asyncio.Lock ensures only one retrain fires per threshold breach — even if 100 clicks arrive simultaneously. Others return immediately.
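One way to get this behaviour with asyncio.Lock is sketched below. The RetrainGate class and its method names are hypothetical; the sketch assumes all click handlers run on a single event loop.

```python
import asyncio


class RetrainGate:
    """Let exactly one caller run the retrain job; others return at once."""

    def __init__(self, retrain):
        self._lock = asyncio.Lock()
        self._retrain = retrain  # async callable that performs the retrain

    async def maybe_retrain(self):
        # If a retrain is already in flight, skip instead of queueing.
        if self._lock.locked():
            return False
        # Acquiring an uncontended lock does not yield to the event loop,
        # so no other task can slip in between the check and the acquire.
        async with self._lock:
            await self._retrain()
            return True
```

Checking locked() before acquiring is what makes the extra callers return immediately; a bare `async with` would instead queue them and run the retrain once per caller.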