Querybank normalisation (QB-Norm)

Summary: A video digest of the QB-Norm framework, introduced in the paper "Cross Modal Retrieval with Querybank Normalisation" by S-V. Bogolin, I. Croitoru, H. Jin, Y. Liu and S. Albanie, published at CVPR 2022.
Paper: The paper can be found on arxiv here.
Code: Code and models can be found here.
Topics: vision and language, hubness, cross modal retrieval
Slides: link (pdf)

References

R. E. Bellman, "Adaptive Control Processes: A Guided Tour", Princeton: Princeton University Press (1961)
P. Demartines, "Analyse de données par réseaux de neurones auto-organisés", Dissertation (in French) (1994)
G. Doddington et al., "Sheep, goats, lambs and wolves: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation" (1998)
K. Beyer et al., "When is “nearest neighbor” meaningful?", ICDT (1999)
A. R. Hicklin et al., "The Myth of the Goats: How Many People Have Fingerprints that are Hard to Match?" (2005)
A. Berenzweig, "Anchors and hubs in audio-based music similarity", PhD thesis (2007)
D. François et al., "The concentration of fractional distances", TKDE (2007)
J-J. Aucouturier et al., "A scale-free distribution of false positives for a large class of audio similarity measures", PR (2008)
M. Radovanovic et al., "Hubs in space: Popular nearest neighbors in high-dimensional data", JMLR (2010)
C. Wah et al., "The caltech-ucsd birds-200-2011 dataset" (2011)
D. Chen et al., "Collecting highly parallel data for paraphrase evaluation", ACL-HLT (2011)
A. Frome et al., "Devise: A deep visual-semantic embedding model", NeurIPS (2013)
T. Low et al., "The hubness phenomenon: Fact or artifact?", Towards Advanced Data Analysis by Combining Soft Computing and Statistics (2013)
T-Y. Lin et al., "Microsoft coco: Common objects in context", ECCV (2014)
R. Socher et al., "Grounded compositional semantics for finding and describing images with sentences", TACL (2014)
G. Dinu et al., "Improving zero-shot learning by mitigating the hubness problem", ICLR Workshops (2015)
R. Xu et al., "Jointly modeling deep video and compositional text to bridge vision and language in a unified framework", AAAI (2015)
F. Caba Heilbron et al., "Activitynet: A large-scale video benchmark for human activity understanding", CVPR (2015)
X. Chen et al., "Microsoft coco captions: Data collection and evaluation server", arxiv (2015)
J. Xu et al., "MSR-VTT: A large video description dataset for bridging video and language", CVPR (2016)
H. Oh Song et al., "Deep metric learning via lifted structured feature embedding", CVPR (2016)
S. Smith et al., "Offline bilingual word vectors, orthogonal transformations and the inverted softmax", ICLR (2017)
L. A. Hendricks et al., "Localizing moments in video with natural language", ICCV (2017)
A. Rohrbach et al., "Movie description", IJCV (2017)
A. Conneau et al., "Word translation without parallel data", ICLR (2018)
R. Feldbauer et al., "Fast approximate hubness reduction for large high-dimensional data", ICBK (2018)
B. Zhang et al., "Cross-modal and hierarchical modeling of video and text", ECCV (2018)
A. Miech et al., "Learning a text-video embedding from incomplete and heterogeneous data", arxiv (2018)
F. Faghri et al., "VSE++: Improving visual-semantic embeddings with hard negatives", BMVC (2018)
Y. Liu et al., "Use what you have: Video retrieval using representations from collaborative experts", BMVC (2019)
R. Feldbauer et al., "A comprehensive empirical comparison of hubness reduction in high-dimensional spaces", KAIS (2019)
J. Johnson et al., "Billion-scale similarity search with gpus", IEEE Transactions on Big Data (2019)
F. Liu et al., "A strong and robust baseline for text-image matching", ACL workshops (2019)
X. Wang et al., "VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research", ICCV (2019)
C. D. Kim et al., "AudioCaps: Generating captions for audios in the wild", NAACL-HLT (2019)
J. Dong et al., "Dual encoding for zero-example video retrieval", CVPR (2019)
X. Wang et al., "Multi-similarity loss with general pair weighting for deep metric learning", CVPR (2019)
V. Gabeur et al., "Multi-modal transformer for video retrieval", ECCV (2020)
S. Chen et al., "Fine-grained video-text retrieval with hierarchical graph reasoning", CVPR (2020)
X. Li et al., "OSCAR: Object-semantics aligned pre-training for vision-language tasks", ECCV (2020)
K. Roth et al., "Revisiting training strategies and generalization performance in deep metric learning", ICML (2020)
X. Wang et al., "Cross-batch memory for embedding learning", CVPR (2020)
A. Brown et al., "Smooth-AP: Smoothing the path towards large-scale image retrieval", ECCV (2020)
A-M. Oncescu et al., "Audio retrieval with natural language queries", Interspeech (2021)
A-M. Oncescu et al., "QuerYD: A video dataset with high-quality textual and audio narrations", ICASSP (2021)
I. Croitoru et al., "Teachtext: Crossmodal generalized distillation for text-video retrieval", ICCV (2021)
H. Fang et al., "Clip2video: Mastering video-text retrieval via image clip", arxiv (2021)
G. Geigle et al., "Retrieve fast, rerank smart: Cooperative and joint approaches for improved cross-modal retrieval", arxiv (2021)
M. Patrick et al., "Support-set bottlenecks for video-text representation learning", ICLR (2021)
M. Bain et al., "Frozen in time: A joint video and image encoder for end-to-end retrieval", ICCV (2021)
H. Luo et al., "Clip4clip: An empirical study of clip for end to end video clip retrieval", arxiv (2021)
A. Miech et al., "Thinking fast and slow: Efficient text-to-visual retrieval with transformers", CVPR (2021)
A. Radford et al., "Learning transferable visual models from natural language supervision", ICML (2021)
P. Zhang et al., "VinVL: Revisiting visual representations in vision-language models", CVPR (2021)
E. Levi et al., "Rethinking preventing class-collapsing in metric learning with margin-based losses", ICCV (2021)
S-V. Bogolin et al., "Cross Modal Retrieval with Querybank Normalisation", CVPR (2022)

Samuel Albanie