Querybank normalisation (QB-Norm)
Summary: A video digest of the QB-Norm framework, introduced in "Cross Modal Retrieval with Querybank Normalisation" by S. V. Bogolin, I. Croitoru, H. Jin, Y. Liu and S. Albanie, published at CVPR 2022.
Paper: The paper can be found on arXiv here.
Code: Code and models can be found here.
Topics: vision and language, hubness, cross modal retrieval
Slides: link (pdf)
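As a quick illustration of the idea behind querybank normalisation, here is a minimal sketch of the inverted-softmax flavour of the technique: similarities from a bank of queries to the gallery are used to down-weight "hub" gallery items that many queries match strongly. The function name, `beta` default, and array shapes are illustrative assumptions, not the paper's exact implementation (see the paper and code links above for the real thing).

```python
import numpy as np

def querybank_norm(sims, bank_sims, beta=20.0):
    """Inverted-softmax style querybank normalisation (sketch).

    sims:      (num_queries, num_gallery) test query-gallery similarities
    bank_sims: (bank_size, num_gallery) similarities between querybank
               queries and the same gallery
    beta:      inverse temperature (illustrative default)
    """
    # Per-gallery normaliser: how strongly the bank as a whole activates
    # each gallery item. Hubs are activated by many bank queries, so they
    # receive a large denominator and are suppressed.
    denom = np.exp(beta * bank_sims).sum(axis=0)  # shape: (num_gallery,)
    return np.exp(beta * sims) / denom
```

With a hub item that most bank queries match strongly, the large denominator suppresses it, so a non-hub item with slightly lower raw similarity can win the retrieval instead; this is the intuition behind hubness reduction here.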
References
- R. E. Bellman, "Adaptive Control Processes: A Guided Tour", Princeton: Princeton University Press (1961)
- P. Demartines, "Analyse de données par réseaux de neurones auto-organisés", Dissertation (in French) (1994)
- G. Doddington et al., "Sheep, goats, lambs and wolves: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation" (1998)
- K. Beyer et al., "When is “nearest neighbor” meaningful?", ICDT (1999)
- A. R. Hicklin et al., "The Myth of the Goats: How Many People Have Fingerprints that are Hard to Match?" (2005)
- A. Berenzweig, "Anchors and hubs in audio-based music similarity", PhD thesis (2007)
- D. François et al., "The concentration of fractional distances", TKDE (2007)
- J-J. Aucouturier et al., "A scale-free distribution of false positives for a large class of audio similarity measures", PR (2008)
- M. Radovanović et al., "Hubs in space: Popular nearest neighbors in high-dimensional data", JMLR (2010)
- C. Wah et al., "The Caltech-UCSD Birds-200-2011 dataset" (2011)
- D. Chen et al., "Collecting highly parallel data for paraphrase evaluation", ACL-HLT (2011)
- A. Frome et al., "DeViSE: A deep visual-semantic embedding model", NeurIPS (2013)
- T. Low et al., "The hubness phenomenon: Fact or artifact?", Towards Advanced Data Analysis by Combining Soft Computing and Statistics (2013)
- T-Y. Lin et al., "Microsoft COCO: Common objects in context", ECCV (2014)
- R. Socher et al., "Grounded compositional semantics for finding and describing images with sentences", TACL (2014)
- G. Dinu et al. "Improving zero-shot learning by mitigating the hubness problem", ICLR Workshops (2015)
- R. Xu et al., "Jointly modeling deep video and compositional text to bridge vision and language in a unified framework", AAAI (2015)
- F. Caba Heilbron et al., "ActivityNet: A large-scale video benchmark for human activity understanding", CVPR (2015)
- X. Chen et al., "Microsoft COCO captions: Data collection and evaluation server", arXiv (2015)
- J. Xu et al., "MSR-VTT: A large video description dataset for bridging video and language", CVPR (2016)
- H. Oh Song et al., "Deep metric learning via lifted structured feature embedding", CVPR (2016)
- S. Smith et al., "Offline bilingual word vectors, orthogonal transformations and the inverted softmax", ICLR (2017)
- L. A. Hendricks et al., "Localizing moments in video with natural language", ICCV (2017)
- A. Rohrbach et al., "Movie description", IJCV (2017)
- A. Conneau et al., "Word translation without parallel data", ICLR (2018)
- R. Feldbauer et al., "Fast approximate hubness reduction for large high-dimensional data", ICBK (2018)
- B. Zhang et al., "Cross-modal and hierarchical modeling of video and text", ECCV (2018)
- A. Miech et al., "Learning a text-video embedding from incomplete and heterogeneous data", arXiv (2018)
- F. Faghri et al., "VSE++: Improving visual-semantic embeddings with hard negatives", BMVC (2018)
- Y. Liu et al., "Use what you have: Video retrieval using representations from collaborative experts", BMVC (2019)
- R. Feldbauer et al., "A comprehensive empirical comparison of hubness reduction in high-dimensional spaces", KIS (2019)
- J. Johnson et al., "Billion-scale similarity search with gpus", TBD (2019)
- F. Liu et al., "A strong and robust baseline for text-image matching", ACL workshops (2019)
- X. Wang et al., "VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research", ICCV (2019)
- C. D. Kim et al., "AudioCaps: Generating captions for audios in the wild", NAACL-HLT (2019)
- J. Dong et al., "Dual encoding for zero-example video retrieval", CVPR (2019)
- X. Wang et al., "Multi-similarity loss with general pair weighting for deep metric learning", CVPR (2019)
- V. Gabeur et al., "Multi-modal transformer for video retrieval", ECCV (2020)
- S. Chen et al., "Fine-grained video-text retrieval with hierarchical graph reasoning", CVPR (2020)
- X. Li et al., "OSCAR: Object-semantics aligned pre-training for vision-language tasks", ECCV (2020)
- K. Roth et al., "Revisiting training strategies and generalization performance in deep metric learning", ICML (2020)
- X. Wang et al., "Cross-batch memory for embedding learning", CVPR (2020)
- A. Brown et al., "Smooth-AP: Smoothing the path towards large-scale image retrieval", ECCV (2020)
- A-M. Oncescu et al., "Audio retrieval with natural language queries", Interspeech (2021)
- A-M. Oncescu et al., "QuerYD: A video dataset with high-quality textual and audio narrations", ICASSP (2021)
- I. Croitoru et al., "Teachtext: Crossmodal generalized distillation for text-video retrieval", ICCV (2021)
- H. Fang et al., "CLIP2Video: Mastering video-text retrieval via image CLIP", arXiv (2021)
- G. Geigle et al., "Retrieve fast, rerank smart: Cooperative and joint approaches for improved cross-modal retrieval", arXiv (2021)
- M. Patrick et al., "Support-set bottlenecks for video-text representation learning", ICLR (2021)
- M. Bain et al., "Frozen in time: A joint video and image encoder for end-to-end retrieval", ICCV (2021)
- H. Luo et al., "CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval", arXiv (2021)
- A. Miech et al., "Thinking fast and slow: Efficient text-to-visual retrieval with transformers", CVPR (2021)
- A. Radford et al., "Learning transferable visual models from natural language supervision", ICML (2021)
- P. Zhang et al., "VinVL: Revisiting visual representations in vision-language models", CVPR (2021)
- E. Levi et al., "Rethinking preventing class-collapsing in metric learning with margin-based losses", ICCV (2021)
- S-V. Bogolin et al., "Cross Modal Retrieval with Querybank Normalisation", CVPR (2022)