Title: Automatic data generation of incorrect image-text pairs for effective contrastive learning of CLIP model
Authors: Tagami, Rina
Kobayashi, Hiroki
Akizuki, Shuichi
Hashimoto, Manabu
Source citation: WSCG 2024: Full Papers Proceedings: 32nd International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, pp. 187-196.
Date of issue: 2024
Publisher: Václav Skala - UNION Agency
Document type: conference paper (conferenceObject)
URI: http://hdl.handle.net/11025/57391
ISSN: 2464-4625 (online)
2464-4617 (print)
Keywords: large language models; image retrieval; image-text dataset; CLIP; contrastive learning; k-means clustering
Abstract: In this study, we propose a method for automatically generating high-quality training data for CLIP (Contrastive Language-Image Pre-training) to improve the performance of text-based image retrieval with CLIP. In general, two types of image-text pair data are used in CLIP training: correct pairs and incorrect pairs. Correct pairs are pairs in which the image and the text content are compatible; they are created by scraping or similar methods. Incorrect pairs are incompatible image-text pairs, created by recombining the correct pairs. CLIP is trained contrastively to increase the similarity between the image and text of correct pairs and to decrease it for incorrect pairs. However, when the training data contain multiple images that are similar to each other, the texts attached to them are also likely to be similar; such recombined pairs should preferably be treated as correct, yet they are treated as incorrect. In other words, incorrect pairs whose images and texts are highly relevant to each other are learned as having low relevance, and this inconsistency harms the CLIP model. Conversely, if two images taken from the training data are not similar, the similarity between their assigned texts should also be low, so a highly reliable incorrect pair can be created by exchanging the assigned texts. We apply this idea to the results of separately clustering the images and the texts in the training data, use the similarity between clusters to generate incorrect pairs, and weight training so that the negative effect grows as the similarity between images decreases. In an experiment on the Amazon review dataset, which is commonly used in this field, the method improved the Rank@1 score by 21.0% compared to vanilla CLIP.
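To make the pipeline in the abstract concrete, the following is a minimal sketch of the cluster-based incorrect-pair generation, not the authors' released code. For brevity it clusters only image embeddings (the paper also clusters texts), and the function name make_incorrect_pairs, the cluster count, and the similarity threshold are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def make_incorrect_pairs(img_emb, n_clusters=10, sim_threshold=0.3):
    """Propose (image i, text j) negatives drawn from dissimilar image clusters."""
    # Cluster the image embeddings; the texts could be clustered analogously.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(img_emb)

    # Cluster centroids, L2-normalised so dot products are cosine similarities.
    centroids = np.stack([img_emb[labels == c].mean(axis=0)
                          for c in range(n_clusters)])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    cluster_sim = centroids @ centroids.T

    pairs, weights = [], []
    for i in range(len(img_emb)):
        for j in range(len(img_emb)):
            if i == j:
                continue
            s = cluster_sim[labels[i], labels[j]]
            # Swap texts only across dissimilar clusters: such negatives are
            # reliable, and lower image similarity yields a stronger weight.
            if s < sim_threshold:
                pairs.append((i, j))     # image i paired with text j
                weights.append(1.0 - s)
    return pairs, np.asarray(weights)

The returned weights could then scale the negative terms of the contrastive (InfoNCE-style) loss, which is one plausible reading of how the negative effect is increased as image similarity decreases.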
Rights: © Václav Skala - UNION Agency
Appears in collections: WSCG 2024: Full Papers Proceedings

Files in this record:
File: C02-2024.pdf | Description: Full text | Size: 7.43 MB | Format: Adobe PDF


Use this identifier to cite or link to this record: http://hdl.handle.net/11025/57391

All records in DSpace are protected by copyright, with all rights reserved.