Title: Automatic data generation of incorrect image-text pairs for effective contrastive learning of CLIP model
Authors: Tagami, Rina
Kobayashi, Hiroki
Akizuki, Shuichi
Hashimoto, Manabu
Citation: WSCG 2024: Full Papers Proceedings: 32nd International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, pp. 187-196.
Issue Date: 2024
Publisher: Václav Skala - UNION Agency
Document type: conference paper (conferenceObject)
URI: http://hdl.handle.net/11025/57391
ISSN: 2464-4625 (online)
2464-4617 (print)
Keywords: large language models; image retrieval; image-text dataset; CLIP; contrastive learning; k-means clustering
Abstract: In this study, we propose a method for automatically generating high-quality training data for CLIP (Contrastive Language-Image Pre-training) to improve the performance of text-based image retrieval. In general, CLIP training uses two types of image-text pairs: correct pairs and incorrect pairs. Correct pairs, in which the image and the text content match, are created by scraping or similar methods. Incorrect pairs, in which the image and text do not match, are created by shuffling the combinations of the correct pairs. CLIP is trained contrastively to increase the image-text similarity of correct pairs and decrease it for incorrect pairs. However, when the training data contains multiple images that are similar to one another, the texts attached to them are also likely to be similar; although such shuffled pairs should preferably be treated as correct, they are treated as incorrect. In other words, incorrect pairs whose image and text are in fact highly relevant are learned as having low relevance, and this inconsistency degrades the CLIP model. Conversely, if two images taken from the training data are dissimilar, the similarity between their texts should also be low, so swapping the texts between them yields a highly reliable incorrect pair. We apply this idea to the results of clustering the images and the texts in the training data, respectively, use the similarity between clusters to generate incorrect pairs, and train so that the negative effect grows as the similarity between images decreases. In an experiment on the Amazon review dataset, which is commonly used in this field, the method improved the Rank@1 score by 21.0% compared to vanilla CLIP.
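
The following is a minimal sketch of the cluster-based incorrect-pair generation described in the abstract, assuming precomputed CLIP image embeddings. The paper clusters images and texts respectively; for brevity this sketch clusters only the image side with k-means. All names here (make_incorrect_pairs, image_embs, weight) are hypothetical and not taken from the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def make_incorrect_pairs(image_embs, n_clusters=10, seed=0):
        """For each image i, pick a text index j from the image cluster most
        dissimilar to i's own cluster, and weight the resulting incorrect
        pair more strongly the lower the inter-cluster similarity is."""
        rng = np.random.default_rng(seed)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(image_embs)
        centers = km.cluster_centers_
        centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
        sim = centers @ centers.T  # cosine similarity between cluster centers

        pairs = []
        for i, c in enumerate(km.labels_):
            far = int(np.argmin(sim[c]))                # most dissimilar cluster
            j = int(rng.choice(np.flatnonzero(km.labels_ == far)))
            weight = 1.0 - sim[c, far]                  # lower similarity -> stronger negative
            pairs.append((i, j, float(weight)))         # (image idx, swapped-text idx, loss weight)
        return pairs

    # Toy usage with random stand-ins for CLIP embeddings:
    embs = np.random.default_rng(1).normal(size=(200, 512))
    print(make_incorrect_pairs(embs)[:3])

The returned weight could then scale the negative term of the contrastive loss, so that pairs built from highly dissimilar clusters, which are the most reliable incorrect pairs, contribute the most.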
Rights: © Václav Skala - UNION Agency
Appears in Collections: WSCG 2024: Full Papers Proceedings

Files in This Item:
File: C02-2024.pdf (Full text, 7.43 MB, Adobe PDF)


