Full metadata record
DC Field | Value | Language
dc.contributor.author | Tagami, Rina
dc.contributor.author | Kobayashi, Hiroki
dc.contributor.author | Akizuki, Shuichi
dc.contributor.author | Hashimoto, Manabu
dc.contributor.editor | Skala, Václav
dc.date.accessioned | 2024-07-28T18:41:49Z
dc.date.available | 2024-07-28T18:41:49Z
dc.date.issued | 2024
dc.identifier.citation | WSCG 2024: full papers proceedings: 32. International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, p. 187-196. | en
dc.identifier.issn | 2464-4625 (online)
dc.identifier.issn | 2464-4617 (print)
dc.identifier.uri | http://hdl.handle.net/11025/57391
dc.format | 10 p. | cs
dc.format.mimetype | application/pdf
dc.language.iso | en | en
dc.publisher | Václav Skala - UNION Agency | en
dc.rights | © Václav Skala - UNION Agency | en
dc.subject | large language models | cs
dc.subject | image retrieval | cs
dc.subject | image-text dataset | cs
dc.subject | CLIP | cs
dc.subject | contrastive learning | cs
dc.subject | k-means clustering | cs
dc.title | Automatic data generation of incorrect image-text pairs for effective contrastive learning of CLIP model | cs_CZ
dc.title | Automatic data generation of incorrect image-text pairs for effective contrastive learning of CLIP model | en
dc.type | conference paper | cs
dc.type | conferenceObject | en
dc.rights.access | openAccess | en
dc.type.version | publishedVersion | en
dc.description.abstract-translated | In this study, we propose a method for automatically generating high-quality CLIP (Contrastive Language-Image Pre-training) training data to improve the performance of text-based image retrieval with CLIP. In general, two types of image-text pair data are used in CLIP training: correct pairs and incorrect pairs. Correct pairs are pairs in which the image and the text content match, and are created by scraping or similar methods. Incorrect pairs are mismatched image-text pairs, created by recombining the correct pairs. CLIP is trained contrastively to increase the similarity between the image and text of correct pairs and to decrease it for incorrect pairs. However, when the training data contain multiple images that are similar to each other, the texts attached to them are also likely to be similar; although such recombined pairs would preferably be treated as correct pairs, they are treated as incorrect pairs. In other words, incorrect pairs whose image and text are in fact highly related are learned as having low relevance, and this inconsistency degrades the CLIP model. Conversely, if two images taken from the training data are not similar, the similarity between the texts assigned to them should also be low, so a highly reliable incorrect pair can be created by exchanging their assigned texts. We applied this idea by clustering the images and the texts in the training data, respectively, using the similarity between clusters to generate incorrect pairs, and training so that the negative effect increases as the similarity between images decreases. An experiment on the Amazon review dataset, which is commonly used in this field, showed a 21.0% improvement in Rank@1 score compared with vanilla CLIP. | en
dc.subject.translated | large language models | en
dc.subject.translated | image retrieval | en
dc.subject.translated | image-text dataset | en
dc.subject.translated | CLIP | en
dc.subject.translated | contrastive learning | en
dc.subject.translated | k-means clustering | en
dc.identifier.doi | https://doi.org/10.24132/CSRN.3401.20
dc.type.status | Peer reviewed | en
Appears in Collections: WSCG 2024: Full Papers Proceedings
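
The abstract above outlines how reliable incorrect pairs are generated from cluster-level similarity. The Python sketch below illustrates that idea under simplifying assumptions: it is not the authors' implementation, it clusters only the image embeddings (the paper clusters images and texts separately), and the function name make_incorrect_pairs, the sim_threshold parameter, and the 1 - similarity weighting are hypothetical choices made here for illustration.

import numpy as np
from sklearn.cluster import KMeans

def make_incorrect_pairs(img_emb, n_clusters=10, sim_threshold=0.3):
    """img_emb: (N, D) L2-normalised image embeddings of the N correct pairs.
    Returns (image index, text index, negative weight) triples, where the text
    of correct pair j is reused as a negative caption for image i."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(img_emb)

    # Cluster centroids, re-normalised so dot products are cosine similarities.
    centroids = np.stack([img_emb[labels == c].mean(axis=0)
                          for c in range(n_clusters)])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    cluster_sim = centroids @ centroids.T  # (n_clusters, n_clusters)

    pairs = []
    n = len(img_emb)
    for i in range(n):
        for j in range(n):
            ci, cj = labels[i], labels[j]
            # Swap texts only across dissimilar clusters: such pairs are very
            # unlikely to be accidentally "correct", so they are reliable negatives.
            if ci != cj and cluster_sim[ci, cj] < sim_threshold:
                # The less similar the clusters, the stronger the negative weight.
                pairs.append((i, j, 1.0 - cluster_sim[ci, cj]))
    return pairs

The returned (image, text, weight) triples would then be passed to the contrastive loss so that pairs drawn from less similar clusters are penalised more strongly, in line with the weighting described in the abstract.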

Files in This Item:
File | Description | Size | Format
C02-2024.pdf | Full text | 7.43 MB | Adobe PDF


Please use this identifier to cite or link to this item: http://hdl.handle.net/11025/57391
