Full metadata record
DC Field | Value | Language
dc.contributor.author | Tagami, Rina
dc.contributor.author | Kobayashi, Hiroki
dc.contributor.author | Akizuki, Shuichi
dc.contributor.author | Hashimoto, Manabu
dc.contributor.editor | Skala, Václav
dc.date.accessioned | 2024-07-28T18:41:49Z
dc.date.available | 2024-07-28T18:41:49Z
dc.date.issued | 2024
dc.identifier.citation | WSCG 2024: full papers proceedings: 32. International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, p. 187-196. | en
dc.identifier.issn | 2464-4625 (online)
dc.identifier.issn | 2464-4617 (print)
dc.identifier.uri | http://hdl.handle.net/11025/57391
dc.format | 10 p. | cs
dc.format.mimetype | application/pdf
dc.language.iso | en | en
dc.publisher | Václav Skala - UNION Agency | en
dc.rights | © Václav Skala - UNION Agency | en
dc.subject | large language models | cs
dc.subject | image retrieval | cs
dc.subject | image-text dataset | cs
dc.subject | CLIP | cs
dc.subject | contrastive learning | cs
dc.subject | k-means clustering | cs
dc.title | Automatic data generation of incorrect image-text pairs for effective contrastive learning of CLIP model | cs_CZ
dc.title | Automatic data generation of incorrect image-text pairs for effective contrastive learning of CLIP model | en
dc.type | conference paper | cs
dc.type | conferenceObject | en
dc.rights.access | openAccess | en
dc.type.version | publishedVersion | en
dc.description.abstract-translated | In this study, we propose a method for automatically generating high-quality CLIP (Contrastive Language-Image Pre-training) training data to improve the performance of text-based image retrieval with CLIP. In general, two types of image-text pair data are used in CLIP training: correct pairs and incorrect pairs. Correct pairs are pairs in which the image and the text content match, and are created by scraping or similar methods. Incorrect pairs are mismatched image-text pairs, created by recombining the correct pairs. CLIP is trained contrastively to increase the similarity between the image and text of correct pairs and to decrease it for incorrect pairs. However, when the training data contain multiple images that are similar to each other, the texts attached to them are also likely to be similar; although such recombined pairs would preferably be treated as correct pairs, they are treated as incorrect pairs. In other words, incorrect pairs whose image and text are in fact highly related are learned as having low relevance, and this inconsistency degrades the CLIP model. Conversely, if two images taken from the training data are not similar, the similarity between the texts assigned to them should also be low, so a highly reliable incorrect pair can be created by exchanging their assigned texts. We applied this idea by clustering the images and the texts in the training data, respectively, using the similarity between clusters to generate incorrect pairs, and training so that the negative effect increases as the similarity between images decreases. An experiment on the Amazon review dataset, which is commonly used in this field, showed a 21.0% improvement in Rank@1 score compared with vanilla CLIP. | en
dc.subject.translated | large language models | en
dc.subject.translated | image retrieval | en
dc.subject.translated | image-text dataset | en
dc.subject.translated | CLIP | en
dc.subject.translated | contrastive learning | en
dc.subject.translated | k-means clustering | en
dc.identifier.doi | https://doi.org/10.24132/CSRN.3401.20
dc.type.status | Peer reviewed | en
Appears in Collections: WSCG 2024: Full Papers Proceedings
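
The abstract above outlines how reliable incorrect pairs are generated from cluster-level similarity. The Python sketch below illustrates that idea under simplifying assumptions: it is not the authors' implementation, it clusters only the image embeddings (the paper clusters images and texts separately), and the function name make_incorrect_pairs, the sim_threshold parameter, and the 1 - similarity weighting are hypothetical choices made here for illustration.

import numpy as np
from sklearn.cluster import KMeans

def make_incorrect_pairs(img_emb, n_clusters=10, sim_threshold=0.3):
    """img_emb: (N, D) L2-normalised image embeddings of the N correct pairs.
    Returns (image index, text index, negative weight) triples, where the text
    of correct pair j is reused as a negative caption for image i."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(img_emb)

    # Cluster centroids, re-normalised so dot products are cosine similarities.
    centroids = np.stack([img_emb[labels == c].mean(axis=0)
                          for c in range(n_clusters)])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    cluster_sim = centroids @ centroids.T  # (n_clusters, n_clusters)

    pairs = []
    n = len(img_emb)
    for i in range(n):
        for j in range(n):
            ci, cj = labels[i], labels[j]
            # Swap texts only across dissimilar clusters: such pairs are very
            # unlikely to be accidentally "correct", so they are reliable negatives.
            if ci != cj and cluster_sim[ci, cj] < sim_threshold:
                # The less similar the clusters, the stronger the negative weight.
                pairs.append((i, j, 1.0 - cluster_sim[ci, cj]))
    return pairs

The returned (image, text, weight) triples would then be passed to the contrastive loss so that pairs drawn from less similar clusters are penalised more strongly, in line with the weighting described in the abstract.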

Files in This Item:
File | Description | Size | Format
C02-2024.pdf | Full text | 7.43 MB | Adobe PDF


Please use this identifier to cite or link to this item: http://hdl.handle.net/11025/57391
