Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech

Lehečka, Jan; Švec, Jan; Pražák, Aleš; Psutka, Josef

Full metadata record

DC field	Value	Language
dc.contributor.author	Lehečka, Jan
dc.contributor.author	Švec, Jan
dc.contributor.author	Pražák, Aleš
dc.contributor.author	Psutka, Josef
dc.date.accessioned	2023-01-30T11:00:27Z	-
dc.date.available	2023-01-30T11:00:27Z	-
dc.date.issued	2022
dc.identifier.citation	LEHEČKA, J. ŠVEC, J. PRAŽÁK, A. PSUTKA, J. Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. New York: Red Hook, 2022. pp. 1831-1835. ISBN: not stated, ISSN: 2308-457X	cs
dc.identifier.isbn	not stated
dc.identifier.issn	2308-457X
dc.identifier.uri	2-s2.0-85139048808
dc.identifier.uri	http://hdl.handle.net/11025/51163
dc.format	5 pp.	cs
dc.format.mimetype	application/pdf
dc.language.iso	en	en
dc.publisher	International Speech Communication Association	en
dc.relation.ispartofseries	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH	en
dc.rights	Full text is not accessible.	cs
dc.rights	© 2022 ISCA	en
dc.title	Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech	en
dc.type	conference paper	cs
dc.type	ConferenceObject	en
dc.rights.access	closedAccess	en
dc.type.version	publishedVersion	en
dc.description.abstract-translated	In this paper, we present our progress in pretraining Czech monolingual audio transformers from a large dataset containing more than 80 thousand hours of unlabeled speech, and subsequently fine-tuning the model on automatic speech recognition tasks using a combination of in-domain data and almost 6 thousand hours of out-of-domain transcribed speech. We are presenting a large palette of experiments with various fine-tuning setups evaluated on two public datasets (CommonVoice and VoxPopuli) and one extremely challenging dataset from the MALACH project. Our results show that monolingual Wav2Vec 2.0 models are robust ASR systems, which can take advantage of large labeled and unlabeled datasets and successfully compete with state-of-the-art LVCSR systems. Moreover, Wav2Vec models proved to be good zero-shot learners when no training data are available for the target ASR task.	en
dc.subject.translated	speech recognition, audio transformers, Wav2Vec	en
dc.identifier.doi	10.21437/Interspeech.2022-10439
dc.type.status	Peer-reviewed	en
dc.identifier.obd	43936705
dc.project.ID	90140/Large research infrastructure_(J) - e-INFRA CZ	cs
dc.project.ID	GA22-27800S/Using multimodal Transformers for more natural voice dialogue	cs
dc.project.ID	EF17_048/0007267/InteCom: R&D of intelligent components of advanced technologies for the Pilsen metropolitan area	cs
Appears in collections:	Články / Articles (NTIS) Články / Articles (KKY) OBD

Files attached to this record:

File	Size	Format
Lehecka_Svec_Prazak_PsutkaJV-Exploring_Capabilties_Interspeech_2022.pdf	197.58 kB	Adobe PDF

Please use this identifier to cite or link to this record: http://hdl.handle.net/11025/51163

All records in DSpace are protected by copyright, with all rights reserved.