Arabic Speech Corpus

The Arabic Speech Corpus is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. It contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech, aligned with the recordings at the phoneme level. The annotations include word stress marks on individual phonemes.^[1]

The Arabic Speech Corpus was built as part of a doctoral project by Nawar Halabi at the University of Southampton. The project was funded by MicroLinkPC, which owns an exclusive license to commercialise the corpus. The corpus is available for strictly non-commercial purposes through the official Arabic Speech Corpus website, and it is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.^[2]

Purpose

The corpus was built mainly for speech synthesis, and it has been used for building HMM-based voices in Arabic. It has also been used to automatically align other speech corpora with their phonetic transcripts, and it could serve as part of a larger corpus for training speech recognition systems.^[1]

Contents

The corpus package contains the following:

1813 .wav files containing spoken utterances.
1813 .lab files containing text utterances.
1813 .TextGrid files containing the phoneme labels, with time stamps of the boundaries where these occur in the .wav files.
phonetic-transcript.txt, in which every line has the form "[wav_filename]" "[Phoneme Sequence]".
orthographic-transcript.txt, in which every line has the form "[wav_filename]" "[Orthographic Transcript]". The orthography is in Buckwalter format, which is easier to handle in software that does not read Arabic script and can easily be converted back to Arabic.
An extra 18 minutes of fully annotated speech, separate from the files above but with the same structure, which was used to evaluate the corpus (see the PhD thesis^[1]).
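
As an illustration of how the two transcript files might be read, the following Python sketch parses one transcript line and converts Buckwalter text back to Arabic script. It is a minimal sketch under stated assumptions, not part of the corpus tooling: the function names and the sample file name utt-0001.wav are hypothetical, the line pattern is inferred only from the description above, and the mapping is the standard Buckwalter table, so any corpus-specific variations of the scheme would need to be added.

```python
import re

# Assumed line shape, inferred from the description above: "name" "text"
LINE_RE = re.compile(r'^\s*"([^"]*)"\s+"([^"]*)"\s*$')


def parse_transcript_line(line):
    """Return (wav_filename, transcript), or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groups() if match else None


def _build_buckwalter_table():
    """Map standard Buckwalter symbols to Arabic code points."""
    table = {}
    # U+0621..U+063A: hamza forms and the letters up to ghain.
    for offset, symbol in enumerate("'|>&<}AbptvjHxd*rzs$SDTZEg"):
        table[symbol] = chr(0x0621 + offset)
    # U+0640..U+0652: tatweel, the remaining letters, then the diacritics.
    for offset, symbol in enumerate("_fqklmnhwYyFNKaui~o"):
        table[symbol] = chr(0x0640 + offset)
    table["`"] = chr(0x0670)  # dagger alif
    table["{"] = chr(0x0671)  # alif wasla
    return table


BUCKWALTER_TO_ARABIC = _build_buckwalter_table()


def buckwalter_to_arabic(text):
    """Convert Buckwalter text to Arabic script; unknown characters pass through."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the corpus.
    name, text = parse_transcript_line('"utt-0001.wav" "kitAb"')
    print(name, buckwalter_to_arabic(text))
```

Characters outside the table, such as spaces and punctuation, are left unchanged, so a whole transcript line can be converted in one call.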

The corpus was also used to demonstrate that automatically extracted, orthography-based stress marks^[3] improve the quality of speech synthesis in MSA.

References

[1] Halabi, Nawar (2016). Modern Standard Arabic Phonetics for Speech Synthesis (PDF) (PhD Thesis). University of Southampton, School of Electronics and Computer Science.
[2] Halabi, Nawar (2016). Arabic Speech Corpus (Web Page). University of Oxford.
[3] Halpern, Jack (2009). Word Stress and Vowel Neutralization in Modern Standard Arabic (PDF). 2nd International Conference on Arabic Language Resources and Tools. Cairo.
