TenTen Corpus Family: Difference between revisions

Revision as of 13:56, 13 June 2017

The TenTen Corpus Family, also known as the TenTen corpora, is a set of comparable web text corpora: all of them are crawled from the Web and prepared according to the same procedure.

Description

The project was launched to build a new generation of web corpora for the corpus manager Sketch Engine. The corpora are created by web crawling and then processed with natural language processing tools developed by the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University (Brno, Czech Republic) and by Lexical Computing, the developer of Sketch Engine. The name "TenTen" refers to the target size of each corpus: 10¹⁰ (10 billion) words.^[1]

Process of the preparation of TenTen corpora

First, the text data for a corpus are downloaded from the Internet by the web spider SpiderLing.^[2] The texts are then cleaned with the tool jusText,^[3] which removes non-textual material such as navigation links, headers, and footers from HTML pages so that only full sentences are preserved. Finally, the tool ONION^[3] removes duplicate parts of texts, because much of the content on the Internet is duplicated (through citing, copying, and referring).^[1]

This approach to creating corpora builds on earlier work on preparing web corpora and processing them afterwards.^[4]^[5]^[6]
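The cleaning and de-duplication steps described above can be illustrated with a minimal sketch. It is not the jusText or ONION implementation: the sentence heuristic, the word n-gram size, and the 0.5 threshold are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re


def looks_like_full_sentence(paragraph, min_words=5):
    """Crude stand-in for boilerplate removal (assumed heuristic):
    keep paragraphs that are long enough and end like a sentence."""
    words = paragraph.split()
    return len(words) >= min_words and paragraph.rstrip().endswith((".", "!", "?"))


def word_ngrams(paragraph, n=3):
    """Return the set of word n-grams of a paragraph, lowercased."""
    words = re.findall(r"\w+", paragraph.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Drop a paragraph when more than `threshold` of its n-grams
    were already seen in paragraphs kept earlier."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = word_ngrams(paragraph, n)
        if grams and len(grams & seen) / len(grams) > threshold:
            continue  # mostly repeated content
        kept.append(paragraph)
        seen |= grams
    return kept


def clean_and_deduplicate(paragraphs):
    """Apply the sentence filter first, then remove duplicates."""
    return deduplicate([p for p in paragraphs if looks_like_full_sentence(p)])


raw = [
    "Home | About | Contact",
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy cat.",
    "Web corpora are built by crawling many sites.",
]
result = clean_and_deduplicate(raw)
```

Here the navigation line is dropped by the sentence filter and the near-copy of the fox sentence is dropped as a duplicate, so two paragraphs remain.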

Structural attributes

The following structural attributes, meta-information describing e.g. the web domain of a text or its crawl date, are shared by all TenTen corpora. Individual corpora may have additional specific attributes.

Document attributes

top level domain – domain at the highest level in the hierarchical Domain Name System (e.g. "com")
website – identification string defining a realm of administrative autonomy within the Internet (e.g. "wikipedia.org")
web domain – collection of related web pages (e.g. "la.wikipedia.org")
crawl date – date of downloading the document from the web
url – URL of a source document
wordcount – number of words of a document
length – length of a document in thousands of words
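As a rough illustration, the document attributes listed above could be derived from a URL and its text as follows. This is a sketch under stated assumptions, not the actual TenTen tooling: the dictionary keys and the function name are hypothetical, words are counted by whitespace splitting, and the "website" is approximated by the last two host labels, which is wrong for multi-part suffixes such as "co.uk".

```python
from urllib.parse import urlsplit


def document_attributes(url, text, crawl_date):
    """Build a dictionary of document-level attributes (hypothetical keys)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    words = text.split()  # assumption: whitespace-delimited words
    return {
        "top_level_domain": labels[-1],     # e.g. "org"
        "website": ".".join(labels[-2:]),   # naive: last two labels only
        "web_domain": host,                 # e.g. "la.wikipedia.org"
        "crawl_date": crawl_date,
        "url": url,
        "wordcount": len(words),
        "length": len(words) / 1000,        # in thousands of words
    }


attrs = document_attributes(
    "https://la.wikipedia.org/wiki/Lingua_Latina",
    "verbum " * 1500,
    "2017-06-13",
)
```

A production version would more likely consult a public suffix list to find the website, since the last-two-labels rule is only a convenience for the example.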

Paragraph attributes

heading – flag distinguishing headings from other text (1 if the paragraph is a heading, 0 otherwise)
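To show how document and paragraph attributes can sit alongside the text, the sketch below writes a document as one token per line with XML-like structure tags. The tag names `doc` and `p`, the attribute spelling, and the whitespace tokenization are assumptions for illustration and may differ from the markup used in the actual corpora.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_attrs, paragraphs):
    """Serialize a document as one token per line.

    doc_attrs  -- dict of document attributes (hypothetical names)
    paragraphs -- list of (text, is_heading) pairs
    """
    attr_text = " ".join(f"{key}={quoteattr(str(value))}" for key, value in doc_attrs.items())
    lines = [f"<doc {attr_text}>"]
    for text, is_heading in paragraphs:
        # heading is 1 for headings and 0 for other text
        lines.append(f'<p heading="{1 if is_heading else 0}">')
        lines.extend(text.split())  # assumption: whitespace tokenization
        lines.append("</p>")
    lines.append("</doc>")
    return "\n".join(lines)


vertical = to_vertical(
    {"url": "http://example.com/"},
    [("Title", True), ("Some body text.", False)],
)
```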

Available corpora

The following corpora were available in Sketch Engine as of spring 2017.

arTenTen (Arabic web corpus)^[7]
bgTenTen (Bulgarian web corpus)^[8]
caTenTen (Catalan web corpus)
czTenTen (Czech web corpus)^[9]
daTenTen (Danish web corpus)
deTenTen (German web corpus)
elTenTen (Greek web corpus)
enTenTen (English web corpus)
esAmTenTen (American Spanish web corpus)^[10]
esTenTen (Spanish web corpus)^[11]
etTenTen (Estonian web corpus)^[12]
fiTenTen (Finnish web corpus)
frTenTen (French web corpus)
heTenTen (Hebrew web corpus)
huTenTen (Hungarian web corpus)
itTenTen (Italian web corpus)
jpTenTen (Japanese web corpus)
koTenTen (Korean web corpus)
ltTenTen (Lithuanian web corpus)
lvTenTen (Latvian web corpus)
nlTenTen (Dutch web corpus)
noTenTen (Norwegian web corpus)
plTenTen (Polish web corpus)
ptTenTen (Portuguese web corpus)
roTenTen (Romanian web corpus)
ruTenTen (Russian web corpus)
skTenTen (Slovak web corpus)
slTenTen (Slovene web corpus)
svTenTen (Swedish web corpus)
trTenTen (Turkish web corpus)
uaTenTen (Ukrainian web corpus)
yoTenTen (Yoruba web corpus)
zhTenTen (Chinese Simplified characters web corpus)

References

[1] Jakubíček, Miloš; Kilgarriff, Adam; Kovář, Vojtěch; Rychlý, Pavel; Suchomel, Vít (July 2013). The Tenten Corpus Family (PDF). 7th International Corpus Linguistics Conference CL. Lancaster, UK: Lancaster University. pp. 125–127. Retrieved 13 June 2017.
[2] Suchomel, Vít; Pomikálek, Jan (17 April 2012). "Efficient web crawling for large text corpora" (PDF). Proceedings of the seventh Web as Corpus Workshop (WAC7). 7th Web as Corpus Workshop. Lyon, France: Association for Computational Linguistics (ACL) on Web as Corpus. pp. 39–43. Retrieved 13 June 2017.
[3] Pomikálek, Jan (2011). Removing boilerplate and duplicate content from web corpora (PhD). Faculty of Informatics, Masaryk University. Retrieved 17 April 2017.
[4] Baroni, Marco; Kilgarriff, Adam; Kovář, Vojtěch; Rychlý, Pavel; Suchomel, Vít (July 2013). Large linguistically-processed web corpora for multiple languages (PDF). 11th Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations. Association for Computational Linguistics. Trento, Italy: Lancaster University. pp. 87–90. Retrieved 13 June 2017.
[5] Kilgarriff, Adam; Reddy, Siva; Pomikálek, Jan; Avinesh, PVS (May 2010). A Corpus Factory for Many Languages. 7th Language Resources and Evaluation Conference. Valletta, Malta: ELRA. Retrieved 13 June 2017.
[6] Sharoff, Serge (2006). "Creating general-purpose corpora using automated search engine queries". In Baroni, Marco; Bernardini, Silvia (eds.). Wacky! Working papers on the Web as Corpus (PDF). Bologna, Italy: GEDIT. pp. 63–98. ISBN 88-6027-004-9.
[7] Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTen-Ten: a new, vast corpus for Arabic. Proceedings of WACL.
[8] Kilgarriff, A., Jakubíček, M., Pomikálek, J., Sardinha, T. B., & Whitelock, P. (2014). PtTenTen: a corpus for Portuguese lexicography. Working with Portuguese Corpora, 111-30.
[9] Suchomel, V. (2012, December). Recent Czech web corpora. In 6th Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (pp. 77-83).
[10] Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia-Social and Behavioral Sciences, 95, 12-19.
[11] Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia-Social and Behavioral Sciences, 95, 12-19.
[12] Srdanović, I. (2016). A Research Project on Language Resources for Learners of Japanese. Inter Faculty, 6.

External links

TenTen Corpus Family (on the Sketch Engine website)

Revision as of 13:55, 13 June 2017 by Epheson (minor edit): Epheson moved page The TenTen Corpus Family to TenTen Corpus Family: name without article		Revision as of 13:56, 13 June 2017 by Epheson (minor edit): no edit summary
Line 1:		Line 1:
	'''~~The~~ TenTen Corpus Family''' or also called as '''TenTen corpora''' is a set of web [[text corpora]] which are comparable. It means that corpora are crawled from the Web and prepared according to the same manual.		The '''TenTen Corpus Family''' or also called as '''TenTen corpora''' is a set of web [[text corpora]] which are comparable. It means that corpora are crawled from the Web and prepared according to the same manual.

	== Description ==		== Description ==
