TenTen Corpus Family
Revision as of 13:56, 13 June 2017
The TenTen Corpus Family, also called the TenTen corpora, is a set of comparable web text corpora, meaning that all of them are crawled from the Web and processed according to the same procedure.
Description
The project was set up to create a new generation of web corpora for the corpus manager Sketch Engine. The corpora are built by web crawling and then processed with natural language processing tools developed at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, Czech Republic, and by Lexical Computing, the developer of Sketch Engine. The name "TenTen" refers to the target size of each corpus, 10^10 (10 billion) words.[1]
Process of the preparation of TenTen corpora
First, the text data for a corpus are downloaded from the Internet by the web spider SpiderLing.[2] The texts are then cleaned with the tool jusText,[3] which removes non-textual material such as navigation links, headers, and footers from HTML pages so that only full sentences are preserved. Finally, the tool ONION[3] removes duplicate parts of texts, because much content on the Internet is duplicated through citing, copying, and referring.[1]
This approach builds on earlier work on creating web corpora and processing them.[4][5][6]
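The duplicate-removal step can be illustrated with a toy example. The sketch below is not the ONION implementation; it only shows the general idea of dropping a paragraph when most of its word n-grams have already been seen in earlier text. The function names, the n-gram length of 5, and the 0.5 threshold are assumptions chosen for illustration.

```python
def word_ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def remove_duplicate_paragraphs(paragraphs, n=5, threshold=0.5):
    """Keep a paragraph unless more than `threshold` of its n-grams were already seen.

    Illustrative simplification: tokens are lower-cased, whitespace-separated
    words, and paragraphs shorter than n words are always kept.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = word_ngrams(paragraph.lower().split(), n)
        if not grams:
            # Too short to compare; keep it unchanged.
            kept.append(paragraph)
            continue
        overlap = len(grams & seen) / len(grams)
        if overlap <= threshold:
            kept.append(paragraph)
        # Remember the n-grams so later copies are detected.
        seen |= grams
    return kept


paragraphs = [
    "the quick brown fox jumps over the lazy dog today",
    "a completely different sentence about corpus linguistics and crawling",
    "the quick brown fox jumps over the lazy dog today",
]
unique_paragraphs = remove_duplicate_paragraphs(paragraphs)
```

Comparing n-grams rather than whole paragraphs means that a paragraph is also caught when it was copied with small changes, which matches the kind of duplication (citing, copying, referring) described above.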
Structural attributes
The following structural attributes, meta-information describing e.g. the web domain of a text or its crawl date, are shared by all TenTen corpora. Individual corpora may have additional, specific attributes.
Document attributes
- top level domain – domain at the highest level in the hierarchical Domain Name System (e.g. "com")
- website – identification string defining a realm of administrative autonomy within the Internet (e.g. "wikipedia.org")
- web domain – collection of related web pages (e.g. "la.wikipedia.org")
- crawl date – date of downloading the document from the web
- url – URL of a source document
- wordcount – number of words of a document
- length – length of a document in thousands of words
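As a rough illustration of how the document attributes relate to each other, the sketch below derives them from a URL and a text. It is not the code used to build the corpora. The function name and dictionary keys are hypothetical, words are counted by splitting on whitespace, and the website is approximated naively as the last two labels of the host name, which is wrong for suffixes such as "co.uk".

```python
from urllib.parse import urlsplit


def document_attributes(url, text, crawl_date):
    """Derive TenTen-style document attributes for one downloaded page."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    wordcount = len(text.split())  # assumption: whitespace tokenization
    return {
        "top_level_domain": labels[-1],     # e.g. "org"
        "website": ".".join(labels[-2:]),   # naive: last two host labels
        "web_domain": host,                 # full host name
        "crawl_date": crawl_date,
        "url": url,
        "wordcount": wordcount,
        "length": wordcount / 1000,         # in thousands of words
    }


attrs = document_attributes(
    "https://la.wikipedia.org/wiki/Roma",
    "Roma urbs aeterna est",
    "2017-06-13",
)
```

For this URL the three domain attributes come out as "org", "wikipedia.org", and "la.wikipedia.org", matching the examples in the list above.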
Paragraph attributes
- heading – flag distinguishing headings from other text (1 if the paragraph is a heading, 0 otherwise)
Available corpora
The following corpora were available in Sketch Engine as of spring 2017.
- arTenTen (Arabic web corpus)[7]
- bgTenTen (Bulgarian web corpus)[8]
- caTenTen (Catalan web corpus)
- czTenTen (Czech web corpus)[9]
- daTenTen (Danish web corpus)
- deTenTen (German web corpus)
- elTenTen (Greek web corpus)
- enTenTen (English web corpus)
- esAmTenTen (American Spanish web corpus)[10]
- esTenTen (Spanish web corpus)[11]
- etTenTen (Estonian web corpus)[12]
- fiTenTen (Finnish web corpus)
- frTenTen (French web corpus)
- heTenTen (Hebrew web corpus)
- huTenTen (Hungarian web corpus)
- itTenTen (Italian web corpus)
- jpTenTen (Japanese web corpus)
- koTenTen (Korean web corpus)
- ltTenTen (Lithuanian web corpus)
- lvTenTen (Latvian web corpus)
- nlTenTen (Dutch web corpus)
- noTenTen (Norwegian web corpus)
- plTenTen (Polish web corpus)
- ptTenTen (Portuguese web corpus)
- roTenTen (Romanian web corpus)
- ruTenTen (Russian web corpus)
- skTenTen (Slovak web corpus)
- slTenTen (Slovene web corpus)
- svTenTen (Swedish web corpus)
- trTenTen (Turkish web corpus)
- uaTenTen (Ukrainian web corpus)
- yoTenTen (Yoruba web corpus)
- zhTenTen (Chinese Simplified characters web corpus)
References
- ^ a b Jakubíček, Miloš; Kilgarriff, Adam; Kovář, Vojtěch; Rychlý, Pavel; Suchomel, Vít (July 2013). The Tenten Corpus Family (PDF). 7th International Corpus Linguistics Conference CL. Lancaster, UK: Lancaster University. pp. 125–127. Retrieved 13 June 2017.
- ^ Suchomel, Vít; Pomikálek, Jan (17 April 2012). "Efficient web crawling for large text corpora" (PDF). Proceedings of the seventh Web as Corpus Workshop (WAC7). 7th Web as Corpus Workshop. Lyon, France: Association for Computational Linguistics (ACL) on Web as Corpus. pp. 39–43. Retrieved 13 June 2017.
- ^ a b Pomikálek, Jan (2011). Removing boilerplate and duplicate content from web corpora (PhD). Faculty of Informatics, Masaryk University. Retrieved 17 April 2017.
- ^ Baroni, Marco; Kilgarriff, Adam (2006). Large linguistically-processed web corpora for multiple languages (PDF). 11th Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations. Trento, Italy: Association for Computational Linguistics. pp. 87–90. Retrieved 13 June 2017.
- ^ Kilgarriff, Adam; Reddy, Siva; Pomikálek, Jan; Avinesh, PVS (May 2010). A Corpus Factory for Many Languages. 7th Language Resources and Evaluation Conference. Valletta, Malta: ELRA. Retrieved 13 June 2017.
- ^ Sharoff, Serge (2006). "Creating general-purpose corpora using automated search engine queries". In Baroni, Marco; Bernardini, Silvia (eds.). Wacky! Working papers on the Web as Corpus (PDF). Bologna, Italy: GEDIT. pp. 63–98. ISBN 88-6027-004-9.
- ^ Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTen-Ten: a new, vast corpus for Arabic. Proceedings of WACL.
- ^ Kilgarriff, A., Jakubíček, M., Pomikalek, J., Sardinha, T. B., & Whitelock, P. (2014). PtTenTen: a corpus for Portuguese lexicography. Working with Portuguese Corpora, 111-30.
- ^ Suchomel, V. (2012, December). Recent Czech web corpora. In 6th Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (pp. 77-83).
- ^ Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia-Social and Behavioral Sciences, 95, 12-19.
- ^ Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia-Social and Behavioral Sciences, 95, 12-19.
- ^ Srdanović, I. (2016). A Research Project on Language Resources for Learners of Japanese. Inter Faculty, 6.
External links
- TenTen Corpus Family (on the Sketch Engine website)