Corpus of Contemporary American English
The Corpus of Contemporary American English (COCA) is a corpus of more than one billion words of contemporary American English.[1] It was created by Mark Davies, a retired professor of corpus linguistics at Brigham Young University (BYU).[2][3]
Content
COCA contained more than one billion words as of November 2021.[1][2][4] The corpus has grown steadily: in 2009 it contained more than 385 million words;[5] in 2010 it grew to 400 million words;[6] by March 2019[7] it had grown to 560 million words;[8] and by December 2019 it had reached 1 billion words.[2]
As of November 2021, the corpus is composed of 485,202 texts.[9] According to the corpus website (https://www.english-corpora.org/coca/), the current corpus (November 2021) includes 24–25 million words for each year from 1990 to 2019.
The corpus is used by researchers around the world and has been directly cited more than 3,000 times, according to Mark Davies' Google Scholar profile.[10] In October 2021 alone, it was used by 77,605 people (see Figure 1).
For each year in the corpus (1990–2019), the text is evenly divided among six registers (genres): TV/movies, spoken, fiction, magazine, newspaper, and academic (see the Texts and Registers page of the COCA website). In addition to these six registers, COCA (as of November 2021) also contains 125,496,215 words from blogs and 129,899,426 words from websites, which extends its coverage to contemporary online English (see the Texts and Registers page of the COCA website).[9]
The texts come from a variety of sources:
- Spoken: (85 million words) Transcripts of unscripted conversation from nearly 150 different TV and radio programs.
- Fiction: (81 million words) Short stories and plays, first chapters of books 1990–present, and movie scripts.
- Popular magazines: (86 million words) Nearly 100 different magazines, from a range of domains such as news, health, home and gardening, women's, financial, religion, and sports.
- Newspapers: (81 million words) Ten newspapers from across the US, with text from different sections of the newspapers, such as local news, opinion, sports, and the financial section.
- Academic Journals: (81 million words) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress Classification system.
Availability
COCA can be searched free of charge through its web interface; users only need to register a free account (see the Register page of COCA).[11] A free account allows a limited number of queries per day. Less restricted access is available at cost (see the Premium section of the Upgrade page of English-Corpora.org).[12] The full-text corpus data is available at additional cost.[13] All of the English-Corpora.org corpora are available to academic groups for a nominal fee through an academic group license.[14]
Queries
- The interface is the same as the BYU-BNC interface for the 100-million-word British National Corpus, and as the interfaces for the 100-million-word TIME Magazine corpus and the 400-million-word Corpus of *Historical* American English (COHA), which covers the 1810s–2000s (see links below)
- Queries by word, phrase, alternates, substring, part of speech, lemma, synonyms (see below), and customized lists (see below)
- The corpus is tagged by CLAWS, the same part of speech tagger that was used for the BNC and the TIME corpus
- Chart listings (totals for all matching forms in each genre or year, 1990–present, as well as for subgenres) and table listings (frequency for each matching form in each genre or year)
- Full collocates searching (up to ten words left and right of node word)
- Re-sortable concordances, showing the most common words/strings to the left and right of the searched word
- Comparisons between genres or time periods (e.g. collocates of 'chair' in fiction or academic, nouns with 'break the [N]' in newspapers or academic, adjectives that occur primarily in sports magazines, or verbs that are more common 2005–2010 than previously)
- One-step comparisons of collocates of related words, to study semantic or cultural differences between words (e.g. collocates of 'small', 'little', 'tiny', 'minuscule', or 'lilliputian'; 'Democrats' vs 'Republicans'; 'men' vs 'women'; or 'rob' vs 'steal')
- Users can include semantic information from a 60,000 entry thesaurus directly as part of the query syntax (e.g. frequency and distribution of synonyms of 'beautiful', synonyms of 'strong' occurring in fiction but not academic, synonyms of 'clean' + noun ('clean the floor', 'washed the dishes'))
- Users can also create their own 'customized' word lists, and then re-use these as part of subsequent queries (e.g. lists related to a particular semantic category (clothes, foods, emotions), or a user-defined part of speech)
- Note that the corpus is available only through the web interface, due to copyright restrictions.
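The collocate search described in the list above (words found within a window of up to ten tokens to the left and right of a node word) can be illustrated on a small piece of local text. The following is a minimal sketch only: it does not use COCA data or any COCA interface, the function name `collocates` and the sample sentences are hypothetical, and the simple regular-expression tokenizer is an assumption standing in for the corpus's own tokenization and CLAWS tagging.

```python
from collections import Counter
import re


def collocates(text, node, window=10):
    """Count words occurring within `window` tokens left or right of `node`.

    Illustrative only: lowercases the text and splits on runs of letters
    and apostrophes, which is far cruder than a tagged corpus.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - window):i]    # up to `window` tokens before
        right = tokens[i + 1:i + 1 + window]   # up to `window` tokens after
        counts.update(left + right)
    return counts


# Hypothetical sample text, not drawn from the corpus.
sample = (
    "She pulled a chair up to the table. "
    "The chair of the committee opened the meeting. "
    "He sat in the old wooden chair by the window."
)

# Most frequent words within two tokens of 'chair' in the sample.
top = collocates(sample, "chair", window=2).most_common(3)
print(top)
```

Running the same function on two different text collections (for example, one of fiction and one of academic prose) and comparing the resulting counts gives a rough local analogue of the genre comparisons listed above.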
Related
The Corpus of Global Web-Based English (GloWbE; pronounced "globe") contains about 1.9 billion words of text from twenty different countries. This makes it about 100 times as large as other corpora such as the International Corpus of English, and it allows for many types of searches that would not otherwise be possible. In addition to the online interface, full-text data from the corpus can be downloaded.
It is distinctive in allowing comparisons between different varieties of English. GloWbE is related to the many other corpora of English.[15]
Bibliography
- Davies, Mark (2010). "The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English". Literary and Linguistic Computing. 25 (4): 447–65. doi:10.1093/llc/fqq018.
- Bennett, Gena R. (2010). Using Corpora in the Language Learning Classroom: Corpus Linguistics for Teachers. Ann Arbor, Michigan: University of Michigan. p. 144. ISBN 978-0-472-03385-0.
- Davies, Mark (2010). "More than a peephole: Using large and diverse online corpora". International Journal of Corpus Linguistics. 15 (3): 405–11. doi:10.1075/ijcl.15.3.13dav.
- Anderson, Wendy; Corbett, John (2009), Exploring English with Online Corpora, Palgrave Macmillan, p. 205, ISBN 978-0-230-55140-4
- Davies, Mark (2009). "The 385+ Million Word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights". International Journal of Corpus Linguistics. 14 (2). John Benjamins Publishing Company: 159–190. doi:10.1075/ijcl.14.2.02dav.
- Lindquist, Hans (2009). Corpus Linguistics and the Description of English. Edinburgh University Press. ISBN 978-0-7486-2615-1.
- Davies, Mark (2005). "The advantage of using relational databases for large corpora: Speed, advanced queries, and unlimited annotation". International Journal of Corpus Linguistics. 10 (3). John Benjamins Publishing Company: 307–334. doi:10.1075/ijcl.10.3.02dav.
References
- ^ a b Milana, Prior (2021). "A Comparative Corpus Study on Intensifier Usage across Registers in American English".
- ^ a b c "Mark Davies, Professor of (Corpus) Linguistics, Brigham Young University (BYU)". www.mark-davies.org. Retrieved 2021-11-09.
- ^ Kauhanen, Henri (2011-03-21). "The Corpus of Contemporary American English: Background and history". VARIENG. Retrieved 2011-10-13.
- ^ Official website of COCA
- ^ Davies, Mark (2009-01-01). "The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights". International Journal of Corpus Linguistics. 14 (2): 159–190. doi:10.1075/ijcl.14.2.02dav. ISSN 1384-6655.
- ^ Davies, Mark (2010-12-01). "The Corpus of Contemporary American English as the first reliable monitor corpus of English". Literary and Linguistic Computing. 25 (4): 447–464. doi:10.1093/llc/fqq018. ISSN 0268-1145.
- ^ Davies, Mark; Kim, Jong Bok (2019-03-01). "The advantages and challenges of "big data": Insights from the 14 billion word iWeb corpus". Linguistic Research. 36 (1): 1–34. doi:10.17250/khisli.36.1.201903.001. ISSN 1229-1374.
- ^ Davies, Mark; Kim, Jong-Bok (March 2019). "The advantages and challenges of "big data": Insights from the 14 billion word iWeb corpus". Linguistic Research. 36 (1): 1–34. doi:10.17250/khisli.36.1.201903.001 – via ProQuest.
- ^ a b "Corpus of Contemporary American English (COCA)". www.english-corpora.org. Retrieved 2021-11-08.
- ^ "Mark Davies". scholar.google.com. Retrieved 2021-11-09.
- ^ "Corpus of Contemporary American English". Corpus of Contemporary American English. Retrieved 20 July 2017.
- ^ "English Corpora Premium". English Corpora. Retrieved 8 November 2021.
- ^ "Corpus data: Purchase". Retrieved 20 July 2017.
- ^ "English Corpora: most widely used online corpora. Billions of words of data: free online access". www.english-corpora.org. Retrieved 2021-11-09.
- ^ "Corpus of Web-Based Global English". www.english-corpora.org. Retrieved 2019-12-18.