Summary information

Study title

The national corpus of contemporary Welsh, 2016-2020

Creator

Knight, D, Cardiff University
Morris, S, Swansea University
Fitzpatrick, T, Swansea University
Rayson, P, Lancaster University
Spasić, I, Cardiff University
Thomas, E, Bangor University
Lovell, A, Swansea University
Morris, J, Cardiff University
Evas, J, Cardiff University
Stonelake, M, Swansea University
Arman, L, Cardiff University
Davies, J, Bangor University
Ezeani, I, Lancaster University
Neale, S, Cardiff University
Needs, J, Swansea University
Piao, S, Lancaster University
Rees, M, Swansea University
Watkins, G, Cardiff University
Williams, L, Cardiff University
Muralidaran, V, Cardiff University
Tovey-Walsh, B, Swansea University
Anthony, L, Waseda University
Cobb, T, University of Quebec at Montreal
Deuchar, M, University of Cambridge
Donnelly, K, N/A
McCarthy, M, The University of Nottingham
Scannell, K, Saint Louis University

Study number / PID

854531 (UKDA)

10.5255/UKDA-SN-854531 (DOI)

Data access

Open

Series

Not available

Abstract

The CorCenCC corpus contains over 11 million words (circa 14.4m tokens). CorCenCC is the first corpus of the Welsh language that covers all three aspects of contemporary Welsh: spoken, written and electronically mediated (e-language). It offers a snapshot of the Welsh language across a range of contexts of use, e.g. private conversations, group socialising, business and other work situations, education, the various published media, and public spaces. It includes examples of news headlines, personal and professional emails and correspondence, academic writing, formal and informal speech, blog posts and text messaging. Language data was sampled from a range of different speakers and users of Welsh, from all regions of Wales, of all ages and genders, with a wide range of occupations, and with a variety of linguistic backgrounds (e.g. how they came to speak Welsh), to reflect the diversity of text types and of Welsh speakers found in contemporary Wales. In this way, the CorCenCC corpus provides the means for empowering users of Welsh to better understand and observe the language across diverse settings, and creates a solid evidence base for the teaching of contemporary Welsh to those who aspire to use it. Over time, the corpus has the potential to make a significant contribution to the transformation of Welsh as the language of public, commercial, education and governmental discourse. Beta versions of some bilingual corpus query tools have also been created as part of the CorCenCC project (see Related Resources). These include simple query, full query, frequency list, n-gram, keyword and collocation functionalities. The CorCenCC website also contains Y Tiwtiadur, a collection of data-driven teaching and learning tools designed to help supplement Welsh language learning at all ages and levels. Y Tiwtiadur contains four distinct corpus-based exercises: Gap Filling (Cloze), Vocabulary Profiler, Word Identification and Word-in-Context. To access this...

Methodology

Data collection period

01/03/2016 - 30/11/2020

Country

Wales

Time dimension

Not available

Analysis unit

Individual
Organization
Family
Family: Household family

Universe

Not available

Sampling procedure

Not available

Kind of data

Text

Data collection mode

A sampling frame was created to underpin the data collection for the project, to ensure that we captured a range of different speakers across different discourse contexts and geographical locations. The sampling frame was designed to reflect the current demographics of Welsh speakers, so that it represents the contemporary sociolinguistic situation of the language as accurately as possible.
Spoken data was sourced via two main approaches: (i) recruitment of participants to be recorded and (ii) recruitment of participants to contribute spoken data via a novel CorCenCC crowdsourcing app. The scope of (i) included not only research assistants going into the field to record speakers but also participants recording themselves in various interactions. This was facilitated through a network of local 'champions' (active language animateurs in targeted areas) or the Mentrau Iaith (each local authority in Wales has an associated Menter Iaith, i.e. a local language initiative: a community-based organisation dedicated to raising the profile of the Welsh language). Recruitment for (ii) was achieved by publicising the app (for example through social media, television appearances and publicity materials) in order to reach a different cohort of participants who would be recording individually and in more private domains. Large Welsh language events such as the National Eisteddfod and Tafwyl provided opportunities for the team to reach a large cross-section of participants as well as raise general awareness of the project.
The crowdsourcing app was made available on iOS, Android and via a web interface, and was promoted through media campaigns, e.g. appearances on television programmes such as S4C's Prynhawn Da and on both Welsh- and English-medium radio, and through local engagement events. Promotional material included pens, coasters, leaflets and postcard-size information sheets. An 'unofficial' mascot - based on a cat called Cor-pws - was designed to facilitate the participation of those under 18 and proved popular with contributors of all ages. Facebook and Twitter accounts for CorCenCC were set up in the first months of the project to further enhance the recruitment and participation of contributors.
Novel transcription conventions were devised for processing CorCenCC's spoken data (which was captured via the CorCenCC crowdsourcing app or manually, using audio recording devices). These conventions enabled us to reflect the whole spectrum of dialect/register variation captured in our speech data (making them more useful to academic researchers) as well as representing the participants' speech itself more accurately.
In terms of written data, the good relationship forged at the beginning of the project with Welsh language publishers such as Gwasg y Lolfa led to the incorporation into the corpus of many up-to-date novels and books. A unique source of written data in the Welsh language is the locally based Papurau Bro (i.e. local community Welsh-language newspapers). Fairly rapid data capture (for example, sampling from the Welsh language academic journal Gwerddon through the Coleg Cymraeg Cenedlaethol, and from adult L2 pedagogical resources / examination papers through the Welsh Joint Education Committee) resulted from our engagement with other project stakeholders in the planning process for the project.
Regarding e-language data, website owners and blog authors cooperated generously and targets were exceeded. Contributors of SMS messages and emails were recruited in the same way as for the spoken data.
All relevant participant information and descriptive metadata was recorded at the time of data collection. Permissions to share the data in an online public resource were essential to the development of CorCenCC. These permissions were obtained from the relevant legal entities (e.g. the copyright owner; the speaker themselves) before the data was collected and locally stored. The raw data together with the corresponding permissions and metadata were deposited into a local file storage system.

Funding information

Grant number

ES/M011348/1

Access

Publisher

UK Data Service

Publication year

2021

Terms of data access

The Data Collection is available from an external repository. Access is available via Related Resources.

Related publications

Not available