Summary information

Study title

The national corpus of contemporary Welsh, 2016-2020

Creator

Knight, D, Cardiff University

Morris, S, Swansea University

Fitzpatrick, T, Swansea University

Rayson, P, Lancaster University

Spasić, I, Cardiff University

Thomas, E, Bangor University

Lovell, A, Swansea University

Morris, J, Cardiff University

Evas, J, Cardiff University

Stonelake, M, Swansea University

Arman, L, Cardiff University

Davies, J, Bangor University

Ezeani, I, Lancaster University

Neale, S, Cardiff University

Needs, J, Swansea University

Piao, S, Lancaster University

Rees, M, Swansea University

Watkins, G, Cardiff University

Williams, L, Cardiff University

Muralidaran, V, Cardiff University

Tovey-Walsh, B, Swansea University

Anthony, L, Waseda University

Cobb, T, University of Quebec at Montreal

Deuchar, M, University of Cambridge

Donnelly, K, N/A

McCarthy, M, The University of Nottingham

Scannell, K, Saint Louis University

Study number / PID

854531 (UKDA)

10.5255/UKDA-SN-854531 (DOI)

Data access

Open

Series

Not available

Abstract

The CorCenCC corpus contains over 11 million words (circa 14.4m tokens). CorCenCC is the first corpus of the Welsh language that covers all three aspects of contemporary Welsh: spoken, written and electronically mediated (e-language). It offers a snapshot of the Welsh language across a range of contexts of use, e.g. private conversations, group socialising, business and other work situations, in education, in the various published media, and in public spaces. It includes examples of news headlines, personal and professional emails and correspondence, academic writing, formal and informal speech, blog posts and text messaging. Language data was sampled from a range of different speakers and users of Welsh, from all regions of Wales, of all ages and genders, with a wide range of occupations, and with a variety of linguistic backgrounds (e.g. how they came to speak Welsh), to reflect the diversity of text types and of Welsh speakers found in contemporary Wales. In this way, the CorCenCC corpus provides the means for empowering users of Welsh to better understand and observe the language across diverse settings, and creates a solid evidence base for the teaching of contemporary Welsh to those who aspire to use it. Over time, the corpus has the potential to make a significant contribution to the transformation of Welsh as the language of public, commercial, education and governmental discourse. Beta versions of some bilingual corpus query tools have also been created as part of the CorCenCC project (see Related Resources). These include simple query, full query, frequency list, n-gram, keyword and collocation functionalities. The CorCenCC website also contains Y Tiwtiadur, a collection of data-driven teaching and learning tools designed to help supplement Welsh language learning at all ages and levels. Y Tiwtiadur contains four distinct corpus-based exercises: Gap Filling (Cloze), Vocabulary Profiler, Word Identification and Word-in-Context. To access this...

Topics

Media, communications and language

Society and culture

Keywords

LINGUISTICS

WELSH (LANGUAGE)

PEDAGOGY

TEACHING

COMMUNITIES

2021

Data collection period

01/03/2016 - 30/11/2020

Country

Wales

Time dimension

Not available

Analysis unit

Individual

Organization

Family

Family: Household family

Universe

Not available

Sampling procedure

Not available

Kind of data

Text

Data collection mode

A sampling frame was created to underpin the data collection for the project, to ensure that we captured a range of different speakers across different discourse contexts and geographical locations. The sampling frame was designed to reflect the current demographics of Welsh speakers, so that it represents the contemporary sociolinguistic situation of the language as accurately as possible.

Spoken data was sourced via two main approaches: (i) recruitment of participants to be recorded and (ii) recruitment of participants to contribute spoken data via a novel CorCenCC crowdsourcing app. The scope of (i) included not only research assistants going into the field to record speakers but also participants recording themselves in various interactions. This was facilitated through a network of local 'champions' (active language animateurs in targeted areas) or the Mentrau Iaith (each local authority in Wales has an associated Menter Iaith, i.e. a community-based local language initiative dedicated to raising the profile of the Welsh language). Recruitment for (ii) was achieved by publicising the app (for example through social media, television appearances and publicity materials) in order to reach a different cohort of participants who would be recording individually and in more private domains. Large Welsh language events such as the National Eisteddfod and Tafwyl provided opportunities for the team to reach a large cross-section of participants as well as raise general awareness of the project.

The crowdsourcing app was made available on iOS, Android and via a web interface, and was promoted through media campaigns (e.g. appearances on television programmes such as S4C's Prynhawn Da and on both Welsh- and English-medium radio) and through local engagement events. Promotional material included pens, coasters, leaflets and postcard-size information sheets. An 'unofficial' mascot - based on a cat called Cor-pws - was designed to facilitate the participation of those under 18 and proved popular with contributors of all ages. Facebook and Twitter accounts for CorCenCC were set up in the first months of the project to further enhance the recruitment and participation of contributors.

Novel transcription conventions were devised for processing CorCenCC's spoken data (which was captured via the CorCenCC crowdsourcing app or manually, using audio recording devices). These conventions enabled us to reflect the whole spectrum of dialect/register variation captured in our speech data (making the data more useful to academic researchers) and to represent the participants' speech itself more accurately.

In terms of written data, the good relationship forged at the beginning of the project with Welsh language publishers such as Gwasg y Lolfa led to the incorporation into the corpus of many up-to-date novels and books. A unique source of written data in the Welsh language is the locally based Papurau Bro (i.e. local community Welsh-language newspapers). Our engagement with other project stakeholders during the planning process enabled fairly rapid data capture, for example sampling from the Welsh language academic journal Gwerddon through the Coleg Cymraeg Cenedlaethol, and from adult L2 pedagogical resources and examination papers through the Welsh Joint Education Committee.

Regarding e-language data, website owners and blog authors cooperated generously and targets were exceeded. Contributors of SMS messages and emails were recruited in the same way as for the spoken data.

All relevant participant information and descriptive metadata was recorded at the time of data collection. Permissions to share the data in an online public resource were essential to the development of CorCenCC. These permissions were obtained from the relevant legal entities (e.g. the copyright owner; the speaker themselves) before the data was collected and locally stored. The raw data together with the corresponding permissions and metadata were deposited into a local file storage system.

Grant number

ES/M011348/1

Publisher

UK Data Service

Publication year

2021

Terms of data access

The Data Collection is available from an external repository. Access is available via Related Resources.

Related publications

Not available
