The national corpus of contemporary Welsh, 2016-2020
Creator
Knight, D, Cardiff University
Morris, S, Swansea University
Fitzpatrick, T, Swansea University
Rayson, P, Lancaster University
Spasić, I, Cardiff University
Thomas, E, Bangor University
Lovell, A, Swansea University
Morris, J, Cardiff University
Evas, J, Cardiff University
Stonelake, M, Swansea University
Arman, L, Cardiff University
Davies, J, Bangor University
Ezeani, I, Lancaster University
Neale, S, Cardiff University
Needs, J, Swansea University
Piao, S, Lancaster University
Rees, M, Swansea University
Watkins, G, Cardiff University
Williams, L, Cardiff University
Muralidaran, V, Cardiff University
Tovey-Walsh, B, Swansea University
Anthony, L, Waseda University
Cobb, T, University of Quebec at Montreal
Deuchar, M, University of Cambridge
Donnelly, K, N/A
McCarthy, M, The University of Nottingham
Scannell, K, Saint Louis University
Study number / PID
854531 (UKDA)
10.5255/UKDA-SN-854531 (DOI)
Data access
Open
Series
Not available
Abstract
The CorCenCC corpus contains over 11 million words (circa 14.4m tokens). CorCenCC is the first corpus of the Welsh language that covers all three aspects of contemporary Welsh: spoken, written and electronically mediated (e-language). It offers a snapshot of the Welsh language across a range of contexts of use, e.g. private conversations, group socialising, business and other work situations, in education, in the various published media, and in public spaces. It includes examples of news headlines, personal and professional emails and correspondence, academic writing, formal and informal speech, blog posts and text messaging. Language data was sampled from a range of different speakers and users of Welsh, from all regions of Wales, of all ages and genders, with a wide range of occupations, and with a variety of linguistic backgrounds (e.g. how they came to speak Welsh), to reflect the diversity of text types and of Welsh speakers found in contemporary Wales. In this way, the CorCenCC corpus provides the means for empowering users of Welsh to better understand and observe the language across diverse settings, and creates a solid evidence base for the teaching of contemporary Welsh to those who aspire to use it. Over time, the corpus has the potential to make a significant contribution to the transformation of Welsh as the language of public, commercial, education and governmental discourse.
Beta versions of some bilingual corpus query tools have also been created as part of the CorCenCC project (see Related Resources). These include simple query, full query, frequency list, n-gram, keyword and collocation functionalities. The CorCenCC website also contains Y Tiwtiadur, a collection of data-driven teaching and learning tools designed to supplement Welsh language learning at all ages and levels. Y Tiwtiadur contains four distinct corpus-based exercises: Gap Filling (Cloze), Vocabulary Profiler, Word Identification and Word-in-Context. To access this...
Terminology used is generally based on DDI controlled vocabularies: Time Method, Analysis Unit, Sampling Procedure and Mode of Collection, available at CESSDA Vocabulary Service.
Methodology
Data collection period
01/03/2016 - 30/11/2020
Country
Wales
Time dimension
Not available
Analysis unit
Individual
Organization
Family
Family: Household family
Universe
Not available
Sampling procedure
Not available
Kind of data
Text
Data collection mode
A sampling frame was created to underpin the data collection for the project, to ensure that we captured a range of different speakers across different discourse contexts and geographical locations. The sampling frame was designed to reflect the current demographics of Welsh speakers, so that it represents the contemporary sociolinguistic situation of the language as accurately as possible.
Spoken data was sourced via two main approaches: (i) recruitment of participants to be recorded and (ii) recruitment of participants to contribute spoken data via a novel CorCenCC crowdsourcing app. The scope of (i) included not only research assistants going into the field to record speakers but also participants recording themselves in various interactions. This was facilitated through a network of local 'champions' (active language animateurs in targeted areas) or the Mentrau Iaith (local language initiatives; each local authority in Wales has an associated Menter Iaith, i.e. a community-based organisation dedicated to raising the profile of the Welsh language). Recruitment for (ii) was achieved by publicising the app (for example through social media, television appearances and publicity materials) in order to reach a different cohort of participants who would be recording individually and in more private domains. Large Welsh language events such as the National Eisteddfod and Tafwyl provided opportunities for the team to reach a large cross-section of participants as well as to raise general awareness of the project.
The crowdsourcing app was made available on iOS, Android and via a web interface, and was promoted through media campaigns (e.g. appearances on television programmes such as S4C's Prynhawn Da and on both Welsh- and English-medium radio) and through local engagement events. Promotional material included pens, coasters, leaflets and postcard-size information sheets. An 'unofficial' mascot, based on a cat called Cor-pws, was designed to facilitate the participation of those under 18 and proved popular with contributors of all ages. Facebook and Twitter accounts for CorCenCC were set up in the first months of the project to further enhance the recruitment and participation of contributors.
Novel transcription conventions were devised for processing CorCenCC's spoken data (which was captured via the CorCenCC crowdsourcing app or manually, using audio recording devices). These conventions enabled us to reflect the whole spectrum of dialect/register variation captured in our speech data (making the data more useful to academic researchers) and to represent the participants' speech more accurately.
In terms of written data, the good relationship forged at the beginning of the project with Welsh language publishers such as Gwasg y Lolfa led to the incorporation into the corpus of many up-to-date novels and books. A unique source of written data in the Welsh language is the locally based Papurau Bro (i.e. local community Welsh-language newspapers). Engagement with other project stakeholders during the planning process enabled fairly rapid data capture, for example sampling from the Welsh language academic journal Gwerddon through the Coleg Cymraeg Cenedlaethol, and from adult L2 pedagogical resources and examination papers through the Welsh Joint Education Committee.
Regarding e-language data, website owners and blog authors cooperated generously and targets were exceeded. Contributors of SMS messages and emails were recruited in the same way as for the spoken data. All relevant participant information and descriptive metadata were recorded at the time of data collection. Permissions to share the data in an online public resource were essential to the development of CorCenCC. These permissions were obtained from the relevant legal entities (e.g. the copyright owner; the speaker themselves) before the data was collected and locally stored. The raw data, together with the corresponding permissions and metadata, were deposited into a local file storage system.
Funding information
Grant number
ES/M011348/1
Access
Publisher
UK Data Service
Publication year
2021
Terms of data access
The Data Collection is available from an external repository. Access is available via Related Resources.