Chipola: A Chinese Podcast Lexical Database for capturing spoken language nuances and predicting behavioral data
Behav Res Methods. 2025 May 8;57(6):166. doi: 10.3758/s13428-025-02697-0.
ABSTRACT
This study introduces Chipola, a Chinese Podcast Lexical Database derived from a large-scale collection of Chinese podcast transcripts. Because podcasts are spoken, a lexical database built from them can accurately capture the nuances of spoken Chinese. Chipola is based on a corpus of 31.2 million word tokens and 41.7 million character tokens, with a vocabulary of 88,085 unique words and 4,613 unique characters. It also provides lexical variables such as frequency, context diversity, and part-of-speech information. Three findings stand out. First, Chipola captures features of spoken Chinese, such as the core spoken vocabulary. Second, it outperforms other lexical databases in predicting third-party behavioral data. Third, its rich text-level information enables educators to simulate the Chinese lexical input received from daily podcast listening, which offers pedagogical insights into the overall effects of language exposure. In summary, Chipola is an innovative and valuable resource with significant implications and applications in areas such as psychology and language education.
PMID:40341999 | DOI:10.3758/s13428-025-02697-0
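
The abstract names frequency and context diversity among Chipola's lexical variables but does not say how they are computed. As a rough illustration only, the sketch below assumes the common convention in which frequency is normalized per million tokens and context diversity is the proportion of documents (here, transcripts) containing a word. The function name `lexical_stats`, the output field names, and the toy data are hypothetical and are not taken from Chipola or its paper.

```python
from collections import Counter


def lexical_stats(documents):
    """Compute simple lexical variables from pre-segmented transcripts.

    documents: list of token lists, one list per transcript.
    Assumption: context diversity = share of transcripts containing the word.
    """
    freq = Counter()       # total occurrences of each word
    doc_count = Counter()  # number of transcripts containing each word
    for tokens in documents:
        freq.update(tokens)
        doc_count.update(set(tokens))  # count each word once per transcript

    total_tokens = sum(freq.values())
    n_docs = len(documents)

    stats = {}
    for word, count in freq.items():
        stats[word] = {
            "count": count,
            "per_million": count / total_tokens * 1_000_000,
            "context_diversity": doc_count[word] / n_docs,
        }
    return stats


# Toy, hand-made transcripts (not Chipola data), already word-segmented.
toy = [["我", "觉得", "很", "好"], ["我", "不", "知道"], ["很", "好"]]
stats = lexical_stats(toy)
```

Real Chinese transcripts would first need word segmentation, which this sketch takes as given; the choice of segmenter affects both word counts and vocabulary size.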