Chipola: A Chinese Podcast Lexical Database for capturing spoken language nuances and predicting behavioral data
2025-05-11
Summary:
This study introduces Chipola, a Chinese Podcast Lexical Database derived from a large-scale collection of Chinese podcast transcripts. Because podcasts are spoken, a lexical database built from them is well placed to capture the nuances of spoken Chinese. Chipola was built from a corpus of 31.2 million word tokens and 41.7 million character tokens, with a vocabulary of 88,085 unique words and 4,613 unique characters. Lexical variables such as frequency,...
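As a rough illustration of how word and character token counts, vocabulary sizes, and a frequency-per-million variable could be derived from transcripts, here is a minimal Python sketch. It assumes the text has already been word-segmented with spaces; the segmentation step, the function names `lexical_stats` and `per_million`, and the toy sentences are illustrative assumptions, not details of the Chipola pipeline.

```python
from collections import Counter


def lexical_stats(segmented_lines):
    """Count word and character occurrences in space-segmented text.

    Assumption: each line is already segmented into words separated by
    whitespace. The summary does not say how Chipola segments its text.
    """
    word_counts = Counter()
    char_counts = Counter()
    for line in segmented_lines:
        for word in line.split():
            word_counts[word] += 1
            # Every character inside a word counts as one character token.
            char_counts.update(word)
    return word_counts, char_counts


def per_million(counts):
    """Convert raw counts to occurrences per million tokens."""
    total = sum(counts.values())
    return {item: n * 1_000_000 / total for item, n in counts.items()}


# Toy, made-up input (not data from the corpus).
toy = ["我们 今天 聊聊 播客", "今天 天气 很 好"]
words, chars = lexical_stats(toy)

print("word tokens:", sum(words.values()), "unique words:", len(words))
print("character tokens:", sum(chars.values()), "unique characters:", len(chars))
print("frequency per million of 今天:", per_million(words)["今天"])
```

Token counts are the sums of the counters and vocabulary sizes are their lengths, which is the same distinction the summary draws between word or character tokens and unique words or characters.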