PolyA_DB v4: systematic polyA site identification and isoform annotation in human and mouse genomes using 3' end and long-read sequencing data
Nucleic Acids Res. 2026 Jan 6;54(D1):D247-D254. doi: 10.1093/nar/gkaf1212.
ABSTRACT
The cleavage and polyadenylation site (PAS) defines the 3' end of almost all protein-coding and long non-coding RNAs in eukaryotes. Most genes harbor multiple PAS, resulting in expression of alternative polyadenylation (APA) isoforms. Here, we present PolyA_DB version 4 (https://exon.apps.wistar.org/polya_db/v4/), an updated database dedicated to PAS in mammalian genomes. By exhaustive mining of human and mouse transcriptomic data sets generated by the 3' region extraction and deep sequencing plus (3'READS+) method, corresponding to ∼2.3 billion PAS-supporting reads for each species, we identify ∼1.4 million PAS in both human and mouse genomes, increasing PAS coverage over the last database version by 4.9- and 3.5-fold, respectively. Of the full PAS set (named the Max collection), 20% match the transcript end sites (TES) of public long-read RNA sequencing (LR-RNA-seq) data. Notably, ∼10%-20% of LR-RNA-seq TES do not match our annotated PAS, suggesting 3' end artifacts that plausibly derive from internal A-rich regions of RNA. However, LR-RNA-seq data substantially complement RefSeq-based assignment of PAS to genes and are highly valuable in subtyping APA events in the context of splicing configuration. PolyA_DB v4 also contains PAS conservation and PAS strength information and is linked to the UCSC Genome Browser for data visualization.
PMID:41316728 | PMC:PMC12807684 | DOI:10.1093/nar/gkaf1212
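
The comparison between LR-RNA-seq TES and annotated PAS described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not state the matching rule, so the strand-aware nearest-site lookup, the 24-nt tolerance (`window`), the function names, and the toy coordinates are all assumptions chosen only for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def build_pas_index(pas_sites):
    """Group PAS positions by (chromosome, strand) and sort each group."""
    index = defaultdict(list)
    for chrom, strand, pos in pas_sites:
        index[(chrom, strand)].append(pos)
    for positions in index.values():
        positions.sort()
    return index


def nearest_pas_distance(index, chrom, strand, tes_pos):
    """Distance from a TES to the closest same-strand PAS, or None if there is none."""
    positions = index.get((chrom, strand))
    if not positions:
        return None
    i = bisect_left(positions, tes_pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = positions[max(i - 1, 0): i + 1]
    return min(abs(tes_pos - p) for p in candidates)


def split_tes_by_match(index, tes_sites, window=24):
    """Split TES into those within `window` nt of a PAS and those that are not.

    The 24-nt default is a hypothetical tolerance, not a value from the abstract.
    """
    matched, unmatched = [], []
    for chrom, strand, pos in tes_sites:
        d = nearest_pas_distance(index, chrom, strand, pos)
        target = matched if d is not None and d <= window else unmatched
        target.append((chrom, strand, pos))
    return matched, unmatched


# Toy coordinates, invented for illustration only.
pas = [("chr1", "+", 1000), ("chr1", "+", 5000), ("chr1", "-", 3000)]
tes = [("chr1", "+", 1010), ("chr1", "+", 2500), ("chr1", "-", 2990), ("chr2", "+", 100)]

pas_index = build_pas_index(pas)
matched, unmatched = split_tes_by_match(pas_index, tes)
unmatched_fraction = len(unmatched) / len(tes)
```

In this sketch, a TES left in `unmatched` would be a candidate 3' end artifact of the kind the abstract attributes to internal A-rich regions; confirming that would require inspecting the downstream genomic sequence, which is not shown here.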