Speaker
Description
Metabarcoding of the 16S rRNA gene is widely used to assess microbial diversity due to its cost-effectiveness and efficiency. However, publicly available 16S rRNA metabarcoding datasets often lack standardized metadata, particularly information on the sequenced hypervariable regions or primers used, which are critical to their accurate reuse. To address this, we present HVRLocator, a computational tool that (1) identifies the start and end positions of 16S rRNA amplicons, (2) determines their corresponding hypervariable regions, and (3) detects the presence of primer sequences. This tool was validated on four datasets comprising 41,513 samples generated with different primers and sequencing platforms.
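The last two steps described above can be illustrated with a minimal Python sketch. It is not HVRLocator's implementation: the function names are hypothetical, the region boundaries are approximate, commonly cited E. coli 16S rRNA coordinates, and the 50% overlap threshold and 30-base search window are arbitrary assumptions. It assumes the amplicon start and end positions on the reference have already been obtained by alignment.

```python
import re

# Approximate hypervariable-region boundaries in E. coli 16S rRNA numbering.
# Assumption: commonly cited values; HVRLocator may use different boundaries.
HV_REGIONS = {
    "V1": (69, 99), "V2": (137, 242), "V3": (433, 497),
    "V4": (576, 682), "V5": (822, 879), "V6": (986, 1043),
    "V7": (1117, 1173), "V8": (1243, 1294), "V9": (1435, 1465),
}

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def regions_covered(start, end, min_overlap=0.5):
    """Return regions with at least `min_overlap` of their length inside [start, end]."""
    covered = []
    for name, (r_start, r_end) in HV_REGIONS.items():
        overlap = min(end, r_end) - max(start, r_start) + 1
        if overlap > 0 and overlap / (r_end - r_start + 1) >= min_overlap:
            covered.append(name)
    return covered


def has_primer(read, primer, window=30):
    """Return True if the degenerate primer matches (no mismatches allowed)
    starting within the first `window` bases of the read."""
    pattern = re.compile("".join(IUPAC[base] for base in primer.upper()))
    return pattern.search(read.upper()[: window + len(primer)]) is not None
```

For example, `regions_covered(515, 806)` returns `['V4']` and `regions_covered(341, 806)` returns `['V3', 'V4']`. A read beginning with `GTGCCAGCAGCCGCGGTAA` matches the degenerate primer `GTGYCAGCMGCCGCGGTAA`, so `has_primer` returns `True`, which would indicate that primers were not trimmed before the data were archived. A production tool would likely also tolerate mismatches and check the reverse complement.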
HVRLocator can process archived 16S rRNA sequences from NCBI SRA at an average rate of 6.5 samples per minute. Validation showed it reliably detects amplicon start and end positions across datasets sequenced with different primers and platforms, achieving 100% accuracy within single-platform studies and correctly revealing length heterogeneity across platforms. It also flagged misannotated metadata and problematic sequences, underscoring its value as a sequence data curation tool. Finally, HVRLocator can select comparable sequences to build large 16S rRNA amplicon databases spanning the same hypervariable region, facilitating cross-study comparisons.
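The selection of comparable sequences mentioned above could look roughly like the following sketch. It assumes each sample has already been summarized by its amplicon start and end positions on a common reference; the `AmpliconSummary` type, the field names, and the `tolerance` parameter are hypothetical and are not taken from HVRLocator's output format.

```python
from dataclasses import dataclass


@dataclass
class AmpliconSummary:
    """Hypothetical per-sample summary: amplicon coordinates on the reference."""
    sample_id: str
    start: int
    end: int


def spans_region(sample, region_start, region_end, tolerance=0):
    """True if the amplicon covers [region_start, region_end],
    allowing `tolerance` missing bases at either edge."""
    return (sample.start <= region_start + tolerance
            and sample.end >= region_end - tolerance)


def select_comparable(samples, region_start, region_end, tolerance=0):
    """Keep only samples whose amplicon spans the whole target region."""
    return [s for s in samples
            if spans_region(s, region_start, region_end, tolerance)]


if __name__ == "__main__":
    samples = [
        AmpliconSummary("sample_A", 515, 806),  # covers the target
        AmpliconSummary("sample_B", 341, 806),  # longer amplicon, still covers it
        AmpliconSummary("sample_C", 799, 1193),  # different region
    ]
    kept = select_comparable(samples, 576, 682)
    print([s.sample_id for s in kept])
```

Samples that pass the filter would still need to be trimmed to the shared coordinates before they can be compared directly across studies.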
In conclusion, this tool overcomes unreliable metadata by accurately identifying 16S rRNA amplicon start and end positions, determining hypervariable regions, and detecting primer sequences, thereby enabling accurate curation and large-scale processing of 16S rRNA data for reliable and reproducible microbial studies, syntheses, and meta-analyses.