Information & Signal Processing Laboratory

The commercial analysis of biological sequence data, and with it large-scale biological big data, began with Applied Biosystems' Sanger sequencers. The arrival of Next-Generation Sequencing (NGS) instruments made the data volumes far larger. NGS platforms such as Roche, Illumina, PacBio, and SOLiD each came with their own analysis tools, and this variety has caused considerable confusion among bioinformatics staff. NGS-based 16S rRNA gene analysis has likewise moved from Roche's 454 (GS-FLX) and GS-Junior to Illumina MiSeq, adding further complexity for bioinformatics professionals.

To address this, we started a project aimed at reducing confusion among laboratory staff and improving work efficiency. As the first step, we developed a Python-based computer vision application that automatically detects the truncation positions of the read1 and read2 sequences for the DADA2 pipeline, which is used to remove errors from 16S rRNA gene amplicon sequencing results generated by Illumina MiSeq. The application does not analyze FASTQ sequence files directly. Instead, it extracts images from the HTML files that the FastQC pipeline produces as QC results, and it recognizes the truncation position from those images.

The tool is named "PixelCut". Its only input is the two FastQC HTML result files for read1 and read2. The source code is available on GitHub: https://github.com/eastbrain/PixelCut. PixelCut has been tested on 64-bit Ubuntu Linux servers and on 64-bit Linux running on Windows through WSL (Windows Subsystem for Linux).
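The image-extraction step described above can be illustrated with a minimal sketch. It is not PixelCut's actual implementation. It assumes that a FastQC HTML report stores its plots as base64 data URIs inside `<img>` tags and that the quality plot can be identified by its `alt` text. The function names `extract_embedded_images` and `find_image` are hypothetical and chosen only for this example.

```python
import base64
import re

# Patterns for <img> tags and their attributes; attribute order is not assumed.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
SRC_ATTR = re.compile(
    r'src="data:image/(?P<fmt>[a-zA-Z0-9.+-]+);base64,(?P<data>[^"]+)"',
    re.IGNORECASE,
)
ALT_ATTR = re.compile(r'alt="(?P<alt>[^"]*)"', re.IGNORECASE)


def extract_embedded_images(html):
    """Return (alt_text, image_format, raw_bytes) for every inline base64 image."""
    images = []
    for tag in IMG_TAG.finditer(html):
        src = SRC_ATTR.search(tag.group(0))
        if src is None:
            continue  # not an inline data URI (e.g. a linked file), so skip it
        alt = ALT_ATTR.search(tag.group(0))
        alt_text = alt.group("alt") if alt else ""
        raw = base64.b64decode(src.group("data"))
        images.append((alt_text, src.group("fmt").lower(), raw))
    return images


def find_image(html, alt_keyword):
    """Return the bytes of the first inline image whose alt text contains
    alt_keyword (case-insensitive), or None if there is no such image."""
    keyword = alt_keyword.lower()
    for alt_text, _fmt, raw in extract_embedded_images(html):
        if keyword in alt_text.lower():
            return raw
    return None
```

Under these assumptions, one would read each of the two report files into a string, call `find_image(html, "per base quality")`, and pass the returned bytes to an image library for the actual position detection. The alt text used here is an assumption about the report layout and should be checked against a real FastQC file.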

Welcome to the PixelCut Project