Description

Pairwise sequence alignment is a fundamental operation in bioinformatics. Gotoh's algorithm for this purpose is being widely used and perhaps more importantly, implemented. A plethora of distinct formulations exist for this algorithm which, as we show, cause confusion in the field. More importantly, this confusion leads to numerous implementation issues and errors, which are not known to the typical end-user. Foremost, we point out two mathematical errors in Gotoh's paper from 1982. First, there are minor mistakes in the indices of the dynamic programming algorithm. Second, we describe a more severe problem with the initialization of the dynamic programming matrix for global sequence alignment. While the index issue becomes apparent immediately when implementing the procedure, the initialization issue is more involved and can easily be overseen. This observation is corroborated by several text books and lecture notes, where the initialization issue is either present, or circumvented via additional assumptions. The above two issues can, as we show, generate incorrectly aligned sequences. Furthermore, five out of ten implementations that we analyzed yield sub-optimal alignments as well. Finally, we provide a correct version of the algorithm.

Available at: qualign.tar.gz