Bioinformatics

Writing a range of algorithms for Bioinformatics

DNA Alignment

This takes two DNA sequences and produces the optimal alignment. This is done by filling out a backtracking matrix.

This has a simple piece of driver code which loops through all the cells of the backtrack matrix

for a in range(1,len(seq1)+1):
    for b in range(1,len(seq2)+1):
        fill_cell(m,backtrack,seq1,seq2,a,b)

The fill_cell function then determines the entry in the backtracking matrix and the maximum score of a matching

def fill_cell(m,backtrack,seq1,seq2,a,b):
    # Here m is the matrix and [b,a] is the location of the cell to fill out
    # Diagonal
    max=score(a,b,seq1,seq2)+m[b-1,a-1]
    pos='D'
    # Up
    temp=m[b-1,a]-2
    if temp>max:
        max=temp
        pos='U'
    # Left
    temp=m[b,a-1]-2
    if temp>max:
        max=temp
        pos='L'
    m[b,a]=max
    backtrack[b,a]=pos
    return

The rules for scoring were given in the assignment, and are implemented using the score function

def score(a,b,seq1,seq2):
    # This function finds the score of the matching
    index_a=a-1
    index_b=b-1
    if seq1[index_a]==seq2[index_b]:
        if seq1[index_a]=='A':
            return 4
        if seq1[index_a]=='C':
            return 3
        if seq1[index_a]=='G':
            return 2
        if seq1[index_a]=='T':
            return 1
        else:
            print(str(seq1[index_a])+str(seq2[index_b])+str(index_a)+str(index_b))
    else:
        return -3

The best alignment is then generated using the following function by moving through the backtracking matrix

def gen_seq(backtrack,str1,str2):
    coord=(len(str2),len(str1))
    matchstring1=''
    matchstring2=''
    while coord!=(0,0):
        if backtrack[coord]=='D':
            coord=(coord[0]-1,coord[1]-1)
            matchstring1=matchstring1+str1[-1]
            str1=str1[:-1]
            matchstring2 = matchstring2 + str2[-1]
            str2 = str2[:-1]
        if backtrack[coord]=='U':
            coord=(coord[0]-1,coord[1])
            matchstring1 = matchstring1 + '-'
            matchstring2 = matchstring2 + str2[-1]
            str2 = str2[:-1]
        if backtrack[coord]=='L':
                coord=(coord[0],coord[1]-1)
                matchstring1 = matchstring1 + str1[-1]
                str1 = str1[:-1]
                matchstring2 = matchstring2 + '-'
    matchstring1=matchstring1[::-1]
    matchstring2 = matchstring2[::-1]
    return [matchstring1,matchstring2]

Drawing a phylogenetic tree

This task is to generate a phylogenetic tree based on an input distance matrix

For brevity I have removed the code which formats the input from the file so that it can be worked on along with the code to output the tree.

def WPGMA(filename):
    while table.shape != (2, 2):
        # Find the minimum value
        minval = np.min(table[np.nonzero(table)])
        # Find its coordinates
        itemindex = np.where(table == minval)
        # Merge the two species corresponding to those coordinates
        table = mergespecies(table, itemindex[0][0], itemindex[0][1], names, names2)
        nametable=[str(name) for name in names2]
        stack1=np.array(nametable)
        stacktotal=np.vstack((stack1,table))
        print(stacktotal)

And the function to merge species is as follows:

def mergespecies(table, a, b, names, names2):
    depth = lambda L: (isinstance(L, list) and (max(map(depth, L)) + 1) if L else 1) or 0
    if depth(names[a])<depth(names[b]):
        names[a],names[b]=names[b],names[a]
    # print("Merge:" + str(names[a]) + " and " + str(names[b]))
    graphmerge(names[a],names[b])
    names2[a]=names2[a]+names2[b]
    names2.remove(names2[b])
    sublist = [names[a], names[b]]
    names.remove(names[b])
    names.remove(names[a])
    names.insert(a, sublist)
    column1 = table[:, a:a + 1]
    column2 = table[:, b:b + 1]
    # Combine the two columns
    combine = np.hstack((column1, column2))
    # Get the mean of all the rows
    combine = combine.reshape(-1, 2).mean(axis=1).reshape(combine.shape[0], -1)
    # Delete the rows corresponding to the two species
    combine = np.delete(combine, [a, b], 0)
    # Delete the rows/columns corresponding to the two species from the main table
    table = np.delete(table, [a, b], 0)
    table = np.delete(table, [a, b], 1)
    # Append the amended column to the main table
    combine = np.transpose(combine)
    table = np.insert(table, min(a, b), combine, axis=1)
    # Add a zero the the column and transpose it
    combine = np.insert(combine, min(a, b), [0], axis=1)
    # Add this row to the bottom of the main table
    table = np.insert(table, min(a, b), combine, axis=0)
    # Return the table back to the main function
    return table