A simple measure of dissimilarity between secondary structures of equal length is the base pair distance, given by the number of pairs present in only one of the two structures being compared. I.e. the number of base pairs that have to be opened or closed to transform one structure into the other. It is therefore particularly useful for comparing structures on the same sequence. It is implemented by
int bp_distance(const char *str1, const char *str2)
Compute the "base pair" distance between two secondary structures s1 and s2.
For other cases a distance measure that allows for gaps is preferable. We can define distances between structures as edit distances between trees or their string representations. In the case of string distances this is the same as "sequence alignment". Given a set of edit operations and edit costs, the edit distance is given by the minimum sum of the costs along an edit path converting one object into the other. Edit distances like these always define a metric. The edit operations used by us are insertion, deletion and replacement of nodes. String editing does not pay attention to the matching of brackets, while in tree editing matching brackets represent a single node of the tree. Tree editing is therefore usually preferable, although somewhat slower. String edit distances are always smaller or equal to tree edit distances.
The different level of detail in the structure representations defined above naturally leads to different measures of distance. For full structures we use a cost of 1 for deletion or insertion of an unpaired base and 2 for a base pair. Replacing an unpaired base for a pair incurs a cost of 1.
Two cost matrices are provided for coarse grained structures:
/* Null, H, B, I, M, S, E */ { 0, 2, 2, 2, 2, 1, 1}, /* Null replaced */ { 2, 0, 2, 2, 2, INF, INF}, /* H replaced */ { 2, 2, 0, 1, 2, INF, INF}, /* B replaced */ { 2, 2, 1, 0, 2, INF, INF}, /* I replaced */ { 2, 2, 2, 2, 0, INF, INF}, /* M replaced */ { 1, INF, INF, INF, INF, 0, INF}, /* S replaced */ { 1, INF, INF, INF, INF, INF, 0}, /* E replaced */ /* Null, H, B, I, M, S, E */ { 0, 100, 5, 5, 75, 5, 5}, /* Null replaced */ { 100, 0, 8, 8, 8, INF, INF}, /* H replaced */ { 5, 8, 0, 3, 8, INF, INF}, /* B replaced */ { 5, 8, 3, 0, 8, INF, INF}, /* I replaced */ { 75, 8, 8, 8, 0, INF, INF}, /* M replaced */ { 5, INF, INF, INF, INF, 0, INF}, /* S replaced */ { 5, INF, INF, INF, INF, INF, 0}, /* E replaced */
The lower matrix uses the costs given in [21]. All distance functions use the following global variables:
int cost_matrix;
Specify the cost matrix to be used for distance calculations.
int edit_backtrack;
Produce an alignment of the two structures being compared by tracing the editing path giving the minimum distance.
char *aligned_line[4];
Contains the two aligned structures after a call to one of the distance functions with edit_backtrack set to 1.
Tree *make_tree (char *struc)
Constructs a Tree ( essentially the postorder list ) of the structure 'struc', for use in tree_edit_distance().
float tree_edit_distance (Tree *T1, Tree *T2)
Calculates the edit distance of the two trees.
void free_tree(Tree *t)
Free the memory allocated for Tree t.
swString *Make_swString (char *string)
Convert a structure into a format suitable for string_edit_distance().
float string_edit_distance (swString *T1, swString *T2)
Calculate the string edit distance of T1 and T2.
For comparison of base pair probability matrices, the matrices are first condensed into probability profiles which are the compared by alignment.
float *Make_bp_profile_bppm ( double *bppm, int length)
condense pair probability matrix into a vector containing probabilities for unpaired, upstream paired and downstream paired.
float profile_edit_distance ( const float *T1, const float *T2)
Align the 2 probability profiles T1, T2
.