Algorithm

Algorithm of AF-1

Splinting:

Let  S be a sequence of 282 characters, which define the Alu Sx subfamily and Qm be the sequence of m characters, which define the query sequence.

Sset(alu) = { S1-S6, S2-S7, S3-S8,....,S276-S282}, which is a set of overlapping subsequences derived from the Alu Sx subfamily.

Similarly another set is derived from the query sequence as Sset(query) = {Q1-Q6,Q2-Q7,...,Qm-6,Qm}.

For i=0 from  Sset(alu):
                        if Sset,i ∈  Sset(query)
                        set i = seed for Alu
                        if i > 300:
                                    AR = Qi-300 - Qi+300
                        else:
                                    AR = Q1-Qi+300
                                Repeat for all values of i and define Alu regions in non-overlapping manner.

Where AR = Alu Probable Region with size l;

Alignment between Alu Sx sequence S and AR:
1. Initialize all cells of alignment matrix M{i*j} equal to 0 with maximum value for i=l and j=Alulength.

2. For i=1 till i=282 and j=1 till j=l:
            if  AR,i == Sj, assign Mi,j=1

3. Search for longest diagonal region (LDR) of 1 i.e.

4. Anchor the rest of alignment process in the diagonal region left upside(LU) and diagonal region at bottom right(BR).

5. Global alignment is selected for the smaller subregion between LU and BR while NW local is employed for larger.

6. If similarity score of overall alignment crosses threshold Talign, select the aligned region as Alu.

7. Else scan for all overlapping stretched of thirty window size for similarity crossing Talign and if returns true, subject AR for probabilistic scanning, otherwise reject the region as Alu like.

Probabilistic Scanning:
About 5000 Alu sequences comprising all possible Alu subfamilies, a multiple sequence alignment was generated. From this alignment, position weight matrix, Mat_Alu for Alu was derived for five different states: A,T,G, and Indel.

This way for every overlapping position R in the aligned query sequence, PR, is estimated and placed in the sequence. Values crossing some threshold '”” are considered as potential Alu like region. The Threshold value has been derived from training data over Alu and non Alu sequences. Usually a region giving monotonously increasing values of PR followed by near consistent values of  PR crossing threshold values followed by monotonously decreasing  PR values for subsequent regions mark the probable Alu type SINE region in the sequence.

Diagnostic nucleotides anchored encoded sequence alignment based classification:

Classification of Alu elements require care as classifying them simply on the basis of alignment score could be misleading. This is because of the fact that most of the regions of almost all Alu subfamilies are very similar to each other as well as not every position in the sequences are deterministic of about the subfamilies of Alu. Based upon only certain diagnostic positions different Alu subfamilies have been classified.

This all implies that a simple alignment approach for classification may be misleading in classifying Alus correctly and one needs to give more weights to subfamilies specific diagnostic positions instead of flat weighting in alignments. We have achieved this through our program which utilizes the family specific pattern encode generation for classification which retain the minutest of family specific details like spacing between various positions and pattern landmarks besides pickign up diagnostic positions.

Suppose 'S' be the sequence to be classified where,
S = S1S2S3.......SN and S' = S'1S'2S'3.......S'N

Let S' be the reference string to which all sequences including S are aligned to derive encoded patterns. 213 different classes of Alu sequences too align to this reference sequence and generate encoded pattern specific to the subfamily. N is the length of the alignment.

For i=1 untill N:
            if Si == S'i:
                        Put the weight at i equal to non-weighted position or 0. Mask nucleotide at i.
            elif  Si =/= S'i:
                        Put higher weight for S at i and retain Si nucleotide at ith position.

Output encoded sequence of masked and unmasked characters with weights i.e.  Wq= W1W2....WN
Repeat for all 213 Alu subfamilies and generate Wi = {W1,....,W213}.

Now compare Wq with every family encoded pattern Wi through alignments using matrix developed for simple non nucleotide characters mixed with weighted position specific nucleotides. The alignments at non-diagnostic position this way don't influence the alignment and thus classification and an overall better classification is obtained. Higher weighting of diagnostic positions anchor the sequence alignment and avoid generalized alignment.


@ 2008, Indian Institute Of Advanced Research