Repetitive Elements as Time-Series Genomic Data

            Sridhar Ramachandran+, Travis Doom+, Michael Raymer+ and Dan Krane*
                                        +Department of Computer Science and Engineering
                                                        *Department of Biological Sciences
                                                                   Wright State University

                                    {ramachandran.4, travis.doom, michael.raymer, dan.krane}@wright.edu
 




1. About

            Bioinformatics provides researchers with a scientific framework to computationally explore, analyze and manage large volumes of genomic data. This interdisciplinary field strives to determine what information is biologically important and to decipher how it is used, and it specializes in the analysis of large quantities of biological data, particularly sequence data. Knowing when, why and how genetic changes occurred would help answer several open problems in bioinformatics. Human DNA has accumulated genomic changes through mutation over time, and differences in the genomic content of different organisms hold the key to their adaptation for survival. To understand the nature of biological diversity, we need to monitor nucleotide sequences within our genome that allow detailed examination of the mode and pattern of evolution that has shaped our genetic instructions over millions of years. Sequence alignment techniques and pattern recognition methods require genetic clues, used as markers, to address these unanswered questions.

 

           DNA sequences whose progenitor sequences are known can serve as genetic markers. By recording and decoding the changes in these markers, past events can be traced, which in turn helps answer several open questions in bioinformatics. Approximately 98% of the eukaryotic genome is made up of non-coding regions that do not code for proteins. These sequences are widely believed to be mostly useless, selfish DNA left over from past evolutionary permutations. The late Susumu Ohno coined the term ‘Junk DNA’ to describe these non-coding sequences. Of the four major kinds of ‘Junk DNA’ (introns, pseudogenes, satellite sequences and interspersed repeats), we discuss how repetitive elements can be used as time-series data to help solve open problems in bioinformatics. The paper has been submitted for review at the BIBE 2005 conference.



2. Experiments                

Alu Progenitor Sequences           Alu Insertion Detection          Alu-within-Alu search

    This section explains the experiments discussed in the paper in detail and is intended as supplementary material for the paper. All material required to recreate the experiments is provided here, and the algorithms are also available in the Algorithms section.

 

    The three sets of experiments from the paper are given as three separate subsections as shown above. The hyperlinked Menu can be used to access the experiments directly. 

-------------------------------------------------------------------------------------------------------------------------------------------------------------

A. Alu Progenitor Sequence

    This part of the study used the 213 Alu sequences provided as supplementary material at the Genome Research website and the 12 remaining sequences available in the Repbase database. The GC content (GC %) of each individual Alu source element was calculated with and without CpG masking. The 225 Alu elements were ranked for statistical analysis, and ANOVA and correlation analyses were conducted on the sequence data. The same statistical experiment was repeated with CpG dinucleotides masked on the consensus sequences. The results from this experiment are discussed below. The algorithm and data file used in this experiment are included in the Algorithms section.
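
    As a rough illustration of the GC % calculation described above, the following minimal sketch computes GC content with and without CpG masking. It is not the algorithm file distributed with this page. The function name gc_percent and the masking rule (every CG dinucleotide is replaced by N and excluded from the count) are assumptions made for this example, and the sequence shown is a toy string rather than a real Alu.

```python
def gc_percent(seq: str, mask_cpg: bool = False) -> float:
    """Return the GC content of `seq` as a percentage of the counted bases.

    With mask_cpg=True, every CG dinucleotide is replaced by "NN" and thus
    left out of both numerator and denominator. This masking rule is an
    assumption for illustration, not necessarily the one used in the study.
    """
    seq = seq.upper()
    if mask_cpg:
        # Replacing (rather than deleting) avoids creating new CG pairs.
        seq = seq.replace("CG", "NN")
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        return 0.0
    gc_count = sum(1 for base in counted if base in "GC")
    return 100.0 * gc_count / len(counted)


# Toy sequence (hypothetical, not taken from the data files).
toy = "GGCCGGGCGCAATT"
print(gc_percent(toy))
print(gc_percent(toy, mask_cpg=True))
```

    Running the function over every source element, once with mask_cpg=False and once with mask_cpg=True, would give the two GC % columns that the ranking and correlation steps operate on.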

Observation with CpG dinucleotides included (Spearman’s Rho and Kendall’s Tau): See Figure 7

1. The oldest Alu family, Alu J, shows a nearly strong positive correlation with GC %. As more copies were made from this master gene, the copies became more GC-rich than the original Alu.
2. The Alu S family shows a very weak negative correlation with GC richness; however, since the correlation is very weak, it may have been selected against by natural selection.
3. The Alu Y family shows an extremely weak negative correlation with GC richness, suggesting that equilibrium has been reached.
4. A decent positive correlation (0.26) exists between Alu J and Alu S, indicating a trend towards GC-rich Alus.
5. A continuing decent positive trend (0.48) between Alu S and Alu Y reconfirms a stronger evolution towards GC-rich Alus.
6. A very strong positive trend (0.57) between the oldest Alu (Alu J) and the youngest Alu (Alu Y) shows that younger Alus are significantly more GC-rich than their older counterparts.
7. The same strong positive correlation (0.57) is observed when all three Alu families are considered together, indicating that Alus have evolved to be GC-rich.
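
For readers unfamiliar with the two rank statistics named above, here is a minimal pure-Python sketch of Spearman's rho and Kendall's tau (the tau-b variant, which handles ties). It is only an illustration of how such coefficients can be computed. The statistical package used in the study is not stated on this page, and the age ranks and GC values below are invented toy numbers.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant and discordant pairs, tie-corrected."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both totals
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + only_y_tied
    untied_in_y = concordant + discordant + only_x_tied
    return (concordant - discordant) / sqrt(untied_in_x * untied_in_y)


# Toy data (hypothetical): age rank of five elements and their GC %.
age_rank = [1, 2, 3, 4, 5]
gc = [52.0, 50.5, 54.1, 51.3, 56.0]
print(spearman_rho(age_rank, gc), kendall_tau_b(age_rank, gc))
```

Both coefficients depend only on the ordering of the values, which is why the elements are ranked before the analysis.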

 

Observation with CpG dinucleotides included (R square):

1. 39% of the variance in GC in the Alu J family can be explained by variation in Age. Likewise, 39% of the variance in Age can be explained by (or goes along with) variation in GC. More simply, 39% of the variance is shared between Alu J and GC.
2. 28% of the variance in GC in the Alu S family can be explained by variation in Age. Likewise, 28% of the variance in Age can be explained by (or goes along with) variation in GC. More simply, 28% of the variance is shared between Alu S and GC.
3. 9% of the variance in GC in the Alu Y family can be explained by variation in Age. Likewise, 9% of the variance in Age can be explained by (or goes along with) variation in GC. More simply, 9% of the variance is shared between Alu Y and GC.
4. The variance explained within Alu families decreases from the oldest family to the youngest, so newer Alus do not differ much in GC composition.
5. 56% of the variance in GC in the Alu J and Alu S families can be explained by variation in Age. Likewise, 56% of the variance in Age can be explained by (or goes along with) variation in GC. More simply, 56% of the variance is shared between Alu J and Alu S and GC.
6. 46% of the variance in GC in the Alu S and Alu Y families can be explained by variation in Age. Likewise, 46% of the variance in Age can be explained by (or goes along with) variation in GC. More simply, 46% of the variance is shared between Alu S and Alu Y and GC.
7. 72% of the variance in GC in the Alu J and Alu Y families can be explained by variation in Age. Likewise, 72% of the variance in Age can be explained by (or goes along with) variation in GC. More simply, 72% of the variance is shared between Alu J and Alu Y and GC.
8. 62.9% of the variance in GC across all Alu families can be explained by variation in Age. Likewise, 62.9% of the variance in Age can be explained by (or goes along with) variation in GC. More simply, 62.9% of the variance is shared between Alu and GC.
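
The "likewise" statements above rest on the fact that, for a simple linear regression with one predictor, R square is the same whichever variable is treated as the response. The sketch below illustrates this with a least-squares fit. The function name r_squared and the age and GC values are assumptions made for this example and are not taken from the data file.

```python
def r_squared(x, y):
    """R square of the least-squares line that predicts y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares around the fitted line.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_res / syy


# Toy values (hypothetical): an age rank and a GC % for five elements.
age = [1, 2, 3, 4, 5]
gc = [2, 1, 4, 3, 5]

print(r_squared(age, gc))  # variance in GC explained by Age
print(r_squared(gc, age))  # variance in Age explained by GC
```

The two printed values agree up to floating-point rounding, which is what allows the same percentage to be read as variance "shared" between Age and GC.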

Miscellaneous Observations

• Hoeffding’s Distance correlation shows only weak dependencies.
• The adjusted R square follows the R square value.
• The linear regression plots speak for themselves.
• No significant difference was observed between [0] and [1]; see the plots.

• The observations with CpG masked are shown in Figure 8 and are self-explanatory.


-------------------------------------------------------------------------------------------------------------------------------------------------------------

B. Alu Insertion Detection

    This experiment was conducted in four separate steps, as shown in Figure 10. The data and algorithms for each step are given in the Algorithms section.

 

Step 1 : We initially tested a sequence to verify if it contained any Alu sequences. The data and results are provided.

Step 2 : An Alu was inserted into the sequence at a known location. The data and results are provided.

Step 3 : Another Alu was inserted at another known location. The data and results are provided.

Step 4 : Two different Alu-within-Alu events were also inserted at known locations. The data and results are provided.

 

    The tool could not recognize the Alu-within-Alu events properly; however, it did report the Alus as pieces, as shown in the Algorithms section.
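
    The four steps can be mimicked on toy data with the minimal sketch below. Exact substring search stands in for the detection tool, which is an assumption made purely for illustration since a real detector would tolerate mismatches. The helper names insert_at and find_all and all sequences are hypothetical placeholders, not real Alu elements or the data files linked from this page.

```python
def insert_at(host: str, element: str, pos: int) -> str:
    """Return `host` with `element` inserted at 0-based position `pos`."""
    return host[:pos] + element + host[pos:]


def find_all(seq: str, pattern: str) -> list:
    """Return the start positions of every exact occurrence of `pattern`."""
    hits = []
    start = seq.find(pattern)
    while start != -1:
        hits.append(start)
        start = seq.find(pattern, start + 1)
    return hits


# Toy placeholders (hypothetical): a host sequence and two "elements".
host = "T" * 20
element_a = "GGCCGGGCGC"
element_b = "AGCTCACTGC"

# Step 1: the untouched host contains no copy of the element.
step1_hits = find_all(host, element_a)

# Step 2: insert one element at a known location and look for it.
step2_seq = insert_at(host, element_a, 5)
step2_hits = find_all(step2_seq, element_a)

# Step 4: nest element_b inside element_a, then insert the nested pair.
nested = insert_at(element_a, element_b, 4)
step4_seq = insert_at(host, nested, 5)
outer_hits = find_all(step4_seq, element_a)       # full-length outer copy
inner_hits = find_all(step4_seq, element_b)       # intact inner copy
left_piece = find_all(step4_seq, element_a[:4])   # outer element, left part
right_piece = find_all(step4_seq, element_a[4:])  # outer element, right part

print(step1_hits, step2_hits, outer_hits, inner_hits, left_piece, right_piece)
```

    In this toy setting the full-length outer element is no longer found once another element has been inserted into it, while its two halves and the inner element are still found separately. This is consistent with the observation that the nested events were reported as pieces.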


-------------------------------------------------------------------------------------------------------------------------------------------------------------

C. Alu-within-Alu search

    In this experiment we performed a whole-genome search for three Alu-within-Alu events, as shown in Figures 11 and 12 and in the table provided in the Figures section. The algorithm used to search for the Alus in the dump files is included in the Algorithms section.
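
    One way such a search could be organized is sketched below: given a list of annotated hits, look for an element whose two flanking neighbours belong to the same family and sit directly next to it. This is not the algorithm file linked from this page. The input format (start, end, family tuples with half-open coordinates), the function name nested_events and the max_gap parameter are all assumptions made for this example, since the layout of the dump files is not described here.

```python
from collections import Counter


def nested_events(hits, max_gap=0):
    """Count outer-inner-outer patterns among annotated hits.

    hits    : iterable of (start, end, family) tuples, half-open [start, end)
              (a hypothetical format chosen for this sketch)
    max_gap : largest gap in bases allowed between neighbouring hits
    """
    ordered = sorted(hits)
    events = Counter()
    for left, mid, right in zip(ordered, ordered[1:], ordered[2:]):
        same_flanks = left[2] == right[2]
        left_gap = mid[0] - left[1]
        right_gap = right[0] - mid[1]
        contiguous = 0 <= left_gap <= max_gap and 0 <= right_gap <= max_gap
        if same_flanks and contiguous:
            events[f"{left[2]}-{mid[2]}-{right[2]}"] += 1
    return events


# Toy annotation (hypothetical coordinates and families).
toy_hits = [
    (100, 150, "J"), (150, 450, "Y"), (450, 600, "J"),      # J-Y-J event
    (1000, 1300, "S"),                                      # isolated hit
    (2000, 2100, "S"), (2100, 2400, "Y"), (2400, 2600, "S"),  # S-Y-S event
]
print(nested_events(toy_hits))
```

    Applied per chromosome, counts of this kind would correspond to the rows of the table in the Figures section.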



3. Figures

    This section provides the figures used in the paper in three formats. The figures are supplied as supplementary material so that interested readers can download them and view the vector images in finer detail.

 

 

Figure 1 : Alu Element Structure PS   EPS   TIFF
Figure 2 : Alu family time line PS   EPS   TIFF
Figure 3 : Alu amplification models PS   EPS   TIFF
Figure 4 : Alu insertion into poly A tail PS   EPS   TIFF
Figure 5 : Alu elements co-clustering PS   EPS   TIFF
Figure 6 : Alu insertion into middle A rich region PS   EPS   TIFF
Figure 7 : Alu family wise correlation GC % PS   EPS   TIFF
Figure 8 : Alu correlations with CpG masking PS   EPS   TIFF
Figure 9 : Average CpG content of Alus PS   EPS   TIFF
Figure 10 : Alu Element detection experiment PS   EPS   TIFF
Figure 11 : Alu-within-Alu in Y chromosome PS   EPS   TIFF
Figure 12 : Alu-within-Alu occurrences (genome wide) PS   EPS   TIFF
Figure 13 : Full length Alu occurrences PS   EPS   TIFF

        The table below gives the individual counts of the Alu-within-Alu events found in each chromosome, as shown in Figure 12. The raw data for all Alu-within-Alu events found in a chromosome can be obtained by clicking on the chromosome image. Clicking on the sequence type gives the nucleotide sequences for that polymorphism. Have fun!!

 
Event    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22    X    Y
J-Y-J 1749 1106  845  656  869  875 1118  712  808  872  740  923  438  404  617  856 1118  362 1145  467  157  450  728  136
S-Y-S  625  361  315  236  337  328  430  245  269  328  265  369  166  133  222  368  409  134  473  171   79  166  280   57
S-Y-S  963  532  404  295  348  387  633  335  396  438  353  526  170  199  339  533  707  141  802  270   81  300  348   71



4. Algorithms

    This section contains the algorithm(s) and data files discussed in the Experiments section. The algorithms were used on Linux and have not been tested on any other OS. Click on the hyperlinks to download the files. Have fun!!

 

  A. Alu Progenitor Sequence: data, algorithm

  B. Alu Insertion Detection: data_1, data_2, data_3, data_4, result_1, result_2, result_3, result_4

  C. Alu-within-Alu search: algorithm




5. Further Reading (Suggested)

  1. Y-Y. Hsieh, I-P. Chan, H-I. Wang, C-C. Chang, C-W. Huang, and C-S. Lin, “PROGINS Alu sequence insertion is associated with hyperprolactinaemia but not leiomyoma susceptibility”, Clinical Endocrinology, vol. 62, 2005, pp. 492 – 497.

  2. D. Graur, “Can junk DNA be exapted?”, <http://neuron.tau.ac.il/~horn/bat7/presentations/graur.ppt>.

  3. U.S. Department of Energy Genome Programs, “Genomics and its impact on Science and Society, The Human Genome Project and Beyond”, <http://www.ornl.gov/hgmis>.

  4. J. Gilder, D.E. Krane, T.E. Doom, and M.L. Raymer, “Identifying Patterns in DNA Change”, Proceedings of the 2003 Midwest Artificial Intelligence and Cognitive Science Conference, vol. 34, April 2003, pp. 78 – 84. Columbus OH.

  5. G.B. Golding, “DNA and the revolutions of molecular evolution, computational biology, and bioinformatics”, Genome, vol. 46, 2003, pp. 930 – 935.

  6. T. Doom, M. Raymer, and D. Krane, “Bioinformatics”, IEEE Potentials, February/March 2004, pp. 24 – 27.

  7. Flash Animation of Alu amplification, <http://www.geneticorigins.org/geneticorigins/pv92/media4.html>.

  8. K. Hammarstrom, G. Westin, C. Bark, J. Zabielski, and U. Petterson, “Genes and pseudogenes for human U2 RNA. Implications for the mechanism of pseudogene formation”, Journal of Mol. Bio., vol. 179, 1984, pp. 157 – 169.

  9. D. Stoppa-Lyonnet, P.E. Carter, T. Meo, and M. Tosi, “Clusters of intragenic Alu repeats predispose the human C1 inhibitor locus to deleterious rearrangements”, Proc. Natl. Acad. Sci., vol. 87, February 1990, pp. 1551 – 1555.

  10. National Center for Biotechnology Information (NCBI), January 27, 2005, <ftp://ftp.ncbi.nih.gov/genbank/>.

     



Designed by Sridhar Ramachandran and Dr. Travis E Doom (06/01/2005)
