Repetitive Elements as Time-Series Genomic Data
Sridhar Ramachandran+, Travis Doom+, Michael Raymer+ and Dan Krane*
+Department of Computer Science and Engineering
*Department of Biological Sciences
Wright State University
{ramachandran.4, travis.doom, michael.raymer, dan.krane}@wright.edu
About Experiments Figures Algorithms Further Reading
Bioinformatics provides researchers with a scientific framework to computationally explore, analyze and manage large volumes of genomic data. This interdisciplinary field strives to determine what information is biologically important and to decipher how it is used, and it specializes in the analysis of large quantities of biological data, particularly sequence data. Knowing when, why and how genetic changes occurred would help answer several open problems in bioinformatics. Human DNA has accumulated genomic changes through mutation over time, and differences in the genomic content of different organisms hold the key to their adaptation for survival. Understanding the nature of biological diversity requires monitoring nucleotide sequences within our genome that allow detailed examination of the mode and pattern of evolution that has shaped our genetic instructions over millions of years. Sequence alignment techniques and pattern recognition methods need genetic clues, in the form of markers, to address these unanswered questions.
DNA sequences whose progenitor sequences are known can serve as genetic markers. By recording and decoding the changes in these markers, past events can be traced, which would open the door to answering several open questions in bioinformatics. Approximately 98% of the eukaryotic genome is made up of non-coding regions that do not code for proteins. These non-coding sequences are believed to be mostly useless, selfish DNA left over from past evolutionary permutations. The late Susumu Ohno coined the term ‘Junk DNA’ to describe them. Of the four major kinds of ‘Junk DNA’ (introns, pseudogenes, satellite sequences and interspersed repeats), we discuss how repetitive elements can be used as time-series data to help solve open problems in bioinformatics. The paper has been submitted for review at the BIBE 2005 conference.
Alu Progenitor Sequences Alu Insertion Detection Alu-within-Alu search
This section gives a detailed explanation of the experiments discussed in the paper and is intended as supplementary material for the paper. All material required to recreate the experiments is provided here, and the algorithms are also provided in the Algorithms section.
The three sets of experiments from the paper are given as three separate subsections, as shown above. The hyperlinked menu can be used to access the experiments directly.
-------------------------------------------------------------------------------------------------------------------------------------------------------------
This part of the study used the 213 Alu sequences provided as supplementary material at the Genome Research website and the 12 remaining sequences available in the Repbase database. The GC content (GC %) of each individual Alu source element was calculated with and without CpG masking. The 225 Alu elements were ranked for statistical analysis, and ANOVA and correlation analysis were conducted on the sequence data. The same statistical experiment was repeated with CpG dinucleotides masked on the consensus sequences. The results of this experiment are discussed below. The algorithm and data file used in this experiment are included in the Algorithms section.
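The GC % calculation described above can be sketched as follows. This is only an illustrative sketch, not the script distributed in the Algorithms section. It assumes that "CpG masking" means replacing every CG dinucleotide with N and excluding the masked positions from both the GC count and the sequence length; the function names `gc_percent` and `mask_cpg` are hypothetical.

```python
def mask_cpg(seq: str, mask_char: str = "N") -> str:
    """Replace every CG dinucleotide with two mask characters (assumed masking rule)."""
    return seq.upper().replace("CG", mask_char * 2)


def gc_percent(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T) of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    gc_count = sum(1 for b in bases if b in "GC")
    return 100.0 * gc_count / len(bases)


if __name__ == "__main__":
    # Short made-up sequence, used only to show the two calls side by side.
    example = "GGCCGGACGTTACGCA"
    print("GC % with CpG included:", gc_percent(example))
    print("GC % with CpG masked:  ", gc_percent(mask_cpg(example)))
```

Excluding masked positions from the denominator is one of several reasonable conventions; counting them as non-GC bases instead would give lower masked values.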
Observations with CpG dinucleotides included (Spearman’s Rho and Kendall’s Tau): see Figure 7
1. The oldest family, Alu J, shows a nearly strong positive correlation with GC %: as more copies were made from this master gene, the copies became more GC rich than the original Alu.
2. The Alu S family shows a very weak negative correlation with GC richness; because the correlation is so weak, it may have been selected against by natural selection.
3. Alu Y shows an extremely weak negative correlation with GC richness, suggesting that equilibrium has been reached.
4. A decent positive correlation (0.26) exists between Alu J and Alu S, indicating a trend towards GC-rich Alus.
5. A continuing decent positive trend (0.48) between Alu S and Alu Y reconfirms a stronger evolution towards GC-rich Alus.
6. A very strong positive trend (0.57) between the oldest Alu (Alu J) and the youngest Alu (Alu Y) shows that younger Alus are significantly more GC rich than their older counterparts.
7. The same strong positive correlation (0.57) is observed when all three Alu families are considered together, indicating that Alus have evolved to be GC rich.
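For readers unfamiliar with the two rank statistics named above, the following is a minimal pure-Python sketch of Spearman's rho (Pearson correlation of average ranks) and Kendall's tau-b. It is not the statistical package used for the paper, which is not named here, and the rank and GC % values in the usage example are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either variable."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # pairs tied in both variables do not enter the formula
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x_only) * (pairs + ties_y_only))


if __name__ == "__main__":
    age_rank = [1, 2, 3, 4, 5, 6]                      # hypothetical age ranks
    gc_pct = [52.1, 55.0, 54.2, 57.9, 58.3, 61.0]      # hypothetical GC % values
    print("Spearman's rho:", spearman_rho(age_rank, gc_pct))
    print("Kendall's tau-b:", kendall_tau_b(age_rank, gc_pct))
```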
Observations with CpG dinucleotides included (R square):
Each percentage below is the R square between age and GC %. It is the share of the variance in GC % that can be explained by variation in age and, equally, the share of the variance in age that can be explained by (or goes along with) variation in GC %. More simply, it is the variance shared between the two.
1. Alu J family: 39% of the variance is shared between age and GC %.
2. Alu S family: 28% of the variance is shared between age and GC %.
3. Alu Y family: 9% of the variance is shared between age and GC %.
4. The variance within Alu families is reducing with age, so newer Alus do not differ much in GC composition.
5. Alu J and Alu S families together: 56% of the variance is shared between age and GC %.
6. Alu S and Alu Y families together: 46% of the variance is shared between age and GC %.
7. Alu J and Alu Y families together: 72% of the variance is shared between age and GC %.
8. All Alu families together: 62.9% of the variance is shared between age and GC %.
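The statement that the same percentage describes both "variance in GC explained by age" and "variance in age explained by GC" follows from the symmetry of R square in simple linear regression. The sketch below illustrates this with a least-squares fit in pure Python; the function name `r_squared` and the sample values are hypothetical and are not taken from the paper's data.

```python
def r_squared(x, y):
    """R square of the least-squares line that predicts y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares around the fitted line.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_res / syy


if __name__ == "__main__":
    age = [1, 2, 3, 4, 5, 6]                       # hypothetical age ranks
    gc = [52.1, 55.0, 54.2, 57.9, 58.3, 61.0]      # hypothetical GC % values
    print("R square, GC explained by age:", r_squared(age, gc))
    print("R square, age explained by GC:", r_squared(gc, age))
```

Both calls give the same value because, with a single predictor, R square equals the squared Pearson correlation, which does not depend on which variable is treated as the response.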
Miscellaneous Observations
• Observation on Hoeffding’s Distance correlation shows weak dependencies
• Adjusted R square follows the R square value
• Linear regression plots speak for themselves
• No significant difference was observed between [0] and [1] --- check plots
• The observations with CpG dinucleotides masked are shown in Figure 8 and are self-explanatory.
-------------------------------------------------------------------------------------------------------------------------------------------------------------
This experiment was conducted in four separate steps, as shown in Figure 10. The data and algorithms for each step are given in the Algorithms section.
Step 1 : We first tested a sequence to verify whether it contained any Alu sequences. The data and results are provided.
Step 2 : An Alu was inserted into the sequence at a known location. The data and results are provided.
Step 3 : Another Alu was inserted at another known location. The data and results are provided.
Step 4 : Two different Alu-within-Alu events were also inserted at known locations. The data and results are provided.
The tool could not recognize the Alu-within-Alu events properly; however, it did report the Alus as pieces, as shown in the Algorithms section.
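The construction of the test sequences in Steps 2 to 4 can be illustrated with a small sketch. It uses short made-up strings in place of real Alu and host sequences, and an exact substring search in place of the detection tool, which works by sequence similarity rather than exact matching. The helper name `insert_at` is hypothetical.

```python
def insert_at(host: str, element: str, position: int) -> str:
    """Return host with element inserted before the base at the given 0-based position."""
    if not 0 <= position <= len(host):
        raise ValueError("position outside host sequence")
    return host[:position] + element + host[position:]


# Toy stand-ins for real sequences (assumptions for illustration only).
host = "T" * 20
outer_alu = "AAAACCCC"   # stands in for the older, outer Alu
inner_alu = "GGGG"       # stands in for the younger Alu inserted inside it

# Alu-within-Alu: the inner element lands in the middle of the outer one,
# and the nested element is then placed at a known location in the host.
nested = insert_at(outer_alu, inner_alu, 4)
sequence = insert_at(host, nested, 10)

if __name__ == "__main__":
    print("intact outer Alu present:", outer_alu in sequence)
    print("inner Alu found at:", sequence.find(inner_alu))
    print("left piece of outer Alu:", sequence[10:14])
    print("right piece of outer Alu:", sequence[18:22])
```

In this toy case the outer element no longer occurs as one contiguous match; only its two halves flanking the inner element can be found, which is consistent with a tool reporting the Alus as pieces.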
-------------------------------------------------------------------------------------------------------------------------------------------------------------
In this experiment we performed a whole-genome search for three Alu-within-Alu events, as shown in Figures 11 and 12 and in the table provided in the Figures section. The algorithm used to search for the Alus in the dump files is included in the Algorithms section.
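One way such a search could be organized is sketched below. This is not the algorithm provided in the Algorithms section, and the format of the dump files is not described here, so the sketch assumes the annotations have already been parsed into hypothetical `Hit` records (family letter, start, end) for one chromosome. It reports an event when an inner-family hit is flanked on both sides by directly adjacent hits of the same outer family, for example J-Y-J or S-Y-S.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One annotated Alu fragment (hypothetical record layout)."""
    family: str  # "J", "S" or "Y"
    start: int   # 0-based start coordinate
    end: int     # end coordinate, exclusive


def find_nested(hits: List[Hit], outer: str, inner: str, max_gap: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) spans of outer-inner-outer triples of adjacent hits.

    max_gap is the largest number of unannotated bases tolerated between
    neighbouring fragments (an assumed tuning parameter).
    """
    ordered = sorted(hits, key=lambda h: h.start)
    events = []
    for left, mid, right in zip(ordered, ordered[1:], ordered[2:]):
        families_match = left.family == outer == right.family and mid.family == inner
        gaps_ok = (0 <= mid.start - left.end <= max_gap
                   and 0 <= right.start - mid.end <= max_gap)
        if families_match and gaps_ok:
            events.append((left.start, right.end))
    return events


if __name__ == "__main__":
    # Made-up coordinates for illustration.
    example_hits = [
        Hit("J", 100, 250), Hit("Y", 250, 550), Hit("J", 550, 700),
        Hit("S", 1000, 1300),
    ]
    print("J-Y-J events:", find_nested(example_hits, "J", "Y"))
    print("S-Y-S events:", find_nested(example_hits, "S", "Y"))
```

A stricter version would also check that the two outer fragments are consecutive parts of one consensus sequence, which requires alignment coordinates that this sketch does not model.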
This section contains the figures used in the paper in three formats. The figures are provided as supplementary material along with the paper so that interested readers can download them and view the vector images in finer detail.
| Figure 1 | : Alu Element Structure | PS EPS TIFF |
| Figure 2 | : Alu family time line | PS EPS TIFF |
| Figure 3 | : Alu amplification models | PS EPS TIFF |
| Figure 4 | : Alu insertion into poly A tail | PS EPS TIFF |
| Figure 5 | : Alu elements co-clustering | PS EPS TIFF |
| Figure 6 | : Alu insertion into middle A rich region | PS EPS TIFF |
| Figure 7 | : Alu family wise correlation GC % | PS EPS TIFF |
| Figure 8 | : Alu correlations with CpG masking | PS EPS TIFF |
| Figure 9 | : Average CpG content of Alus | PS EPS TIFF |
| Figure 10 | : Alu Element detection experiment | PS EPS TIFF |
| Figure 11 | : Alu-within-Alu in Y chromosome | PS EPS TIFF |
| Figure 12 | : Alu-within-Alu occurrences (genome wide) | PS EPS TIFF |
| Figure 13 | : Full length Alu occurrences | PS EPS TIFF |
The table below gives the individual counts of the Alu-within-Alu events found in each chromosome, as shown in Figure 12. The raw data for all Alu-within-Alu events found in a chromosome can be obtained by clicking on the chromosome image. Clicking on the sequence type gives the nucleotide sequences for that polymorphism. Have fun!!
| Sequence type | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | X | Y |
| J-Y-J | 1749 | 1106 | 845 | 656 | 869 | 875 | 1118 | 712 | 808 | 872 | 740 | 923 | 438 | 404 | 617 | 856 | 1118 | 362 | 1145 | 467 | 157 | 450 | 728 | 136 |
| S-Y-S | 625 | 361 | 315 | 236 | 337 | 328 | 430 | 245 | 269 | 328 | 265 | 369 | 166 | 133 | 222 | 368 | 409 | 134 | 473 | 171 | 79 | 166 | 280 | 57 |
| S-Y-S | 963 | 532 | 404 | 295 | 348 | 387 | 633 | 335 | 396 | 438 | 353 | 526 | 170 | 199 | 339 | 533 | 707 | 141 | 802 | 270 | 81 | 300 | 348 | 71 |
This section contains the algorithm(s) and data files discussed in the Experiments section. The algorithms were used on Linux and have not been tested on any other OS. Click on the hyperlinks to download the files. Have fun!!
| A. Alu Progenitor Sequence | data | algorithm |
| B. Alu Insertion detection | data_1 | data_2 | data_3 | data_4 | result_1 | result_2 | result_3 | result_4 |
| C. Alu-within-Alu search | algorithm |
5. Further Reading (Suggested)
Y-Y. Hsieh, I-P. Chan, H-I. Wang, C-C. Chang, C-W. Huang, and C-S. Lin, “PROGINS Alu sequence insertion is associated with hyperprolactinaemia but not leiomyoma susceptibility”, Clinical Endocrinology, vol. 62, 2005, pp. 492-497.
D. Graur, “Can junk DNA be exapted?”, <http://neuron.tau.ac.il/~horn/bat7/presentations/graur.ppt>.
U.S. Department of Energy Genome Programs, “Genomics and its impact on Science and Society, The Human Genome Project and Beyond”, <http://www.ornl.gov/hgmis>.
J. Gilder, D.E. Krane, T.E. Doom, and M.L. Raymer, “Identifying Patterns in DNA Change”, Proceedings of the 2003 Midwest Artificial Intelligence and Cognitive Science Conference, vol. 34, April 2003, pp. 78-84. Columbus OH.
G.B. Golding, “DNA and the revolutions of molecular evolution, computational biology, and bioinformatics”, Genome, 46, 2003, pp. 930-935.
T. Doom, M. Raymer, and D. Krane, “Bioinformatics”, IEEE Potentials, February/March 2004, pp. 24-27.
Flash Animation of Alu amplification <http://www.geneticorigins.org/geneticorigins/pv92/media4.html>
K. Hammarstrom, G. Westin, C. Bark, J. Zabielski, and U. Petterson, “Genes and pseudogenes for human U2 RNA. Implications for the mechanism of pseudogene formation”, Journal of Mol. Bio., vol. 179, 1984, pp. 157-169.
D. Stoppa-Lyonnet, P.E. Carter, T. Meo, and M. Tosi, “Clusters of intragenic Alu repeats predispose the human C1 inhibitor locus to deleterious rearrangements”, Proc. Natl. Acad. Sci., vol. 87, February 1990, pp. 1551-1555.
National Center for Biotechnology Information (NCBI), January 27, 2005, <ftp://ftp.ncbi.nih.gov/genbank/>.
Designed by Sridhar Ramachandran and Dr. Travis E Doom (06/01/2005)