Monday, July 26, 2010

DNA sample compression test

A few months ago I had to run a test to see whether DNA sequence files (FASTA files) compress better than plain text files. DNA files contain only 4 characters (A, C, G, T), so you would expect them to compress really well compared with text files. However, the DNA code is pretty random (well, there are some exceptions where the code follows some patterns or has repetitions, but these regions are rare).
So, here are the results.


File                                       Size (uncompressed)   Size (ZIP compressed)   Size (RAR compressed)
FASTA FILE - no cumments (3.0 KB).fasta    3.0 KB                715 B                   614 B
TEXT FILE - random text (3.0 KB).txt       3.0 KB                1544 B                  1275 B
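
To put the numbers in the table above into perspective, here is a small sketch that turns them into compression ratios and compares them with a naive 2-bits-per-base packing (4 symbols need only 2 bits each). It assumes that "3.0 KB" means exactly 3072 bytes, which the table does not confirm, and the variable names are only illustrative.

```python
# Sizes in bytes, copied from the table above.
# ASSUMPTION: "3.0 KB" is taken to be exactly 3 * 1024 bytes.
ORIGINAL_BYTES = 3 * 1024

compressed = {
    ("FASTA", "ZIP"): 715,
    ("FASTA", "RAR"): 614,
    ("TEXT", "ZIP"): 1544,
    ("TEXT", "RAR"): 1275,
}


def ratio(compressed_bytes, original_bytes=ORIGINAL_BYTES):
    """Compressed size as a fraction of the original size."""
    return compressed_bytes / original_bytes


ratios = {key: ratio(size) for key, size in compressed.items()}

# Naive packing: one of 4 symbols fits in 2 bits instead of 8.
two_bit_bytes = ORIGINAL_BYTES * 2 // 8

for (kind, tool), value in ratios.items():
    print(f"{kind:5s} {tool}: {value:.1%} of the original size")
print(f"2-bit packing would need about {two_bit_bytes} bytes "
      f"({ratio(two_bit_bytes):.1%} of the original size)")
```

Under this assumption both the ZIP and the RAR result for the FASTA file come out slightly below the 2-bit packing size, which suggests the compressors are also picking up repeated stretches in the sample and not only the small alphabet.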


What I noticed later is that if you pack multiple samples together (even if they come from relatively different bacteria), the compression ratio can get better.
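
The effect of packing samples together can be tried out with a rough sketch like the one below. It is not the original experiment: it uses Python's zlib module (DEFLATE, the same family of compression that ZIP normally uses) and not the ZIP/RAR tools, and the two "samples" are synthetic, a random sequence and a copy of it with about 2% of the positions rewritten. The helper names (random_dna, mutate, deflate_size) are made up for this illustration.

```python
import random
import zlib


def random_dna(length, rng):
    """Hypothetical helper: a random sequence over A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq, rate, rng):
    """Hypothetical helper: redraw each base with probability `rate`."""
    return "".join(
        rng.choice("ACGT") if rng.random() < rate else base for base in seq
    )


def deflate_size(text):
    """Size in bytes after zlib (DEFLATE) compression at the highest level."""
    return len(zlib.compress(text.encode("ascii"), 9))


rng = random.Random(42)                 # fixed seed so runs are repeatable
sample_a = random_dna(1500, rng)        # synthetic "sample", not real data
sample_b = mutate(sample_a, 0.02, rng)  # a related but different "sample"

separate = deflate_size(sample_a) + deflate_size(sample_b)
together = deflate_size(sample_a + sample_b)

print(f"compressed separately: {separate} bytes")
print(f"compressed together:   {together} bytes")
```

When the samples share long stretches, the compressor can encode the second sample mostly as back-references into the first one, so the packed archive ends up smaller than the two separate ones. How much real samples gain depends on how much sequence they actually have in common.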




Test files used in this experiment:

a) FASTA FILE - no cumments (3.0 KB).fasta

TGGCGGCGTGCTTAACACATGCAAGTCGAACGAGAAATTCCCTGCTTGCAGGGAAGAGTAAAGTGGCGCA
CGGGTGAGTAACGCGTGGGTAACCTACCTTTGAATTCGGAATAGCCCGTCGAAAGGTGGATTAATACCGG
ATACGGTTTAAGGATCTTCGGATTTTTAAATTAAAGGTGACCTCTTCATGAAAGTTGCCGTTCATAGATG
GGCCCGCGTACCATTAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGACGATGGTTAGCTGGTCTGAGA
GGATGATCAGCCACACTGGAACTGGAACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATTTTGC
GCAATGGGGGAAACCCTGACGCAGCAACGCCGCGTGAGCGAAGAAGGCCTTCGGGTCGTAAAGCTCTGTC
AAGTGGGAAAAAAATCTTTTGATGAATAGTTAAAAGACTTGATGGTACCACTGGAGGAAGCACCGGCTAA
CTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTGTTCGGAATCACTGGGCGTAAAGAGCGT
GTAGGCGGTTTGACAAGTCAGATGTGAAAGCCCCCGGGCTCAACCCGGGAAGTGCATTTGAAACTGTCTC
ACTAGAGTATGGGAGAGGAGATTGGAATTCCTGGTGTAGAGGTGAAATTCGTAGATATCAGGAGGAACAC
CCGTGGCGAAGGCGATTCTCTGGACCAATACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATT
AGATACCCTGGTAGTCCACGCCGTAAACGATGAGAACTAGGTGTAGTGGGTATTGACCCCTGCTGTGCCG
AAGTTAACGCATTAAGTTCTCCGCCCTGGGGGAGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGG
GGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGTTTGACA
TCCTTTGACCGTCTGTGAAAGCAGATTTTTCCGGCTTTGCCGGAACAGAGTGACAGGTGCTGCATGGCTG
TCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCAGCAACGAGCGTAACCCTTGTCTTTAGTTGCCAT
TATTAAGTTAGGCACTCTAAAGAGACTGCCTCGGTTAACGGGGAGGAAGGTGGGGATGACGTCAAGTCCC
TCATGGCCTTTATATCCAGGGCTACACACGTGCTACAATGGGCTGTACAAAGGGTTGCTATCCCGCGAGG
GGGCGCTAATCCCAAAAAGCAGTTCTCAGTTCGGATTGAAGTCTGCAACTCGACTTCATGAAGGTGGAAT
CGCTAGTAATCGTGGATCAGCATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACAC
CACGAAAGTCGACTGTACCAGAAGTTGCTGGGCTAACCTTTTCGGAGGAGGCAGGTACCTAAGGTACGGC
TGGCGGCGTGCTTAACACATGCAAGTCGAACGAGAAATTCCCTGCTTGCAGGGAAGAGTAAAGTGGCGCA
CGGGTGAGTAACGCGTGGGTAACCTACCTTTGAATTCGGAATAGCCCGTCGAAAGGTGGATTAATACCGG
ATACGGTTTAAGGATCTTCGGATTTTTAAATTAAAGGTGACCTCTTCATGAAAGTTGCCGTTCATAGATG
GGCCCGCGTACCATTAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGACGATGGTTAGCTGGTCTGAGA
GGATGATCAGCCACACTGGAACTGGAACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATTTTGC
GCAATGGGGGAAACCCTGACGCAGCAACGCCGCGTGAGCGAAGAAGGCCTTCGGGTCGTAAAGCTCTGTC
AAGTGGGAAAAAAATCTTTTGATGAATAGTTAAAAGACTTGATGGTACCACTGGAGGAAGCACCGGCTAA
CTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTGTTCGGAATCACTGGGCGTAAAGAGCGT
GTAGGCGGTTTGACAAGTCAGATGTGAAAGCCCCCGGGCTCAACCCGGGAAGTGCATTTGAAACTGTCTC
ACTAGAGTATGGGAGAGGAGATTGGAATTCCTGGTGTAGAGGTGAAATTCGTAGATATCAGGAGGAACAC
CCGTGGCGAAGGCGATTCTCTGGACCAATACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATT
AGATACCCTGGTAGTCCACGCCGTAAACGATGAGAACTAGGTGTAGTGGGTATTGACCCCTGCTGTGCCG
AAGTTAACGCATTAAGTTCTCCGCCCTGGGGGAGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGG
GGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGTTTGACA
TCCTTTGACCGTCTGTGAAAGCAGATTTTTCCGGCTTTGCCGGAACAGAGTGACAGGTGCTGCATGGCTG
TCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCAGCAACGAGCGTAACCCTTGTCTTTAGTTGCCAT
TATTAAGTTAGGCACTCTAAAGAGACTGCCTCGGTTAACGGGGAGGAAGGTGGGGATGACGTCAAGTCCC
TCATGGCCTTTATATCCAGGGCTACACACGTGCTACAATGGGCTGTACAAAGGGTTGCTATCCCGCGAGG
GGGCGCTAATCCCAAAAAGCAGTTCTCAGTTCGGATTGAAGTCTGCAACTCGACTTCATGAAGGTGGAAT
CGCTAGTAATCGTGGATCAGCATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACAC
CACGAAAGTCGACTGTACCAGAAGTTGCTGGGCTAACCTTTTCGGAGGAGGCAGGTACCTAAGGTACGGC
CGGTAATTGGGGTGAAGTCGTAACAAGGTATCATTCAGTGATACTCGG


----------------------------------------------------------------------------------------------------------------

b) TEXT FILE - random text (3.0 KB).txt

Cluster (computing)
A computer cluster is a group of linked computers, working together closely so that in many respects they form a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks. Clusters are usually deployed to improve performance and/or availability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.[1]

Cluster categorizations

High-availability (HA) clusters
High-availability clusters (also known as Failover Clusters) are implemented primarily for the purpose of improving the availability of services which the cluster provides. They operate by having redundant nodes, which are then used to provide service when system components fail. The most common size for an HA cluster is two nodes, which is the minimum requirement to provide redundancy. HA cluster implementations attempt to use redundancy of cluster components to eliminate single points of failure.
There are many commercial implementations of High-Availability clusters for many operating systems. The Linux-HA project is one commonly used free software HA package for the Linux OSs.

Load-balancing clusters
Load-balancing is when multiple computers are linked together to share computational workload or function as a single virtual computer. Logically, from the user side, they are multiple machines, but function as a single virtual machine. Requests initiated from the user are managed by, and distributed among, all the standalone computers to form a cluster. This results in balanced computational work among different machines, improving the performance of the cluster system.

Compute clusters
Often clusters are used for primarily computational purposes, rather than handling IO-oriented operations such as web service or databases. For instance, a cluster might support computational simulations of weather or vehicle crashes. The primary distinction within compute clusters is how tightly-coupled the individual nodes are. For instance, a single compute job may require frequent communication among nodes - this implies that the cluster shares a dedicated network, is densely located, and probably has homogenous nodes. This cluster design is usually referred to as Beowulf Cluster. The other extreme is where a compute job uses one or few nodes, and needs little or no inter-node communication. This latter category is sometimes called "Grid" computing. Tightly-coupled compute clusters are designed for work that might traditionally have been called "supercomputing". Middleware such as MPI (Message Passing Interface) or PVM (Parallel Virtual Machine) permits compute clustering programs to be portable to a wide variety of clusters.

Grid computing
Grids are usually computer clusters, but more focused on throughput like a computing utility rather than running fewer, tightly-coupled jobs. Often, grids will incorporate heterogeneous collections of computers, possibly distributed xxxxx
