My assignment is to compress a DNA sequence, first by encoding it with A = 00, C = 01, G = 10, T = 11. I have to read the sequence in from a file and convert it to my encoding. I know I have to use the BitSet class in Java, but I'm having trouble implementing it. How do I ensure my encoding is used, rather than the letters being converted to their actual binary character codes?
This is the prompt: Develop space-efficient Java code for two kinds of compressed encodings of this file of data. (N's are to be ignored.) Convert lower case to upper case chars. Do the following and answer the questions: Credit will be awarded to both time- and space-efficient mechanisms. If your code takes too long to run, you need to rethink the design.
Encoding 1. Using two bits A:00, C:01, G:10, T:11.
(a) How many total bits are needed to represent the genome sequence? (b) How many of the total bits are 1's in the encoded sequence?
I know the logic I have to use, but the actual implementation with the BitSet class and the encoding is where I'm having issues.
You can have a look at BinCodec, which provides binary encoding/decoding procedures to convert DNA and protein sequences to and from a compact binary representation. It relies on the standard Java BitSet. Also have a look at BinCodedTest, which shows how to use these APIs.
I've made an example below of how you can convert the 'C' letter into bits. So for the "CCCC" string it should print "01010101".
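A minimal sketch of this idea, not necessarily the code the answer originally showed. The class and method names (`TwoBitDemo`, `encode`, `toBitString`) are made up for illustration. Each base gets two consecutive bit positions in the BitSet, and only the 1 bits need to be set, since a new BitSet starts out all zeros.

```java
import java.util.BitSet;

public class TwoBitDemo {
    // Encode with A=00, C=01, G=10, T=11: base number n uses bits 2n and 2n+1.
    static BitSet encode(String dna) {
        BitSet bits = new BitSet(dna.length() * 2);
        int n = 0; // number of bases written so far
        for (int i = 0; i < dna.length(); i++) {
            int code;
            switch (Character.toUpperCase(dna.charAt(i))) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': code = 3; break;
                default: continue; // skip N and any other character
            }
            if ((code & 2) != 0) bits.set(2 * n);     // first (high) bit of the pair
            if ((code & 1) != 0) bits.set(2 * n + 1); // second (low) bit of the pair
            n++;
        }
        return bits;
    }

    // Render the first totalBits bits as a string of '0' and '1' characters.
    static String toBitString(BitSet bits, int totalBits) {
        StringBuilder sb = new StringBuilder(totalBits);
        for (int i = 0; i < totalBits; i++) {
            sb.append(bits.get(i) ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBitString(encode("CCCC"), 8)); // prints 01010101
    }
}
```

The bit count is passed to `toBitString` explicitly because `BitSet.length()` only reaches up to the highest set bit, so trailing A's (00) would otherwise be dropped from the output.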
I believe all you need to understand from the BitSet in order to do your assignment are the methods: set, clear and get. Hope it helps.
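For the file-reading part and the two questions in the prompt, here is one possible sketch. It assumes the file contains only sequence characters (no header lines), and the class name `GenomeBitStats` and its members are hypothetical. It streams the input one character at a time, so the raw text is never held in memory. It then answers (a) with twice the number of encoded bases and (b) with `BitSet.cardinality()`, which counts the bits set to 1.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.BitSet;

public class GenomeBitStats {
    private final BitSet bits = new BitSet();
    private long bases = 0; // number of A/C/G/T characters encoded so far

    // Read characters one at a time; encode A/C/G/T and ignore everything else.
    void encode(Reader in) {
        try {
            int c;
            while ((c = in.read()) != -1) {
                // Position in "ACGT" is the 2-bit code: A=0, C=1, G=2, T=3.
                int code = "ACGT".indexOf(Character.toUpperCase((char) c));
                if (code < 0) continue; // N, newlines, other characters
                if (bases >= Integer.MAX_VALUE / 2) {
                    // BitSet indices are ints, so one BitSet cannot hold more.
                    throw new IllegalStateException("sequence too long for one BitSet");
                }
                int pos = (int) (2 * bases);
                bits.set(pos, (code & 2) != 0);     // high bit
                bits.set(pos + 1, (code & 1) != 0); // low bit
                bases++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    long totalBits() { return 2 * bases; }       // question (a)
    int oneBits() { return bits.cardinality(); } // question (b)

    public static void main(String[] args) throws IOException {
        GenomeBitStats stats = new GenomeBitStats();
        // args[0] is assumed to be the path of the sequence file.
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]))) {
            stats.encode(in);
        }
        System.out.println("Total bits: " + stats.totalBits());
        System.out.println("Bits set to 1: " + stats.oneBits());
    }
}
```

Because BitSet is indexed by `int`, a single instance holds roughly a billion bases at most. A longer sequence would need several BitSets or a `long[]` with manual bit manipulation.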
Welcome to StackOverflow! Please look at the Forward Genetic simulator that is being developed on GitHub. It contains a BitSetDNASequence class that may be helpful for creating your BitMask. Of course, it will serve more as a guideline than a 1:1 solution to your problem, but it may well get you up to speed.