In the following data, I am trying to run a simple Markov model.
Say I have data with the following structure:
pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
Block M represents data from one set of categories, and so does block S.
The data are strings, which are made by connecting the letters down each column, one per position line. So the string value for M1 is A-T-C-G, and every other column is read the same way.
There is also one hybrid block that has two strings, which are read in the same way. My question is: which string in the hybrid block most likely came from which block (M vs. S)?
I am trying to build a Markov model which can help me identify which string in the hybrid block came from which block. In this example I can tell that, in the hybrid block, ATCG came from block M and CAGT came from block S.
I am breaking the problem into different parts to read and mine the data:
Problem level 01:
- First I read the first line (the header) and create unique keys for all the columns.
- Then I read the 2nd line (pos with value 1) and create another key. In the same line I read the value from hybrid_block and read the string values in it. The pipe | is just a separator, so the two strings are at index 0 and 2, as A and C.
So, all I want from this line is:
defaultdict(<class 'dict'>, {'M1': ['A'], 'M2': ['T'], 'M3': ['T']...., 'hybrid_block': ['A'], ['C']...}
As I progress through the lines, I want to append the string values from each column and finally create:
defaultdict(<class 'dict'>, {'M1': ['A', 'T', 'C', 'G'], 'M2': ['T', 'G', 'A', 'T'], 'M3': ['T', 'C', 'A', 'G']...., 'hybrid_block': ['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']...}
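A minimal sketch of this first level, assuming the file is whitespace-separated exactly as shown above. The function name read_columns is hypothetical, the pos column is skipped rather than stored, and the two hybrid strings are kept as a pair of lists under the single hybrid_block key (one possible reading of the desired output).

```python
import io
from collections import defaultdict

# Sample data copied from the question; a real file handle would work the same way.
raw = """\
pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
"""


def read_columns(handle):
    """Collect the letters of every column, top to bottom (hypothetical helper)."""
    header = next(handle).split()  # the header gives the keys
    columns = defaultdict(list)
    for line in handle:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        for name, value in zip(header, fields):
            if name == "pos":
                continue  # assumption: the position itself is not needed as data
            if name == "hybrid_block":
                # "A|C" -> "A" and "C"; one list per hybrid string
                left, right = value.split("|")
                pair = columns.setdefault(name, [[], []])
                pair[0].append(left)
                pair[1].append(right)
            else:
                columns[name].append(value)
    return columns


columns = read_columns(io.StringIO(raw))
print(columns["M1"])            # ['A', 'T', 'C', 'G']
print(columns["hybrid_block"])  # [['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']]
```

A defaultdict(list) is used here instead of defaultdict(dict) because each key maps to a growing list of letters.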
Problem level 02:
I read the data in hybrid_block for the first line, which are A and C. Now I want to create keys, but unlike the fixed keys, these keys will be generated while reading the data from hybrid_block. For the first line, since there is no preceding line, the keys will simply be AgA and CgC, which mean (A given A, and C given C), and for the values I count the number of A (and, for the other key, C) in block M and block S. So the data will be stored as:
defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'CgC': 1}, 'S': {'AgA': 2, 'CgC': 2}})
As I read through the other lines, I want to create new keys based on the strings in the hybrid block, and count the number of times each string was present in the M vs. S block given the string in the preceding line. That means the keys while reading line 2 would be TgA, which means (T given A), and AgC. For the value inside each key I count the number of times I found T in this line after A in the previous line, and the same for AgC.
The defaultdict after reading 3 lines would be:
defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'TgA':3, 'CgT':2}, {'CgC': 1, 'AgC':0, 'GgA':0}, 'S': {'AgA': 2, 'TgA':1, 'CgT':0}, {'CgC': 2, 'AgC':2, 'GgA':2}})
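The counting described above could be sketched in plain Python roughly as follows. This is only one interpretation: the name count_transitions is hypothetical, blocks are picked out by the first letter of the column name, and the two hybrid strings get separate dictionaries stored in a two-element list per block. It produces raw counts only, not transition probabilities.

```python
from collections import defaultdict

# Sample data copied from the question.
raw = """\
pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
"""


def count_transitions(text):
    """Count, per block, how often each hybrid transition occurs (hypothetical helper)."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    hyb = header.index("hybrid_block")
    # Assumption: a column belongs to block M or S according to its name prefix.
    blocks = {
        "M": [i for i, name in enumerate(header) if name.startswith("M")],
        "S": [i for i, name in enumerate(header) if name.startswith("S")],
    }
    # One dict of counts per hybrid string (index 0 and 1) in each block.
    counts = {b: [defaultdict(int), defaultdict(int)] for b in blocks}
    prev_row = None
    for row in rows:
        current = row[hyb].split("|")
        # The first line has no predecessor, so each letter is "given itself".
        previous = current if prev_row is None else prev_row[hyb].split("|")
        for k, (cur, prev) in enumerate(zip(current, previous)):
            key = f"{cur}g{prev}"  # e.g. "TgA" means T given A
            for block, idxs in blocks.items():
                # Columns showing `cur` on this line and `prev` on the line before.
                n = sum(
                    1
                    for i in idxs
                    if row[i] == cur and (prev_row is None or prev_row[i] == prev)
                )
                counts[block][k][key] += n
        prev_row = row
    return counts


counts = count_transitions(raw)
print(dict(counts["M"][0]))
print(dict(counts["S"][1]))
```

If the same key comes up again at a later position, its counts are added together instead of overwritten. For a longer sequence this may or may not be the behaviour you want.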
I understand this looks too complicated. I went through several dictionary and defaultdict tutorials but couldn't find a way of doing this.
A solution to any part, if not both, is highly appreciated.
pandas
setup
solution
- mostly pandas with some numpy
- 'AgA' type strings
Assign convenient blocks to their own variable names
Count how many are in each block and concatenate
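A minimal pandas sketch of the two steps named above (assigning the blocks, then counting and concatenating). It is one possible approach under stated assumptions, not necessarily the code that was originally posted: the variable names M, S, H and the helper count_block are hypothetical, and the first line is treated as "given itself" to match the question's AgA / CgC convention.

```python
import io

import pandas as pd

# Sample data copied from the question.
raw = """\
pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", index_col="pos")

# Assign convenient blocks to their own variable names.
M = df.filter(regex=r"^M\d+$")
S = df.filter(regex=r"^S\d+$")
H = df["hybrid_block"].str.split("|", expand=True)  # columns 0 and 1

# Previous letter of each hybrid string; the first line falls back to itself.
prev_H = H.shift().fillna(H)
keys = H + "g" + prev_H  # 'AgA' type strings


def count_block(block, k):
    """Per line: columns of `block` matching hybrid string k now and one line before."""
    cur_match = block.eq(H[k], axis=0)
    prev_match = block.shift().fillna(block).eq(prev_H[k], axis=0)
    return (cur_match & prev_match).sum(axis=1)


# Count how many are in each block and concatenate.
frames = [
    pd.DataFrame(
        {"key": keys[k], "M": count_block(M, k), "S": count_block(S, k)}
    ).assign(hybrid=k)
    for k in H.columns
]
result = pd.concat(frames)
print(result)
```

The result has one row per position and hybrid string, with the key and its count in M and in S side by side.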
If you really want a dictionary
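A small sketch of the conversion, assuming the counts are held in a DataFrame with one row per key and one column per block. The frame below is typed in by hand from the expected values in the question, purely for illustration.

```python
from collections import defaultdict

import pandas as pd

# Counts for the first three lines, copied from the question's expected output.
counts = pd.DataFrame(
    {"M": [4, 3, 2, 1, 0, 0], "S": [2, 1, 0, 2, 2, 2]},
    index=["AgA", "TgA", "CgT", "CgC", "AgC", "GgA"],
)

# The default orientation gives {column: {index: value}}, i.e. {block: {key: count}}.
as_dict = counts.to_dict()

# Wrap it in a defaultdict only if missing blocks should default to an empty dict.
as_defaultdict = defaultdict(dict, as_dict)
print(as_defaultdict["M"])
```

Note that this flattens the keys of both hybrid strings into one dictionary per block. Keeping them apart would need an extra level, for example by grouping on the hybrid string first.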