How to read two lines from a file and create dynam

Posted 2020-02-12 08:24

I am trying to run a simple Markov model on the following data.

Say I have data with the following structure:

pos   M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block    S1    S2    S3    S4  S5  S6  S7  S8
1     A   T   T   A   A   G   A   C       A|C         C     G     C     T    T   A   G   A
2     T   G   C   T   G   T   T   G       T|A         A     T     A     T    C   A   A   T
3     C   A   A   C   A   G   T   C       C|G         G     A     C     G    C   G   C   G
4     G   T   G   T   A   T   C   T       G|T         C     T     T     T    A   T   C   T 

Block M represents data from one set of categories, and so does block S.

The data are strings made by connecting the letters down each column, position by position. So, the string value for M1 is A-T-C-G, and likewise for every other column.

There is also one hybrid block that holds two strings, which are read in the same way. The question is: which string in the hybrid block most likely came from which block (M vs. S)?

I am trying to build a Markov model that can help me identify which string in the hybrid block came from which block. In this example I can tell that in the hybrid block ATCG came from block M and CAGT came from block S.

I am breaking the problem into different parts to read and mine the data:

Problem level 01:

  • First I read the first line (the header) and create unique keys for all the columns.
  • Then I read the 2nd line (pos with value 1) and create another key. From the same line I take the value of hybrid_block and read the string values in it. The pipe | is just a separator, so the two strings sit at index 0 and 2, here A and C. So, all I want from this line is a

defaultdict(<class 'dict'>, {'M1': ['A'], 'M2': ['T'], 'M3': ['T']...., 'hybrid_block': ['A'], ['C']...}

As I progress through the lines, I want to append the string values from each column and finally create:

defaultdict(<class 'dict'>, {'M1': ['A', 'T', 'C', 'G'], 'M2': ['T', 'G', 'A', 'T'], 'M3': ['T', 'C', 'A', 'G']...., 'hybrid_block': ['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']...}
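Problem level 01 can be sketched in plain Python roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `read_columns` and the keys `hybrid_0` and `hybrid_1` are assumptions made for illustration (the two hybrid strings are stored under separate keys because a dict cannot hold two lists under one key without nesting them).

```python
from collections import defaultdict
from io import StringIO

SAMPLE = """pos   M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block    S1    S2    S3    S4  S5  S6  S7  S8
1     A   T   T   A   A   G   A   C       A|C         C     G     C     T    T   A   G   A
2     T   G   C   T   G   T   T   G       T|A         A     T     A     T    C   A   A   T
3     C   A   A   C   A   G   T   C       C|G         G     A     C     G    C   G   C   G
4     G   T   G   T   A   T   C   T       G|T         C     T     T     T    A   T   C   T
"""

def read_columns(handle):
    """Collect the letters of every column into a list, line by line."""
    header = next(handle).split()          # first line: column names
    columns = defaultdict(list)
    for line in handle:
        fields = line.split()
        if not fields:                     # skip blank lines
            continue
        for name, value in zip(header, fields):
            if name == 'pos':              # the position itself is not stored
                continue
            if name == 'hybrid_block':     # 'A|C' -> 'A' and 'C'
                left, right = value.split('|')
                columns['hybrid_0'].append(left)    # hypothetical key name
                columns['hybrid_1'].append(right)   # hypothetical key name
            else:
                columns[name].append(value)
    return columns

# StringIO stands in for an open file; open('data.txt') would work the same way
columns = read_columns(StringIO(SAMPLE))
print(columns['M1'], columns['hybrid_0'], columns['hybrid_1'])
```

Using `defaultdict(list)` means each column's list is created the first time the column is seen, so no keys need to be set up in advance.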

Problem level 02:

  • I read the data in hybrid_block for the first line which are A and C.

  • Now I want to create keys, but unlike the fixed keys above, these keys are generated while reading the data from hybrid_block. For the first line, since there is no preceding line, the keys will simply be AgA and CgC (A given A, and C given C), and for the values I count the number of A (and likewise C) in block M and block S. So, the data will be stored as:

defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'CgC': 1}, 'S': {'AgA': 2, 'CgC': 2}})

As I read through the other lines, I want to create new keys based on the strings in the hybrid block, and count the number of times each string was present in the M vs. S block given the string in the preceding line. That means the keys while reading line 2 would be TgA (T given A) and AgC (A given C). For the values under these keys, I count the number of times I found T in this line after A in the previous line, and the same for AgC.

The defaultdict after reading 3 lines would be.

defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'TgA':3, 'CgT':2}, {'CgC': 1, 'AgC':0, 'GgA':0}, 'S': {'AgA': 2, 'TgA':1, 'CgT':0}, {'CgC': 2, 'AgC':2, 'GgA':2}})
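The counting described in problem level 02 could be sketched without pandas along these lines. The function name `transition_counts` and the labels `H0` and `H1` for the two hybrid strings are assumptions for illustration. Each entry is stored as a `(key, count)` tuple in a list per hybrid string, which is a design choice made here so that a key that occurs at more than one position is not overwritten.

```python
from collections import defaultdict

SAMPLE = """pos   M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block    S1    S2    S3    S4  S5  S6  S7  S8
1     A   T   T   A   A   G   A   C       A|C         C     G     C     T    T   A   G   A
2     T   G   C   T   G   T   T   G       T|A         A     T     A     T    C   A   A   T
3     C   A   A   C   A   G   T   C       C|G         G     A     C     G    C   G   C   G
4     G   T   G   T   A   T   C   T       G|T         C     T     T     T    A   T   C   T
"""

def transition_counts(text):
    """Count, per block, the columns that show the same 'X given Y' step as each hybrid string."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    header, records = lines[0], lines[1:]
    blocks = {
        'M': [i for i, name in enumerate(header) if name.startswith('M')],
        'S': [i for i, name in enumerate(header) if name.startswith('S')],
    }
    h_idx = header.index('hybrid_block')

    counts = defaultdict(lambda: defaultdict(list))
    prev = records[0]                      # the first line is paired with itself
    for rec in records:
        current = rec[h_idx].split('|')
        before = prev[h_idx].split('|')
        for k, (cur, old) in enumerate(zip(current, before)):
            key = f'{cur}g{old}'           # e.g. 'TgA' means T given A
            for block, idx in blocks.items():
                # columns with `cur` on this line and `old` on the previous line
                n = sum(1 for i in idx if rec[i] == cur and prev[i] == old)
                counts[block][f'H{k}'].append((key, n))
        prev = rec
    return counts

counts = transition_counts(SAMPLE)
print(dict(counts['M']))
print(dict(counts['S']))
```

On the sample data this gives, for example, 4 for `AgA` and 3 for `TgA` in block M, matching the values listed in the question.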

I understand this looks too complicated. I went through several dictionary and defaultdict tutorial but couldn't find a way of doing this.

Solution to any part if not both is highly appreciated.

1 answer
Explosion°爆炸
#2 · 2020-02-12 09:28

pandas setup

from io import StringIO
import pandas as pd
import numpy as np

txt = """pos   M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block    S1    S2    S3    S4  S5  S6  S7  S8
1     A   T   T   A   A   G   A   C       A|C         C     G     C     T    T   A   G   A
2     T   G   C   T   G   T   T   G       T|A         A     T     A     T    C   A   A   T
3     C   A   A   C   A   G   T   C       C|G         G     A     C     G    C   G   C   G
4     G   T   G   T   A   T   C   T       G|T         C     T     T     T    A   T   C   T """

# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(StringIO(txt), sep=r'\s+', index_col='pos')

df


solution

mostly pandas with some numpy


  • split hybrid column
  • prepend identical first row
  • add with shifted version of self to get 'AgA' type strings

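# Split the hybrid column into H0/H1 and place it between the M and S blocks
d1 = pd.concat([
        df.filter(like='M'),
        df.hybrid_block.str.split('|', expand=True).rename(columns='H{}'.format),
        df.filter(like='S')
    ], axis=1)

# Prepend a copy of the first row (as pos 0) so that pos 1 is paired with itself
d1 = pd.concat([d1.loc[[1]].rename(index={1: 0}), d1])
# Build '<current>g<previous>' labels; the padded row 0 becomes NaN and is dropped
d1 = d1.add('g').add(d1.shift()).dropna()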

d1


Assign convenient blocks to their own variable names

m = d1.filter(like='M')
s = d1.filter(like='S')
h = d1.filter(like='H')

Count how many are in each block and concatenate

mcounts = pd.DataFrame(
    (m.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)
scounts = pd.DataFrame(
    (s.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)

counts = pd.concat([mcounts, scounts], axis=1, keys=['M', 'S'])
counts
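The comparison inside `mcounts` and `scounts` relies on NumPy broadcasting. A reduced sketch, using only the M labels and hybrid labels of the first two positions, may make the shapes easier to follow; the array names below are illustrative.

```python
import numpy as np

# Labels for positions 1 and 2: 8 M columns and 2 hybrid columns (H0, H1)
m = np.array([
    ['AgA', 'TgT', 'TgT', 'AgA', 'AgA', 'GgG', 'AgA', 'CgC'],
    ['TgA', 'GgT', 'CgT', 'TgA', 'GgA', 'TgG', 'TgA', 'GgC'],
], dtype=object)
h = np.array([
    ['AgA', 'CgC'],
    ['TgA', 'AgC'],
], dtype=object)

# m[:, :, None] has shape (2, 8, 1) and h[:, None, :] has shape (2, 1, 2);
# broadcasting compares every M column with every hybrid column, row by row
matches = m[:, :, None] == h[:, None, :]      # shape (2, 8, 2)

# Summing over axis 1 (the M columns) leaves one count per position and hybrid column
per_hybrid = matches.sum(axis=1)              # shape (2, 2)
print(per_hybrid.tolist())
```

Each row of the result is one position, and each column is one hybrid string, which is why the answer can wrap it directly in a DataFrame with `h.index` and `h.columns`.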


If you really want a dictionary

from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))

dict_df = counts.stack().join(h.stack().rename('condition')).unstack()
for pos, row in dict_df.iterrows():
    d['M']['H0'].append((row.loc[('condition', 'H0')], row.loc[('M', 'H0')]))
    d['S']['H0'].append((row.loc[('condition', 'H0')], row.loc[('S', 'H0')]))
    d['M']['H1'].append((row.loc[('condition', 'H1')], row.loc[('M', 'H1')]))
    d['S']['H1'].append((row.loc[('condition', 'H1')], row.loc[('S', 'H1')]))

dict(d)

{'M': defaultdict(list,
             {'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
              'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]}),
 'S': defaultdict(list,
             {'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
              'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]})}