I have a huge .csv
file like this :
Transcript Id Gene Id(name) Mirna Name miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1
UTR3 21:30717114-30717142 0.05994568
UTR3 21:30717414-30717442 0.13591267
ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1
UTR3 6:105526681-105526709 0.133514751
and I want to build a matrix like this from it :
Transcript Id Gene Id(name) Mirna Name miTG score UTR3 MRE_score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 21:30717414-30717442 0.13591267
I want to add three new columns into my new matrix called UTR3
, MRE_score
and CDS
.
For every Gene ID
(for example ENST00000286800
), there are several UTR3
in the original matrix (here two UTR3
's for ENST00000286800
, and one UTR3
for ENST00000345080
) we choose the UTR3
with the highest score in the third column. In the new matrix, the value of UTR3
for every Gene ID
will be the value of UTR3
in the second column of the original matrix.
Can any body help me to reshape this data and build my new matrix?
Using this test data:
read it all in and set the names,
nms
for the output. Then calculate the grouping vector,cs
, using a cumulative sum. non-duplicates are the first row of each group and duplicates are the following rows. Merge these two sets of rows by group and extract out the highestMRE_score
in each group:Here we get:
Note that the question refers to a
CDS
column but what it is is not described nor does it appear in the example output so we ignored it.You could try to structure the CSV using regular expressions: