I am currently writing a script that analyses some genetic data and then produces the output as a coloured Word document. The script works; however, one method in it is badly written: the method that creates the Word document.
The method writes a standalone HTML file, which is then saved with a 'docx' extension. This lets me give different parts of the document different styles.
Below is the bare minimum to get this to work. It includes some sample input data which would be created in a different method just before the final step and stored in a hash, and the necessary methods.
require 'bio'

def make_hash(input_file)
  input_read = Hash.new
  biofastafile = Bio::FlatFile.open(Bio::FastaFormat, input_file)
  biofastafile.each_entry do |entry|
    input_read[entry.definition] = entry.aaseq
  end
  return input_read
end

def to_doc(hash, output, motif)
  output_file = File.new(output, "w")
  output_file.puts "<!DOCTYPE html><html><head><style> .id{font-weight: bold;} .signalp{color:#000099; font-weight: bold;} .motif{color:#FF3300; font-weight: bold;} h3 {word-wrap: break-word;} p {word-wrap: break-word; font-family:Courier New, Courier, Mono;}</style></head><body>"
  hash.each do |id, seq|
    sequence = seq.to_s.gsub("\[\"", "").gsub("\"\]", "")
    id.scan(/(\w+)(.*)/) do |id_start, id_end|
      output_file.puts "<p><span class=\"id\"> >#{id_start}</span><span>#{id_end}</span><br>"
      output_file.puts "<span class=\"signalp\">"
      sequence.scan(/(\w+)-(\w+)/) do |signalp, seq_end|
        output_file.puts signalp + "</span>" + seq_end.gsub(/#{motif}/, '<span class="motif">\0</span>')
        output_file.puts "</p>"
      end
    end
  end
  output_file.puts "</body></html>"
  output_file.close
end

hash = make_hash("./sample.txt")
to_doc(hash, "output.docx", "WL|KK|RR|KR|R..R|R....R")
This is some sample data. In reality, when analysing the genetic data from a species, this can be made up of hundreds of thousands of sequences:
>isotig00001_f4_14 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00001_f4_15 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00003_f6_8 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00003_f6_9 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00004_f6_8 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00004_f6_9 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00009_f2_3 - Signal P Cleavage Site => 22:23
MLKCFSIIMGLILLLEIGGGCA-IYFYRAQIQAQFQKSLTDVTITDYRENADFQDLIDALQSGLSCCGVNSYEDWDNNIYFNCSGPANNPEALWCAFLLLYTGSSKRSSQHPVRLWSSFPRTTKYFPHKDLHHWLCGYVYNVD
>isotig00009_f3_9 - Signal P Cleavage Site => 16:17
MKTGIIIFISTVVVLP-ITLKPCGVPFSCCIPDQASGVANTQCGYGVRSPEQQNTFHTKIYTTGCADMFTMWINRYLYYIAGIAGVIVLVELFGFCFAHSLINDIKRQKARWAHR
>isotig00009_f6_13 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00009_f6_14 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
Each read is made of two parts: the seq id (the line starting with a >) and the sequence. This is split and stored in a hash in the make_hash method.
This example:
>isotig00001_f4_14 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
Is made up of:
>isotig00001_f4_14 (the first part of the id - class="id")
Signal P Cleavage Site => 11:12 (the second part of the id - normal writing)
(new line)
MMHLLCIVLLL (first part of the sequence - class="signalp")
KW WL LL (the second part of the sequence - the motif WL will be class="motif")
In HTML it would produce:
<p>
<span class="id"> >isotig00001_f4_14</span><span>Signal P Cleavage Site => 11:12</span>
<br>
<span class="signalp">MMHLLCIVLLL</span><span>KW</span><span class="motif">WL</span><span>LL</span>
</p>
Basically, I would like to rewrite the to_doc method using a proper HTML templating library such as Slim, Haml, Nokogiri or ERB. I have tried to get this done.
For some reason, a loop within a loop didn't work, and creating a global variable to store these values didn't work either.
The script above works; just save the sample data as "sample.txt" and then run the script.
I would be highly grateful for any help.
Here's a starting point:
require 'haml'
haml_doc = <<EOT
%html
  %head
    :css
      .id {font-weight: bold;}
      .signalp {color:#000099; font-weight: bold;}
      .motif {color:#FF3300; font-weight: bold;}
      h3 {word-wrap: break-word;}
      p {word-wrap: break-word; font-family:Courier New, Courier, Mono;}
  %body
EOT
engine = Haml::Engine.new(haml_doc)
puts engine.render
Which outputs this when run:
<html>
  <head>
    <style>
      .id {font-weight: bold;}
      .signalp {color:#000099; font-weight: bold;}
      .motif {color:#FF3300; font-weight: bold;}
      h3 {word-wrap: break-word;}
      p {word-wrap: break-word; font-family:Courier New, Courier, Mono;}
    </style>
  </head>
  <body></body>
</html>
From there, you can easily write to a file using:
File.write(output, engine.render)
instead of using puts to output it to the console.
To use this, you need to flesh out the haml_doc with additional Haml that loops over your input data. Before that, massage the data into an array or hash that you can iterate over cleanly, without embedding all sorts of scan and conditional logic in the template. A view should be primarily used to output content, not manipulate data.
Just above the engine = Haml... line you'd want to read your input data, massage it, and store it in a variable that Haml can iterate over. You have the basic idea in your original code, but instead of trying to output HTML, create an object or sub-hash that you can pass to Haml.
Normally this would all be split into separate files for the model, the view and the controller, like in Rails or big Sinatra apps, but this really isn't a big app, so you can put it all in one file. Keep your logic clean and it'll be fine.
Without sample input data and an expected output it's hard to do more, but that'll give you a starting point.
Based on the data samples, here's something that gets you in the ballpark. I won't polish it because, after all, you have to do some of it, but this is a reasonable start. The first part mocks up something reasonably like the Bio library you reference in your code, which I've never seen. You don't need this part, but might want to look through it:
module Bio
  FastaFormat = 1
  SAMPLE_DATA = <<-EOT
>isotig00001_f4_14 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00001_f4_15 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00003_f6_8 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00003_f6_9 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00004_f6_8 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00004_f6_9 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00009_f2_3 - Signal P Cleavage Site => 22:23
MLKCFSIIMGLILLLEIGGGCA-IYFYRAQIQAQFQKSLTDVTITDYRENADFQDLIDALQSGLSCCGVNSYEDWDNNIYFNCSGPANNPEALWCAFLLLYTGSSKRSSQHPVRLWSSFPRTTKYFPHKDLHHWLCGYVYNVD
>isotig00009_f3_9 - Signal P Cleavage Site => 16:17
MKTGIIIFISTVVVLP-ITLKPCGVPFSCCIPDQASGVANTQCGYGVRSPEQQNTFHTKIYTTGCADMFTMWINRYLYYIAGIAGVIVLVELFGFCFAHSLINDIKRQKARWAHR
>isotig00009_f6_13 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00009_f6_14 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
EOT
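  class FlatFile
    # Enumerable provides map etc. on top of each.
    include Enumerable

    class Entry
      attr_reader :definition, :aaseq

      def initialize(definition, aaseq)
        @definition = definition
        @aaseq = aaseq
      end
    end

    def initialize(entries)
      @sample_data = entries
    end

    # Mock only: ignores both arguments and always serves SAMPLE_DATA.
    def self.open(filetype, filename)
      new(SAMPLE_DATA.split("\n").each_slice(2).map{ |seq_id, sequence| Entry.new(seq_id, sequence) })
    end

    def each_entry
      @sample_data.each do |_entry|
        yield _entry
      end
    end
    alias each each_entry
  end
end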
Here's where the fun begins. I modified your make_hash routine to parse the strings how I'd do it. Instead of a hash, it returns an array of hashes. Each sub-hash is ready to be used; in other words, the data is parsed and ready to be output:
include Bio

def make_array_of_hashes(input_file)
  Bio::FlatFile.open(
    Bio::FastaFormat,
    input_file
  ).map { |entry|
    id_start, id_end = entry.definition.split('-').map(&:strip)
    signalp, seq_end = entry.aaseq.split('-')
    motif = seq_end.scan(/(?:WL|KK|RR|KR|R..R|R....R)/)
    {
      :id_start => id_start,
      :id_end => id_end,
      :signalp => signalp,
      :motif => motif
    }
  }
end
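Note that the routine above keeps only the matched motifs and drops the text around them. If the view should print the whole second half of the sequence with the motifs highlighted, as in the question, the sequence can be pre-split into marked segments before it reaches the template. The following is only a sketch: split_on_motifs and the MOTIF constant are hypothetical names, not part of Bio or Haml.

```ruby
# Hypothetical helper (not part of Bio or Haml): splits a sequence into
# [text, is_motif] pairs so a template can loop over them without any regex.
MOTIF = /WL|KK|RR|KR|R..R|R....R/

def split_on_motifs(sequence, pattern = MOTIF)
  segments = []
  pos = 0
  sequence.scan(pattern) do
    match = Regexp.last_match
    # Plain text between the previous motif and this one.
    segments << [sequence[pos...match.begin(0)], false] if match.begin(0) > pos
    segments << [match[0], true]
    pos = match.end(0)
  end
  # Whatever is left after the last motif.
  segments << [sequence[pos..-1], false] if pos < sequence.length
  segments
end

p split_on_motifs('KWWLLL')
# => [["KW", false], ["WL", true], ["LL", false]]
```

Storing this array in the sub-hash instead of the bare scan result would let the view pick a class per segment and still contain nothing but a loop.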
This is a simple way to define the HAML document inside the body of a script. The template only outputs; there's no logic in it except the loops. Everything else was handled before the view is processed:
haml_doc = <<EOT
!!!
%html
  %head
    :css
      .id {font-weight: bold;}
      .signalp {color:#000099; font-weight: bold;}
      .motif {color:#FF3300; font-weight: bold;}
      h3 {word-wrap: break-word;}
      p {word-wrap: break-word; font-family:Courier New, Courier, Mono;}
  %body
    - data.each do |d|
      %p
        %span.id= d[:id_start]
        %span= d[:id_end]
        %br/
        %span.signalp= d[:signalp]
        - d[:motif].each do |m|
          %span= m
EOT
And here's how to use it:
require 'haml'
data = make_array_of_hashes('sample.txt')
engine = Haml::Engine.new(haml_doc)
puts engine.render(Object.new, :data => data)
Which, when run, outputs:
<!DOCTYPE html>
<html>
  <head>
    <style>
      .id {font-weight: bold;}
      .signalp {color:#000099; font-weight: bold;}
      .motif {color:#FF3300; font-weight: bold;}
      h3 {word-wrap: break-word;}
      p {word-wrap: break-word; font-family:Courier New, Courier, Mono;}
    </style>
  </head>
  <body>
    <p>
      <span class='id'>>isotig00001_f4_14</span>
      <span>Signal P Cleavage Site => 11:12</span>
      <br>
      <span class='signalp'>MMHLLCIVLLL</span>
      <span>WL</span>
    </p>
    <p>
      <span class='id'>>isotig00001_f4_15</span>
      <span>Signal P Cleavage Site => 10:11</span>
      <br>
      <span class='signalp'>MHLLCIVLLL</span>
      <span>WL</span>
    </p>
    <p>
      <span class='id'>>isotig00003_f6_8</span>
      <span>Signal P Cleavage Site => 11:12</span>
      <br>
      <span class='signalp'>MMHLLCIVLLL</span>
      <span>WL</span>
    </p>
    <p>
      <span class='id'>>isotig00003_f6_9</span>
      <span>Signal P Cleavage Site => 10:11</span>
      <br>
      <span class='signalp'>MHLLCIVLLL</span>
      <span>WL</span>
    </p>
    <p>
      <span class='id'>>isotig00004_f6_8</span>
      <span>Signal P Cleavage Site => 11:12</span>
      <br>
      <span class='signalp'>MMHLLCIVLLL</span>
      <span>WL</span>
    </p>
    <p>
      <span class='id'>>isotig00004_f6_9</span>
      <span>Signal P Cleavage Site => 10:11</span>
      <br>
      <span class='signalp'>MHLLCIVLLL</span>
      <span>WL</span>
    </p>
    <p>
      <span class='id'>>isotig00009_f2_3</span>
      <span>Signal P Cleavage Site => 22:23</span>
      <br>
      <span class='signalp'>MLKCFSIIMGLILLLEIGGGCA</span>
      <span>KR</span>
      <span>WL</span>
    </p>
    <p>
      <span class='id'>>isotig00009_f3_9</span>
      <span>Signal P Cleavage Site => 16:17</span>
      <br>
      <span class='signalp'>MKTGIIIFISTVVVLP</span>
      <span>KR</span>
    </p>
    <p>
      <span class='id'>>isotig00009_f6_13</span>
      <span>Signal P Cleavage Site => 11:12</span>
      <br>
      <span class='signalp'>MMHLLCIVLLL</span>
      <span>WL</span>
    </p>
    <p>
      <span class='id'>>isotig00009_f6_14</span>
      <span>Signal P Cleavage Site => 10:11</span>
      <br>
      <span class='signalp'>MHLLCIVLLL</span>
      <span>WL</span>
    </p>
  </body>
</html>
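Since the question also lists ERB as an option, here is roughly the same view sketched with ERB from Ruby's standard library, for anyone who would rather not install a gem. It assumes Ruby 2.6 or later (for the trim_mode keyword) and the same array-of-hashes shape as above. TEMPLATE and render_report are hypothetical names, and unlike the Haml version it escapes the id text and gives the motif spans the motif class.

```ruby
require 'erb'

# Hypothetical ERB counterpart of the Haml view. The '-' trim mode lets
# the <%- ... -%> control lines vanish from the output.
TEMPLATE = ERB.new(<<~'HTML', trim_mode: '-')
  <!DOCTYPE html>
  <html>
  <head>
  <style>
  .id {font-weight: bold;}
  .signalp {color:#000099; font-weight: bold;}
  .motif {color:#FF3300; font-weight: bold;}
  p {word-wrap: break-word; font-family:Courier New, Courier, Mono;}
  </style>
  </head>
  <body>
  <%- data.each do |d| -%>
  <p>
  <span class="id"><%= ERB::Util.html_escape(d[:id_start]) %></span>
  <span><%= ERB::Util.html_escape(d[:id_end]) %></span>
  <br>
  <span class="signalp"><%= d[:signalp] %></span>
  <%- d[:motif].each do |m| -%>
  <span class="motif"><%= m %></span>
  <%- end -%>
  </p>
  <%- end -%>
  </body>
  </html>
HTML

# Renders the report; data is an array of hashes with the keys
# :id_start, :id_end, :signalp and :motif.
def render_report(data)
  TEMPLATE.result_with_hash(data: data)
end

# One hand-written record in the shape make_array_of_hashes returns.
sample = [{
  id_start: '>isotig00001_f4_14',
  id_end: 'Signal P Cleavage Site => 11:12',
  signalp: 'MMHLLCIVLLL',
  motif: ['WL']
}]
html = render_report(sample)
puts html
```

As with the Haml version, the result can be written out with File.write(output, html) instead of being printed.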