This is a common issue I have and my solution is a bit brash. So I'm looking for a quick fix and explanation of the problem.
The problem is that when I decide to save a spreadsheet in excel (mac 2011) as a tab delimited file it seems to do it perfectly fine. Until I try to parse the file line by line using Perl. For some reason it slurps the whole document in one line.
My brutish solution is to open the file in a web browser and copy and paste the information into the tab delimited file in TextEdit (I never use rich text format). I tried introducing a newline in the end of the file before doing this fix and it does not resolve the issue.
What's going on here? An explanation would be appreciated.
~Thanks!~
The problem is the actual character codes that define new lines on different systems. Windows systems commonly use a CarriageReturn+LineFeed (CRLF) and *NIX systems use only a LineFeed (LF).
These characters can be represented in RegEx as \r\n or \n (respectively).
Sometimes, to hash through a text file, you need to parse New Line characters. Try this for DOS-to-UNIX in perl:
perl -pi -e 's/\r\n/\n/g' input.file
or, for UNIX-to-DOS using sed:
$ sed 's/$'"/`echo \\\r`/" input.txt > output.txt
or, for DOS-to-UNIX using sed:
$ sed 's/^M$//' input.txt > output.txt
Found a pretty simple solution to this. Copy data from Excel to clipboard, paste it into a google spreadsheet. Download google spreadsheet file as a 'tab-separated values .tsv'. This gets around the problem and you have tab delimiters with an end of line for each line.
Perl has a useful regex pattern \R
which will match any common line ending. It actually matches any vertical whitespace -- the same as \v
-- or the CR LF combination, so it's the same as \r\n|\v
This is useful here because you can slurp your entire file into a single scalar and then split /\R/
, which will give you a list of file records, already chomp
ed (if you want to keep the line terminators you can split /\R\K/
instead
Another option is the PerlIO::eol
module. It provides a new Perl IO layer that will normalize line endings no matter what the contents of the file are
Once you have loaded the module with use PerlIO::eol
you can use it in an open
statement
open my $fh, '<:eol(LF)', 'myfile.tsv' or die $!;
or you can use the open
pragma to set it as the default layer for all input file handles
use open IN => ':raw:eol(LF)';
which will work fine with an input file from any platform