Import CSV file with mixed data types

Posted 2019-01-05 01:21

I've been working with MATLAB for a few days and I'm having difficulty importing a CSV file into a matrix.

My problem is that my CSV file contains mostly strings and only a few integer values, so csvread() doesn't work: csvread() only handles numeric data.

How can I store my strings in some kind of a 2-dimensional array to have free access to each element?

Here's a sample CSV for my needs:

04;abc;def;ghj;klm;;;;;
;;;;;Test;text;0xFF;;
;;;;;asdfhsdf;dsafdsag;0x0F0F;;

The main issues are the empty cells and the text within the cells. As you can see, the structure may vary.

9 answers
Deceive 欺骗
Reply #2 · 2019-01-05 01:47

For the case when you know how many columns of data there will be in your CSV file, one simple call to textscan like Amro suggests will be your best solution.
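As a rough sketch of that known-column case, the snippet below writes the question's sample to a temporary file and reads it back with a single textscan call. The column count of 10, the temporary file name, and the 'CollectOutput' option are assumptions made for this illustration only.

```matlab
% Write the question's sample data to a temporary file (illustration only)
fileName = [tempname '.csv'];
fid = fopen(fileName, 'w');
fprintf(fid, '%s\n', '04;abc;def;ghj;klm;;;;;', ...
                     ';;;;;Test;text;0xFF;;', ...
                     ';;;;;asdfhsdf;dsafdsag;0x0F0F;;');
fclose(fid);

nCols = 10;                                   % Assumed to be known in advance
fid = fopen(fileName, 'r');
C = textscan(fid, repmat('%s', 1, nCols), ... % One %s specifier per column
             'Delimiter', ';', ...
             'CollectOutput', true);          % Merge the columns into one cell
fclose(fid);
delete(fileName);                             % Clean up the temporary file

C = C{1};                                     % Cell array with nCols columns
```

Every field comes back as a string here; numeric columns can be converted afterwards.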

However, if you don't know a priori how many columns are in your file, you can use a more general approach like I did in the following function. I first used the function fgetl to read each line of the file into a cell array. Then I used the function textscan to parse each line into separate strings using a predefined field delimiter and treating the integer fields as strings for now (they can be converted to numeric values later). Here is the resulting code, placed in a function read_mixed_csv:

function lineArray = read_mixed_csv(fileName, delimiter)

  fid = fopen(fileName, 'r');         % Open the file
  lineArray = cell(100, 1);           % Preallocate a cell array (ideally slightly
                                      %   larger than is needed)
  lineIndex = 1;                      % Index of cell to place the next line in
  nextLine = fgetl(fid);              % Read the first line from the file
  while ~isequal(nextLine, -1)        % Loop while not at the end of the file
    lineArray{lineIndex} = nextLine;  % Add the line to the cell array
    lineIndex = lineIndex+1;          % Increment the line index
    nextLine = fgetl(fid);            % Read the next line from the file
  end
  fclose(fid);                        % Close the file

  lineArray = lineArray(1:lineIndex-1);              % Remove empty cells, if needed
  for iLine = 1:lineIndex-1                          % Loop over lines
    lineData = textscan(lineArray{iLine}, '%s', ...  % Read strings
                        'Delimiter', delimiter);
    lineData = lineData{1};                          % Remove cell encapsulation
    if ~isempty(lineArray{iLine}) && ...             % Skip blank lines, and
       strcmp(lineArray{iLine}(end), delimiter)      %   account for when the
      lineData{end+1} = '';                          %   line ends with a delimiter
    end
    lineArray(iLine, 1:numel(lineData)) = lineData;  % Overwrite line data
  end

end

Running this function on the sample file content from the question gives this result:

>> data = read_mixed_csv('myfile.csv', ';')

data = 

  Columns 1 through 7

    '04'    'abc'    'def'    'ghj'    'klm'    ''            ''        
    ''      ''       ''       ''       ''       'Test'        'text'    
    ''      ''       ''       ''       ''       'asdfhsdf'    'dsafdsag'

  Columns 8 through 10

    ''          ''    ''
    '0xFF'      ''    ''
    '0x0F0F'    ''    ''

The result is a 3-by-10 cell array with one field per cell, where missing fields are represented by the empty string ''. Now you can access each cell or a combination of cells to format them as you like. For example, if you wanted to change the fields in the first column from strings to numeric values, you could use the function str2double as follows:

>> data(:, 1) = cellfun(@(s) {str2double(s)}, data(:, 1))

data = 

  Columns 1 through 7

    [  4]    'abc'    'def'    'ghj'    'klm'    ''            ''        
    [NaN]    ''       ''       ''       ''       'Test'        'text'    
    [NaN]    ''       ''       ''       ''       'asdfhsdf'    'dsafdsag'

  Columns 8 through 10

    ''          ''    ''
    '0xFF'      ''    ''
    '0x0F0F'    ''    ''

Note that the empty fields result in NaN values.
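The same per-column approach could be applied to the hexadecimal fields in column 8. The following is only a sketch: hexStrings is a hypothetical stand-in for data(:, 8), and it assumes every non-empty field carries a '0x' prefix, which is stripped before calling hex2dec.

```matlab
hexStrings = {''; '0xFF'; '0x0F0F'};        % Stand-in for data(:, 8)

isHex  = strncmp(hexStrings, '0x', 2);      % True where a field starts with '0x'
values = nan(size(hexStrings));             % Empty fields stay NaN
values(isHex) = cellfun(@(s) hex2dec(s(3:end)), ...  % Drop the '0x' prefix
                        hexStrings(isHex));
```

Stripping the prefix by hand keeps the sketch independent of whether a given MATLAB release accepts '0x' directly in hex2dec.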

狗以群分
Reply #3 · 2019-01-05 01:47
% Assuming that the dataset is ";"-delimited and each line ends with ";"
fid = fopen('sampledata.csv');
data = {};                              % Grows by one row per non-empty line
ct = 0;                                 % Number of rows stored so far
tline = fgetl(fid);
while ischar(tline)                     % fgetl returns -1 at the end of the file
    id = strfind(tline, ';');           % Positions of the delimiters
    if ~isempty(id)
        ct = ct + 1;
        starts = [1, id(1:end-1) + 1];  % First character of each field
        for I = 1:numel(id)
            data{ct, I} = tline(starts(I):id(I)-1);
        end
    end
    tline = fgetl(fid);
end
fclose(fid);
Melony?
Reply #4 · 2019-01-05 01:48

In R2013b or later you can use a table:

>> table = readtable('myfile.txt','Delimiter',';','ReadVariableNames',false)
table = 

    Var1    Var2     Var3     Var4     Var5        Var6          Var7         Var8      Var9    Var10
    ____    _____    _____    _____    _____    __________    __________    ________    ____    _____

      4     'abc'    'def'    'ghj'    'klm'    ''            ''            ''          NaN     NaN  
    NaN     ''       ''       ''       ''       'Test'        'text'        '0xFF'      NaN     NaN  
    NaN     ''       ''       ''       ''       'asdfhsdf'    'dsafdsag'    '0x0F0F'    NaN     NaN  

See the readtable documentation for more info.
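Once the data is in a table, individual fields can be reached by variable name. The sketch below is an illustration under assumptions: it writes the question's sample to a temporary file, uses the name T rather than table (so the built-in table function is not shadowed), and relies on the default variable names Var1, Var2, and so on.

```matlab
% Write the question's sample data to a temporary file (illustration only)
fileName = [tempname '.csv'];
fid = fopen(fileName, 'w');
fprintf(fid, '%s\n', '04;abc;def;ghj;klm;;;;;', ...
                     ';;;;;Test;text;0xFF;;', ...
                     ';;;;;asdfhsdf;dsafdsag;0x0F0F;;');
fclose(fid);

T = readtable(fileName, 'Delimiter', ';', 'ReadVariableNames', false);
delete(fileName);                 % Clean up the temporary file

lastText = T.Var6{end};           % Contents of the last row of variable Var6
lastRow  = T(end, :);             % Paren indexing returns a one-row table
asCell   = table2cell(T);         % Plain cell array, one cell per field
```

table2cell is handy if the rest of your code expects a cell array like the one returned by the other answers.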

倾城 Initia
Reply #5 · 2019-01-05 01:48

Depending on the format of your file, importdata might work.

You can store Strings in a cell array. Type "doc cell" for more information.
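For a quick feel of how cell arrays give access to each element, here is a minimal sketch. The contents of C are made up from the question's sample and are not read from a file.

```matlab
% A 2-by-3 cell array holding strings of different lengths
C = {'04', 'abc',  '';
     '',   'Test', '0xFF'};

s   = C{2, 2};      % Brace indexing returns the contents (a char array)
sub = C(1, 1:2);    % Paren indexing returns a smaller cell array
C{1, 3} = 'new';    % Assign to a single cell
```

Braces get at what is inside a cell, while parentheses select cells and keep them wrapped in a cell array.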

爷的心禁止访问
Reply #6 · 2019-01-05 01:52

If your input file has a fixed number of comma-separated columns and you know which columns contain the strings, it might be best to use the function

textscan()

Note that you can specify a format where you read up to a maximum number of characters in the string or until a delimiter (comma) is found.
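As an illustration of such a fixed-column format, the sketch below reads a small comma-separated file with one integer, one string, and one floating-point column. The file contents and the variable names ids, names, and values are invented for this example.

```matlab
% Write a small comma-separated sample file (made-up data)
fileName = [tempname '.csv'];
fid = fopen(fileName, 'w');
fprintf(fid, '%s\n', '1,apple,2.5', '2,banana,3.75');
fclose(fid);

fid = fopen(fileName, 'r');
C = textscan(fid, '%d %s %f', 'Delimiter', ',');  % One specifier per column
fclose(fid);
delete(fileName);                                 % Clean up the temporary file

ids    = C{1};    % int32 column vector
names  = C{2};    % Cell array of strings
values = C{3};    % double column vector
```

textscan returns one cell per format specifier, so each column keeps its own data type.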

Bombasti
Reply #7 · 2019-01-05 01:54

Use xlsread; it works just as well on .csv files as it does on .xls files. Specify that you want three outputs:

[num, txt, raw] = xlsread('your_filename.csv')

and it will give you an array containing only the numeric data (num), an array containing only the character data (txt), and an array that contains all data types in the same layout as the .csv file (raw).
