How to import and sort a poorly formed stacked CSV

2019-07-26 21:53发布

问题:

  1. How can I import and sort this data (following code section) to be readily manipulated by R?

  2. Are the organ names, dose unit 'Gy', volume unit 'CC' all three considered 'factors' by R? What is the terminology for the data set name and data variables?

These histograms place one data set sequentially after the other as follows:

Example Data File:

Bladder,,
GY, (CC),
0.0910151,1.34265
0.203907,1.55719
[skipping to end of this data set]
57.6659,0.705927
57.7787,0.196091
,,
CTV-operator,,
GY, (CC),
39.2238,0.00230695
39.233,0
[repeating for remainder of data sets; skipping to end of file]
53.1489,0
53.2009,0.0161487
,,
[blank line]

Data set labels (e.g. Bladder, CTV-operator, Rectum) are sometimes lowercase, and generally in a random order within the file. I have dozens of files categorized in two folders to import and analyze as one large patient sample.

I have started this script, but I suspect there is a better way:

[file = file.path()]
DVH = read.csv(file, header = FALSE, sep = ",", fill = TRUE)

DVH[3] <- NULL      # delete last column from data
loop = 1; notover = TRUE
factor(DVH[loop,1]) # Store the first element as a factor
while(notover)
 {loop = loop + 1   # move to next line
  DVH$1<-factor(DVH[loop,1]) # I must change ...
  DVH$2<-factor(DVH[loop,2]) # ... these lines.

  if([condition indicating end of file; code to be learned]) {notover = FALSE}
 }
# store first element as data label
# store next element as data label
# store data for (observations given) this factor
# if line is blank, move to next line, store first element as new factor, and repeat until end of file

Walter Roberson helped me prepare this code to import and parse the data in MATLAB, and so far I have more or less been trying to do the same thing in R:

for fileloop = 1:length(indexnumber)
    num = 0;
fid = fopen(['filepath to folder',num2str(indexnumber(fileloop)),'.csv'],'rt');
   while true 
     H1 = fgetl(fid) ;
     if feof(fid); break; end 
     H2 = fgetl(fid) ;
     if feof(fid); break; end 
     datacell = textscan(fid, '%f%f', 'delimiter', ',', 'collectoutput', true) ;
     if isempty(datacell) || isempty(datacell{1}); break; end 
     if any(isnan(datacell{1}(end,:))); datacell{1}(end,:) = []; end
     num = num + 1;
     headers(num,:) = {H1, H2} ;
     data(num) = datacell;
   end
   fclose(fid);
   clear datacell H1 H2

Additional Info:

I am new to R with intermediate MATLAB experience. I am switching from MATLAB to R so that my work may be more readily reproducible by others worldwide. (R is free; MATLAB is not.)

This data is from exporting dose-volume histograms from radiation oncology software Velocity for cancer therapy research.

(I asked this question previously for Python but a computer scientist recommended I use R instead.)

Thank you for your time.

回答1:

Here is an alternate version which should work much quicker than processing the file line by line in a for loop. This version reads the entire data file first to a single column data frame and then cleans up the data, which should be much faster than processing via the for loop.

# Load required library
  library(tidyr)

# Create function to process file
  process.file <- function(path){

  # Import data into a single column dataframe
    df <- as.data.frame(scan(path, character(), sep = "\n", quiet = TRUE), stringsAsFactors = FALSE)

  # Set column name
    colnames(df) <- "col1"

  # Copy organ names to new column
    df$organ <- sapply(df[,1], function(x) ifelse(regmatches(x, regexpr(".{2}$", x)) == ",,", gsub('.{2}$', '', x), NA))

  # Fill organ name for all rows
    df <- fill(df, organ, .direction = "down")

  # Remove the rows that contained the organ
    df <- df[regmatches(df[,1], regexpr(".{2}$", df[,1])) != ",,", ]

  # Copy units into a new column
    df$units <- sapply(df[,1], function(x) ifelse(regmatches(x, regexpr(".{1}$", x)) == ",", gsub('.{1}$', '', x), NA))

  # Fill units field for all rows
    df <- fill(df, units, .direction = "down")

  # Separate units into dose.unit and vol.unit columns
    df <- separate(df, units, c("dose.unit","vol.unit"), ", ")

  # Remove the rows that contained the units
    df <- df[regmatches(df[,1], regexpr(".{1}$", df[,1])) != ",", ]

  # Separate the remaining data into dosage and volume columns
    df <- separate(df, col1, c("dosage","volume"), ",")

  # Set data type of dosage and volume to numeric
    df[,c("dosage","volume")] <- lapply(df[,c("dosage","volume")], as.numeric)

  # Reorder columns
    df <- df[, c("organ","dosage","dose.unit","volume","vol.unit")]

  # Return the dataframe
  return(df)
}

# Set path to root folder directory
source.dir <- # Path to root folder here

# Retrieve all files from folder
# NOTE: To retrieve all files from the folder and all of it's subfolders, set: recursive = TRUE
# NOTE: To only include files with certain words in the name, include: pattern = "your.pattern.here"
files <- list.files(source.dir, recursive = FALSE, full.names = TRUE)

# Process each file and store dataframes in list
ldf <- lapply(files, process.file)

# Combine all dataframes to a single dataframe
final.df <- do.call(rbind, ldf)


回答2:

This should read the file into a nicely structured dataframe for further processing. It will allow you to process multiple files and union the data into one dataframe. There are more efficient and dynamic ways to handle getting the file paths, but this should give you a starting point.

# Create function to process a file
process.file <- function(filepath){
  # Open connection to file
  con = file(filepath, "r")

  # Create empty dataframe
  df <- data.frame(Organ = character(),
                           Dosage = numeric(),
                           Dosage.Unit = character(),
                           Volume = numeric(),
                           Volumne.Unit = character(),
                           stringsAsFactors = FALSE)

  # Begin looping through file
  while ( TRUE )
  {
    # Read current line
    line <- readLines(con, n = 1)
    # If at end of file, break the loop
    if ( length(line) == 0 ) { break }

    # If the current line is not equal to ",," and is not a blank line, then process the line
    if(line != ",," & line != ""){
      # If the last two characters of the line are ",,"
      if(substr(line, nchar(line) - 1, nchar(line)) == ",,"){
        # Remove the commas from the line and set the organ type
        organ <- gsub(",,","",line)
      } 
      # If the last character of the line is equal to ","
      else if(substr(line, nchar(line), nchar(line)) == ","){
        # Split the line at the comma
        units <- strsplit(line,",")

        # Set the dosage unit and volume unit
        dose.unit <- units[[1]][1]
        vol.unit <- units[[1]][2]
      }
      # If the line is not a special case
      else{
        # Split the line at the comma
        vals <- strsplit(line,",")

        # Set the dosage value and the volume value
        dosage <- vals[[1]][1]
        volume <- vals[[1]][2]

        # Add the values into the dataframe
        df <- rbind(df, as.data.frame(t(c(organ,dosage,dose.unit,volume,vol.unit))))
      }
    }
  }

  # Set the column names for the dataframe
  colnames(df) <- c("Organ","Dosage","Dosage.Unit","Volume","Volume.Unit")

  # Close the connection to a file
  close(con)

  # Return the dataframe
  return(df)
}


# Create a vector of the files to process
filenames <- c("C:/path/to/file/file1.txt",
               "C:/path/to/file/file2.txt",
               "C:/path/to/file/file3.txt",
               "C:/path/to/file/file4.txt")

# Create a dataframe to hold processed data
df.patient.sample <- data.frame(Organ = character(),
                                Dosage = numeric(),
                                Dosage.Unit = character(),
                                Volume = numeric(),
                                Volumne.Unit = character(),
                                stringsAsFactors = FALSE)

# Process each file in the vector of filenames
for(f in filenames){
  df.patient.sample <- rbind(df.patient.sample, process.file(f))
}