LSTM project not compatible with CSV format

Published 2019-08-16 07:03

Question:

I am trying to replicate Chevalier's LSTM Human Activity Recognition algorithm and ran into a problem when trying to feed it my own data in CSV format. The format used in the repository is plain text (.txt). My CSV data has the following format:

0.000995,8
0.020801,8
0.040977,8
0.060786,8
0.080970,8
...            ...

The original file can be found here. The x-values (time) are in column 0 (-80.060003, etc.) and the y-values (value) are in column 1 (8, 8, etc.). I tried to use pandas

pandas.read_csv(DATASET_PATH + TRAIN + "data_train.csv", skiprows=1, header=None, sep=',', usecols=[0, 1])

but it does not seem to be compatible with the format of the data in the "Prepare Dataset" section (and possibly others as well):

TRAIN = "train/"
TEST = "test/"

# Load "X" (the neural network's training and testing inputs)

def load_X(X_signals_paths):
    X_signals = []

    for signal_type_path in X_signals_paths:
        file = open(signal_type_path, 'r')
        # Read dataset from disk, dealing with text files' syntax
        X_signals.append(
            [np.array(serie, dtype=np.float32) for serie in [
                row.replace('  ', ' ').strip().split(' ') for row in file
            ]]
        )
        file.close()

    return np.transpose(np.array(X_signals), (1, 2, 0))

X_train_signals_paths = [
    DATASET_PATH + TRAIN + "Inertial Signals/" + signal + "train.txt" for signal in INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
    DATASET_PATH + TEST + "Inertial Signals/" + signal + "test.txt" for signal in INPUT_SIGNAL_TYPES
]

X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)


# Load "y" (the neural network's training and testing outputs)

def load_y(y_path):
    file = open(y_path, 'r')
    # Read dataset from disk, dealing with text file's syntax
    y_ = np.array(
        [elem for elem in [
            row.replace('  ', ' ').strip().split(' ') for row in file
        ]], 
        dtype=np.int32
    )
    file.close()

    # Subtract 1 from each output class for friendly 0-based indexing
    return y_ - 1

y_train_path = DATASET_PATH + TRAIN + "y_train.txt"
y_test_path = DATASET_PATH + TEST + "y_test.txt"

y_train = load_y(y_train_path)
y_test = load_y(y_test_path)

This is what happens with my implementation in IPython3:

In[0]:

TRAIN = "train/"
TEST = "test/"

def load_X(X_signals_paths):
    X_signals = []
    for signal_type_path in X_signals_paths:
        file = pandas.read_csv(DATASET_PATH + TRAIN + "data_train.csv", skiprows=1, header=None, sep=',', usecols=[0])
        X_signals.append(
            [np.array(serie, dtype=np.float32) for serie in [
                str(row).replace('  ', ' ').strip().split(' ') for row in file
            ]]
        )

    return np.transpose(np.array(X_signals), (1, 2, 0))

X_train_signals_paths = [
    DATASET_PATH + TRAIN + signal + "train.csv" for signal in INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
    DATASET_PATH + TEST + signal + "test.csv" for signal in INPUT_SIGNAL_TYPES
]

X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)
print(X_train, X_test)

Out[0]:

[[[ 0.]]] [[[ 0.]]]

I hope I can get some help with properly formatting my data so that it works seamlessly with this algorithm. If there are any questions, please let me know.

Answer 1:

The code in your IPython session differs from the original project code you also posted in the question: the working code operates on a bare file handle, not a Pandas DataFrame.

For reference, here is the code from the project you are referring to again:

def load_X(X_signals_paths):
    X_signals = []

    for signal_type_path in X_signals_paths:
        file = open(signal_type_path, 'r')
        # ^ the error comes from replacing this with file = pandas.read_csv(...)

        # Read dataset from disk, dealing with text files' syntax
        X_signals.append(
            [np.array(serie, dtype=np.float32) for serie in [
                row.replace('  ', ' ').strip().split(' ') for row in file
            ]]
        )
        file.close()

file is just an iterator which yields one raw line at a time (a string of characters ending with a newline); on that input it makes sense to strip newlines and squeeze double spaces. But your version already opens, parses, and reformats the contents of the file into a Pandas DataFrame, which doesn't have newlines or spaces, just the numbers, already parsed. Worse, iterating over a DataFrame yields its column labels rather than its rows, so the comprehension only ever sees the label of the single column you selected, which is where the zeros in your output come from. Maybe fall back to the upstream code; or if there is something you want to change in it, figure out how to ask about that change. There's nothing wrong with the CSV as such.
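To make that concrete, here is a minimal sketch; the inline CSV text is a stand-in for your data_train.csv:

import io
import pandas

# A stand-in for the two-column CSV described in the question
csv_text = "0.000995,8\n0.020801,8\n0.040977,8\n"

frame = pandas.read_csv(io.StringIO(csv_text), header=None, sep=',', usecols=[0])

# Iterating over a DataFrame yields the column labels, not the rows...
print(list(frame))                   # [0]
# ...so the inner comprehension in the modified load_X only ever sees "0"
print([str(row) for row in frame])   # ['0']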

Python has a quite capable csv module, so maybe simply use that instead of manually parsing the individual fields out of the CSV:

    # (this needs `import csv` and `import numpy as np` at the top of the file)
    for signal_type_path in X_signals_paths:
        with open(signal_type_path, 'r') as csvfile:
            reader = csv.reader(csvfile)
            # csv.reader already splits each row into fields; keep the first two
            X_signals.append([np.array(row[0:2], dtype=np.float32) for row in reader])

Or, as a minimal change, split on commas instead of spaces, as sketched below. (With your data there is then no real need to squeeze out double spaces at all.)
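A sketch of that minimal change, keeping the structure of the upstream load_X but splitting each line on a comma (this assumes every row of your CSV has the same number of fields, since np.array needs a rectangular result):

import numpy as np

def load_X(X_signals_paths):
    X_signals = []

    for signal_type_path in X_signals_paths:
        with open(signal_type_path, 'r') as file:
            # Each line like "0.000995,8" becomes an array of two floats
            X_signals.append(
                [np.array(serie, dtype=np.float32) for serie in [
                    row.strip().split(',') for row in file
                ]]
            )

    return np.transpose(np.array(X_signals), (1, 2, 0))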

Also, tangentially, your code hardcodes the file it reads. It's probably better to keep the DATASET_PATH and TRAIN parameters entirely in the calling code, and have load_X simply accept a list of full file paths and use them without modifying them in any way.
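For example, a sketch of that separation, reusing the comma-splitting variant from above (DATASET_PATH, TRAIN, TEST, and INPUT_SIGNAL_TYPES come from your existing setup):

import numpy as np

def load_X(file_paths):
    # load_X no longer knows or cares where the files live
    X_signals = []
    for path in file_paths:
        with open(path, 'r') as file:
            X_signals.append(
                [np.array(row.strip().split(','), dtype=np.float32) for row in file]
            )
    return np.transpose(np.array(X_signals), (1, 2, 0))

# Path construction stays entirely in the calling code
X_train = load_X([
    DATASET_PATH + TRAIN + signal + "train.csv" for signal in INPUT_SIGNAL_TYPES
])
X_test = load_X([
    DATASET_PATH + TEST + signal + "test.csv" for signal in INPUT_SIGNAL_TYPES
])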