Reading from Text files in C

Posted 2019-08-09 10:09

Question:

A small question really. What would be the best way to read a text file containing X words and add each word, one by one, to a linked list? For example: The Frog Is Old.

Thus, The, Frog, Is and Old would each be put into a ListNode, all read from a file.

I'm really wondering which function is best to use in conjunction with fscanf, if fscanf is even the best option. All advice is welcome!

Cheers.

EDIT: My query is really, if I wanted to parse a large text file, would it be best to fscanf each word into an array one by one, add to list, free array, and repeat? Or is there a more efficient way?

Answer 1:

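The "%s" conversion specifier skips leading whitespace and then matches a run of non-whitespace characters, so each call yields one word. Give it a field width (done below with the STR macro) so that a long word can't overflow the buffer.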

#define QUOTE(s) #s
#define STR(s) QUOTE(s)

#ifndef BUFSIZE
#  define BUFSIZE 255
#endif

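char buf[BUFSIZE+1];
/* fscanf returns the number of items converted (1 here) and EOF at end of
   input. EOF is nonzero, so compare against 1 instead of testing for truth. */
while (fscanf(fin, "%" STR(BUFSIZE) "s", buf) == 1) {
    /* buf holds next word. Todo:
       + allocate space for word
       + copy word to newly allocated space
       + add to linked list
     */
}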
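
To make the Todo steps concrete, here is a minimal sketch of the fscanf loop filling a singly linked list. The names word_node, read_words, free_words and WORD_MAX are made up for illustration and do not come from the question. Error handling is reduced to stopping at the first failed allocation and returning whatever was read so far.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WORD_MAX 255            /* longest word kept in one piece (assumed limit) */
#define QUOTE(s) #s
#define STR(s) QUOTE(s)

/* Hypothetical node type: one heap-allocated word per node. */
struct word_node {
    char *word;
    struct word_node *next;
};

/* Read whitespace-separated words from fin and append each to a list.
   Returns the head of the list, or NULL if no word was read. */
struct word_node *read_words(FILE *fin)
{
    char buf[WORD_MAX + 1];
    struct word_node *head = NULL;
    struct word_node **tail = &head;    /* where the next node gets linked */

    assert(fin != NULL);
    while (fscanf(fin, "%" STR(WORD_MAX) "s", buf) == 1) {
        struct word_node *node = malloc(sizeof *node);
        if (node == NULL)
            break;                      /* out of memory: keep partial list */
        node->word = malloc(strlen(buf) + 1);
        if (node->word == NULL) {
            free(node);
            break;
        }
        strcpy(node->word, buf);
        node->next = NULL;
        *tail = node;                   /* append in file order */
        tail = &node->next;
    }
    return head;
}

/* Release every node and the word it owns. */
void free_words(struct word_node *head)
{
    while (head != NULL) {
        struct word_node *next = head->next;
        free(head->word);
        free(head);
        head = next;
    }
}
```

Typical use would be to call read_words on a file opened with fopen, walk the list through the next pointers, and call free_words when done. Keeping a pointer to the tail link makes each append O(1) and preserves the order of the words in the file.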

Alternatively, strtok can be used to tokenize (break up) a string into substrings, using a set of separator characters that you specify as a string. Your system may also have strsep, which is intended to replace strtok. Both strtok and strsep modify the array you pass in, so take care that this won't cause issues with other parts of the code that access the data. strtok is not thread-safe, because it keeps its position in hidden static state; if multiple threads will be tokenizing strings, use strsep or strtok_r.

#ifndef BUFSIZE
#  define BUFSIZE 256
#endif

const char separators[] = "\t\n\v\r\f !\"#$%&'()*+,-./:;<=>?@[\\]^`{|}~";
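char buf[BUFSIZE], *word, *rest;

/* fgets stores at most size-1 characters plus the terminator, so pass the
   real size of buf; passing BUFSIZE+1 could write past the end. */
while (fgets(buf, sizeof buf, fin)) {
    /* The first call gets the buffer; later calls pass NULL and continue
       from the position saved in rest. */
    for (word = strtok_r(buf, separators, &rest);
         word != NULL;
         word = strtok_r(NULL, separators, &rest)) {
        /* Todo:
           + allocate space for word
           + copy word to newly allocated space
           + add to linked list
        */
    }
}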

Since the second example reads a line at a time from the file for strtok_r to work on, it will split a word in two if any line of the file is over BUFSIZE-1 characters long and characters number BUFSIZE-1 and BUFSIZE of that line are both letters. A solution to this would be to create a buffered string stream: when the end of the buffer is reached, anything remaining in the buffer is shifted to the front and the rest of the buffer is filled with more data from the file. Just be careful about words longer than the buffer; in production code, that's a potential security vulnerability that could lead to denial of service attacks.
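
The shift-and-refill idea can be sketched roughly as follows. This is only an illustration under some assumptions: words are separated by whitespace only, each_word and join_word are hypothetical names, and CHUNK is set absurdly small so that the carry-over path actually runs. A word that fills the whole buffer is rejected with -1 instead of being split.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 16    /* tiny on purpose; real code would use a few KB */

typedef void (*word_fn)(const char *word, void *ctx);

/* Calls emit once per word in fin. Returns 0 on success, or -1 if a
   word fills the whole buffer and cannot be kept in one piece. */
int each_word(FILE *fin, word_fn emit, void *ctx)
{
    char buf[CHUNK + 1];            /* +1 leaves room for a terminator */
    size_t have = 0;                /* bytes carried over from last round */

    assert(fin != NULL && emit != NULL);
    for (;;) {
        size_t want = CHUNK - have;
        size_t got = fread(buf + have, 1, want, fin);
        size_t len = have + got;
        int at_eof = got < want;    /* short read: end of file or error */
        size_t end = len;
        size_t i = 0;

        if (!at_eof) {
            /* Back up to just after the last whitespace so that a word
               cut off by the buffer edge is not emitted yet. */
            while (end > 0 && !isspace((unsigned char)buf[end - 1]))
                end--;
            if (end == 0)
                return -1;          /* word longer than the buffer */
        }
        while (i < end) {
            size_t start;
            while (i < end && isspace((unsigned char)buf[i]))
                i++;
            start = i;
            while (i < end && !isspace((unsigned char)buf[i]))
                i++;
            if (i > start) {
                buf[i] = '\0';      /* overwrite separator or spare byte */
                emit(buf + start, ctx);
                i++;                /* step past the terminator */
            }
        }
        if (at_eof)
            return 0;
        have = len - end;           /* partial word left at the end */
        memmove(buf, buf + end, have);
    }
}

/* Example callback: appends "word|" to a string passed through ctx.
   Assumes the destination is large enough. */
void join_word(const char *word, void *ctx)
{
    char *dst = ctx;
    strcat(dst, word);
    strcat(dst, "|");
}
```

In the linked-list case the callback would allocate a node and copy the word, since the pointer it receives is only valid until the next refill.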

An issue with all of the above functions is that they don't handle null characters ('\0') in the input, because they treat the first null character as the end of the string. If you wish to parse data that may contain null characters, you'll need something other than the standard string functions, which may mean writing your own.
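
One way to write your own is to track lengths explicitly instead of relying on a terminator. A rough sketch, assuming the data has already been read with something like fread (which reports how many bytes it delivered) and that words are separated by whitespace; struct span and split_spans are hypothetical names:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* A word described by position and length, so it may contain '\0'. */
struct span {
    size_t start;
    size_t len;
};

/* Record up to max_out words found in data[0..n) and return how many
   were recorded. A null byte is treated like any other non-space byte. */
size_t split_spans(const char *data, size_t n,
                   struct span *out, size_t max_out)
{
    size_t count = 0;
    size_t i = 0;

    assert(data != NULL && out != NULL);
    while (i < n && count < max_out) {
        size_t start;
        while (i < n && isspace((unsigned char)data[i]))
            i++;                    /* skip separators */
        start = i;
        while (i < n && !isspace((unsigned char)data[i]))
            i++;                    /* consume one word */
        if (i > start) {
            out[count].start = start;
            out[count].len = i - start;
            count++;
        }
    }
    return count;
}
```

Copying a word out then means memcpy with the recorded length, not strcpy, and the list node has to store that length alongside the bytes.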

As for efficiency, any algorithm you use is going to need to read from the file (which is O(n) in complexity, and will require I/O, slowing down the program) and allocate memory to store the words. Whether you use fscanf, strtok or some other method, the time and space complexity aren't likely to vary much; about the only thing that might is how many intermediate buffers get allocated. Your best bet to find the most efficient implementation is to try a couple and profile them.



Answer 2:

You shouldn't be looking for "a more efficient way" until you have "a way that's not efficient enough".

But something like strtok might fit your needs without so much mallocing. It lets you carve up the string in place. (Use With Caution!)
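
A small sketch of what carving in place can look like, assuming the text is already in a writable buffer and that whitespace is the only separator. split_in_place is a made-up name. No word is copied; the array just collects pointers into the original buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split text in place. Stores up to max_words pointers into text and
   returns how many were stored. strtok overwrites separators in text
   with '\0', and it is not thread-safe. */
size_t split_in_place(char *text, char **words, size_t max_words)
{
    static const char seps[] = " \t\r\n";
    size_t n = 0;
    char *tok;

    assert(text != NULL && words != NULL);
    for (tok = strtok(text, seps);
         tok != NULL && n < max_words;
         tok = strtok(NULL, seps)) {
        words[n++] = tok;           /* points inside text, no malloc */
    }
    return n;
}
```

The caution is that the pointers are only valid while the buffer is: reuse or free the buffer and every stored word goes with it. List nodes built this way would hold borrowed pointers and must not free them individually.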



Answer 3:

If you're after high speed on a modern desktop computer, you can go multi-threaded:

  • One thread fills a buffer of characters, say 4 KB, and does only that
  • One thread reads the buffer, parses the words and adds them to the list
  • One thread does whatever you want with the list, if you don't need the list as a whole

The idea is that the process won't sleep while waiting for I/O. For additional speed, if you have a lot of CPU cores, cut the file into big chunks and have each core process one chunk. There are lots of opportunities for complex code and bugs, but hey, speed ain't cheap...
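
One detail of the chunk idea is easy to get wrong: a naive equal split can land in the middle of a word. The sketch below covers only that boundary calculation, not the threading itself, and assumes the whole file is already in memory. chunk_bounds is a hypothetical helper; each boundary is pushed forward to the next whitespace character.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Fill bounds[0..nchunks] so that chunk k covers data[bounds[k]] up to,
   but not including, data[bounds[k+1]]. No boundary falls inside a word.
   bounds must have room for nchunks + 1 entries. */
void chunk_bounds(const char *data, size_t n, size_t nchunks,
                  size_t *bounds)
{
    size_t k;

    assert(data != NULL && bounds != NULL && nchunks > 0);
    bounds[0] = 0;
    for (k = 1; k < nchunks; k++) {
        size_t pos = n / nchunks * k;       /* naive equal split */
        if (pos < bounds[k - 1])
            pos = bounds[k - 1];            /* keep boundaries ordered */
        while (pos < n && !isspace((unsigned char)data[pos]))
            pos++;                          /* slide to a word gap */
        bounds[k] = pos;
    }
    bounds[nchunks] = n;
}
```

Each worker would then tokenize its own range and build its own partial list, and the partial lists would be joined in chunk order at the end. Chunks can come out uneven or even empty when words are long compared to the chunk size.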



Tags: c scanf