Using C/C++ to efficiently de-serialize a string c

Posted 2019-02-23 08:09

I have large strings that resemble the following...

some_text_token

24.325973 -20.638823  

-1.964366 0.753947  
-1.290811 -3.547422  
0.813014 -3.547227  

0.472015 3.723311  
-0.719116 3.676793  

other_text_token  

24.325973 20.638823  

-1.964366 0.753947  
-1.290811 -3.547422  
-1.996611 -2.877422  
0.813014 -3.547227  

1.632365 2.083673  
0.472015 3.723311  
-0.719116 3.676793  

...

...from which I'm trying to efficiently, and in the interleaved sequence they appear in the string, grab...

  1. the text tokens
  2. the float values
  3. the blank lines

...but I'm having trouble.

I've tried strtod and successfully grabbed the floats from the string, but I can't seem to get a loop using strtod to report back to me the interleaved text tokens and blank lines. I'm not 100% confident strtod is the "right track" given the interleaved tokens and blank lines that I'm also interested in.

The tokens and blank lines are present in the string to give context to the floats so my program knows what the float values occurring after each token are to be used for, but strtod seems more geared, understandably, toward just reporting back floats it encounters in a string without regard for silly things like blank lines or tokens.

I know this isn't very hard conceptually, but being relatively new to C/C++ I'm having trouble judging what language features I should focus on to take best advantage of the efficiency C/C++ can bring to bear on this problem.

Any pointers? I'm very interested in why various approaches function more or less efficiently. Thanks!!!

3 answers
Melony?
#2 · 2019-02-23 08:30

This is a bit crude and untested, but the general idea is to try parsing each line and see what's there:

while (!feof (stdin))
{
    char buf [100];
    if (!fgets (buf, sizeof buf, stdin))
        break;  // end of file or error

    // skip leading whitespace
    char *cp = buf;
    while (isspace ((unsigned char) *cp))  // cast avoids undefined behavior for negative char values
         ++cp;

    if (*cp == '\000')  // blank line?
    {
        do_whatever_for_a_blank_line ();
        continue;
    }

    // try reading a float
    double v1, v2;
    char *ep = NULL;
    v1 = strtod (cp, &ep);
    if (ep == cp)   // if nothing parsed
    {
        do_whatever_for_a_text_token (cp);
        continue;
    }

    cp = ep;        // resume right after the first number
    while (isspace ((unsigned char) *cp))
       ++cp;
    ep = NULL;
    v2 = strtod (cp, &ep);
    if (ep == cp)   // if no float parsed
    {
         handle_single_floating_value (v1);
         continue;
    }
    handle_two_floats (v1, v2);  
 }
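
Since the question starts from a string already in memory rather than a stream, the same idea can be applied without fgets. The following is only a sketch under that assumption: the names line_kind, line_info, and next_line are hypothetical, and it assumes lines are separated by '\n'. Note that strtod also accepts text such as "inf" or "nan", so a token beginning that way would be read as a number.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
typedef enum { LINE_BLANK, LINE_TOKEN, LINE_FLOATS } line_kind;

typedef struct {
    line_kind   kind;
    const char *text;      /* start of the token (LINE_TOKEN only) */
    size_t      text_len;  /* token length, trailing whitespace excluded */
    double      v[2];      /* parsed values (LINE_FLOATS only) */
    int         nvalues;   /* 1 or 2 (LINE_FLOATS only) */
} line_info;

/* Classify the line starting at *pp and advance *pp past its newline.
   Returns 0 once the string is exhausted, 1 otherwise. */
int next_line(const char **pp, line_info *out)
{
    const char *p = *pp;
    if (*p == '\0')
        return 0;

    const char *eol = strchr(p, '\n');
    const char *end = eol ? eol : p + strlen(p);
    *pp = eol ? eol + 1 : end;

    /* Skip leading whitespace, but never past the end of this line. */
    while (p < end && isspace((unsigned char)*p))
        ++p;
    if (p == end) {
        out->kind = LINE_BLANK;
        return 1;
    }

    char *ep;
    double first = strtod(p, &ep);
    if (ep == p) {                       /* nothing parsed: a text token */
        const char *t = end;
        while (t > p && isspace((unsigned char)t[-1]))
            --t;
        out->kind = LINE_TOKEN;
        out->text = p;
        out->text_len = (size_t)(t - p);
        return 1;
    }

    out->kind = LINE_FLOATS;
    out->v[0] = first;
    out->nvalues = 1;

    /* strtod itself skips whitespace, newlines included, so stop at the
       line end by hand before trying for a second value. */
    p = ep;
    while (p < end && isspace((unsigned char)*p))
        ++p;
    if (p < end) {
        double second = strtod(p, &ep);
        if (ep != p) {
            out->v[1] = second;
            out->nvalues = 2;
        }
    }
    return 1;
}
```

A caller would keep a const char * cursor at the start of the string and call next_line(&cursor, &info) in a loop until it returns 0. Nothing is copied, and each character is visited only a small, constant number of times, which is about as cheap as this kind of parsing gets.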
兄弟一词,经得起流年.
#3 · 2019-02-23 08:32

Using C, I would do something like this (untested):

#include <stdio.h>

#define MAX 128

char buf[MAX];
while (fgets(buf, sizeof buf, fp) != NULL) {
    double d1, d2;
    if (buf[0] == '\n') {
        /* saw blank line */
    } else if (sscanf(buf, "%lf%lf", &d1, &d2) != 2) {
        /* buf has the next text token, including '\n' */
    } else {
        /* use the two doubles, d1, and d2 */
    }
}

The check for blank line is first because it's relatively inexpensive. Depending upon your needs:

  1. you might need to increase/change MAX,
  2. you may need to check whether buf ends with a newline; if it doesn't, the line was too long (go to 1 or 3 in that case),
  3. you might need a function that reads full lines from a file, using malloc() and realloc() to dynamically allocate the buffer (see this for more),
  4. you might want to take care of special cases such as a single floating-point value on a line (which I assume is not going to happen). sscanf() returns the number of input items successfully matched and assigned.
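
For point 3, a line reader that grows its buffer might look roughly like the sketch below. The function name read_full_line and the starting capacity are assumptions made for illustration, not part of any library.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read one complete line of any length from fp.
   Returns a malloc'd string (the caller frees it), including the '\n'
   if one was read, or NULL at end of file or on allocation failure. */
char *read_full_line(FILE *fp)
{
    size_t cap = 128, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;                  /* got a complete line */

        /* No newline yet: either the buffer was too small or the
           input ended without one. Grow and try to read more. */
        size_t newcap = cap * 2;
        char *tmp = realloc(buf, newcap);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap = newcap;
    }

    if (len > 0)
        return buf;                      /* final line without a newline */
    free(buf);
    return NULL;
}
```

The loop above would then become while ((line = read_full_line(fp)) != NULL) { ... free(line); }, at the cost of one allocation per line.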

I am also assuming that blank lines are really blank (just the newline character by itself). If not, you will need to skip leading white-space. isspace() in ctype.h is useful in that case.
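
As a rough sketch of both points (whitespace-only blank lines and the sscanf() return count), something like the following could replace the buf[0] == '\n' test. The helper names is_blank and count_values are made up for this example. The blank check has to come first because sscanf() returns EOF, not 0, when it reaches the end of the string before converting anything.

```c
#include <ctype.h>
#include <stdio.h>

/* Returns 1 if s contains nothing but whitespace before its terminator. */
int is_blank(const char *s)
{
    while (isspace((unsigned char)*s))
        ++s;
    return *s == '\0';
}

/* Hypothetical wrapper: returns -1 for a blank line, otherwise the number
   of doubles sscanf converted: 0 means a text token, 1 a single value,
   2 a pair. */
int count_values(const char *line, double *d1, double *d2)
{
    if (is_blank(line))
        return -1;
    return sscanf(line, "%lf%lf", d1, d2);
}
```

With this, a return value of 1 identifies the single-value lines from point 4 without any extra parsing.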

fp is a valid FILE * object returned by fopen().

放我归山
#4 · 2019-02-23 08:47

Wow, I don't write many parsers in C any more.

This has been tested on the OP's input:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum {
  scan_blank, scan_label, scan_float
} tokens;

double f1, f2;

char line[512], string_token[sizeof line];

tokens scan(void) {
  char *s;
  for(s = line; *s; ++s) {
    switch(*s) {
      case ' ':
      case '\t':
        continue;
      case '\n':
        return scan_blank;
      case '0': case '1': case '2': case '3': case '4':
      case '5': case '6': case '7': case '8': case '9':
      case '.': case '-': case '+':
        sscanf(line, " %lf %lf", &f1, &f2);
        return scan_float;
      default:
        sscanf(line, " %s", string_token);
        return scan_label;
    }
  }
  /* empty or whitespace-only line with no '\n', e.g. the last line of input */
  return scan_blank;
}

int main(void) {
  int n;
  for(n = 1;; ++n) {
    if (fgets(line, sizeof line, stdin) == NULL)
      return 0;
    /* strcspn gives the length up to the newline, if there is one */
    printf("%2d %-40.*s", n, (int)strcspn(line, "\n"), line);
    switch(scan()) {
      case scan_blank:
        printf("blank\n");
        break;
      case scan_label:
        printf("label [%s]\n", string_token);
        break;
      case scan_float:
        printf("floats [%lf %lf]\n", f1, f2);
        break;
    }
  }
}