How to parse data between tags from a text file in C

Posted 2019-06-14 20:41

I want to print the data between tags from a text file using C.

Input statement : (PERSON) Mark Zuckerberg (/PERSON) is a entrepreneur from (LOCATION) USA (/LOCATION). He is also the CEO of (ORGANIZATION) Facebook (/ORGANIZATION).

Output: Mark Zuckerberg USA Facebook.

My program code is:

    const char* getfield(char* line, int num)
    {
        const char* tok;
        for (tok = strtok(line, "/>");
                tok && *tok;
                tok = strtok(NULL, "<\n"))
        {
            if (!--num)
                return tok;
        }
        return NULL;
    }

    int main()
    {
        char line[500000];
        while (fgets(line, 500000, stdin))
        {
            char* tmp = strdup(line);
            printf(" %s\n", getfield(tmp, 2));
            free(tmp);
        }
    }

It is only printing Mark Zuckerberg. The other data between tags is not shown. Can someone please help me find where I went wrong? I have just started learning file processing in C, so guidance is highly appreciated. Thanks.

EDIT: Please replace "(" by "<" and ")" by "/>".

3 answers
Juvenile、少年°
#2 · 2019-06-14 20:58

Your getfield doesn't do what you want, I guess. On the example string (after replacing the parentheses), the strtok call that starts your for loop cuts at the first ">" (strtok treats each character of the delimiter string as a separate delimiter), so at the one after the first "PERSON". After that you only cut at "<" or "\n", so at the start of each following tag. With a big enough num, the successive tokens inside the loop would be:

<PERSON
 Mark Zuckerberg 
/PERSON> is a entrepreneur from 
LOCATION> USA 
/LOCATION>. He is also the CEO of 
ORGANIZATION> Facebook 
/ORGANIZATION>

You should alternate searches: search for the end of the opening tag (>), then search for the start of the next tag (<). What lies in between is the content of the 1st tag. Then skip the closing tag and start again the same way until the end. Something like:

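#include <stdio.h>
#include <string.h>

// returns the num-th (0-based) tag content, or NULL if there is none
// note: strtok overwrites delimiters in line with '\0'
char *gf(char *line, int num) {
  char *n1, *n2;
  // comments describe the 1st iteration of the loop
  // search for the end of the 1st tag (opening)
  n1 = strtok(line, ">\n");
  while (n1) {
    // search for the beginning of the 2nd tag (corresponding closing tag)
    n2 = strtok(NULL, "<");
    // nothing left after the last tag: stop instead of using NULL
    if (n2 == NULL) {
      break;
    }
    // this one is good, shall we return it?
    if (num-- == 0) {
      return n2;
    }
    // debug trace of the entries that are skipped
    printf("Found: %s\n", n2);
    // search for the end of the 2nd tag (have to skip it)
    n1 = strtok(NULL, ">\n");
    // search for the end of the 3rd tag (opening), then loop (same situation)
    n1 = strtok(NULL, ">\n");
  }
  return NULL;
}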

Note that this code is not very nice. If you have ">" or "<" inside regular text it will go wrong (as does your own code, BTW). Text after the last closing tag also needs care: strtok then returns NULL for the next content, and that must be checked before the result is used.

Second note: in any case, if you need a robust approach you will have to actually read the tags. I mean find a tag (the text between "<" and ">"), then find the corresponding closing tag (the same, but with the "/" and the same content), and only then take the text inside, or generate an error.
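A minimal sketch of that more robust idea, under a few assumptions: tags look like <NAME> and </NAME>, they are not nested with the same name, and the helper name next_tagged and its return codes are made up for this illustration. It does not use strtok, so the input string is left untouched.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the answer above): finds the next
 * <NAME>...</NAME> pair at or after *cursor, copies the text between the
 * two tags into out, and advances *cursor past the closing tag.
 * Returns 1 on success, 0 if there is no further '<', and -1 if a tag is
 * malformed, has no matching closing tag, or does not fit the buffers.
 */
int next_tagged(const char **cursor, char *out, size_t outsz)
{
    char closing[64];
    const char *tag_start = strchr(*cursor, '<');
    if (tag_start == NULL)
        return 0;
    const char *tag_end = strchr(tag_start, '>');
    if (tag_end == NULL)
        return -1;
    size_t name_len = (size_t)(tag_end - tag_start - 1);
    /* Reject "<>", a stray closing tag, and names too long for closing[]. */
    if (name_len == 0 || tag_start[1] == '/' || name_len + 4 > sizeof closing)
        return -1;
    /* Build the matching closing tag, e.g. "</PERSON>". */
    snprintf(closing, sizeof closing, "</%.*s>", (int)name_len, tag_start + 1);
    const char *close_tag = strstr(tag_end + 1, closing);
    if (close_tag == NULL)
        return -1;
    size_t len = (size_t)(close_tag - (tag_end + 1));
    if (len >= outsz)
        return -1;
    memcpy(out, tag_end + 1, len);
    out[len] = '\0';
    *cursor = close_tag + strlen(closing);
    return 1;
}
```

To use it, set a const char * cursor to the start of the line and call next_tagged in a loop, printing out each time it returns 1; a return value of -1 is the "generate an error" case.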

EDIT: I changed the function so that it returns the num-th element. Your main() function now has to call it several times with increasing values of num, storing (or printing) each result until it gets a NULL answer. As homework, you will have to work out how to manage the main string (line) in main so that successive calls are possible (otherwise you will really only get the first tag) :)
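One possible answer to that homework, as a sketch only: since strtok writes '\0' into the string it scans, each call can be given a fresh copy of the line. The names nth_field and collect_fields are hypothetical, and nth_field is just a compact stand-in for a strtok-based lookup like the one above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative strtok-based lookup: returns the num-th (0-based) tag
 * content in line, or NULL. It overwrites delimiters in line with '\0'. */
static char *nth_field(char *line, int num)
{
    char *tag = strtok(line, ">\n");      /* end of an opening tag */
    while (tag != NULL) {
        char *text = strtok(NULL, "<");   /* content up to the next tag */
        if (text == NULL)
            return NULL;
        if (num-- == 0)
            return text;
        strtok(NULL, ">\n");              /* skip the closing tag */
        tag = strtok(NULL, ">\n");        /* end of the next opening tag */
    }
    return NULL;
}

/* Calls nth_field with num = 0, 1, 2, ... on a fresh copy of line each
 * time and appends every result to out (truncating if out is too small).
 * Returns the number of fields found, or -1 on failure. */
int collect_fields(const char *line, char *out, size_t outsz)
{
    size_t len = strlen(line) + 1;
    char *copy = malloc(len);
    int count = 0;
    if (copy == NULL || outsz == 0) {
        free(copy);
        return -1;
    }
    out[0] = '\0';
    for (;;) {
        memcpy(copy, line, len);          /* undo what strtok did last time */
        char *field = nth_field(copy, count);
        if (field == NULL)
            break;
        strncat(out, field, outsz - strlen(out) - 1);
        count++;
    }
    free(copy);
    return count;
}
```

This rescans the line from the start for every field, which is fine for short lines but quadratic in the number of tags; a single pass that remembers its position would avoid that.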

Fickle 薄情
#3 · 2019-06-14 21:02

EDIT: Changed ( and ) to < and > respectively in the code and in the explanation. Thanks Tom.

Here is the solution. Try to understand the code, and feel free to modify it to your needs. Basically, you need to scan for the tags <...> and </...> by looking for the characters < and >. When you encounter the character <, increment your index until you encounter the character >. Once you encounter >, start copying the characters that follow it until you encounter another <, then repeat the process until you reach the null terminating character '\0'.

#include <stdio.h>
//#pragma warning(disable : 4996)

void removeTags(char inpData[], int dataLen);

int main()
{
   char fileData[400];
   int letter;      /* int, not char, so that EOF can be told apart from data */
   int numLetters;
   FILE *pfile;
   pfile = fopen("test.txt", "r");

   if (pfile == NULL)
   {
      printf("Error! Cannot open file\n");
      return 1;
   }
   else
   {
      numLetters = 0;
      /* stop one short of the buffer size to leave room for '\0' */
      while (numLetters < (int)sizeof(fileData) - 1 && (letter = fgetc(pfile)) != EOF)
      {
         fileData[numLetters] = (char)letter;
         numLetters++;
      }
      fileData[numLetters] = '\0';
      fclose(pfile);
      printf("File Data:\n\n");
      printf("%s", fileData);
      printf("\nRemoving Tags.....\n");
      removeTags(fileData, numLetters);
   }

   return 0;
}

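void removeTags(char inpData[], int inpLen)
{
   char temp[400];
   int index = 0, tindex = 0;

   /* inpLen bounds the scan; temp must keep room for the final '\0' */
   while (index < inpLen && inpData[index] != '\0' && tindex < (int)sizeof(temp) - 1)
   {
      if ((inpData[index] >= 'A' && inpData[index] <= 'Z') || (inpData[index] >= 'a' && inpData[index] <= 'z') || inpData[index] == ' ' || inpData[index] == '.')
      {
         temp[tindex] = inpData[index];
         index++;
         tindex++;
      }
      else if (inpData[index] == '<')
      {
         /* skip the tag, stopping at the end of the data if '>' is missing */
         while (inpData[index] != '>' && inpData[index] != '\0')
         {
            index++;
         }
         if (inpData[index] == '>')
         {
            index++;
         }
         /* replace the tag with a space, except at the very start */
         if (tindex > 0)
         {
            temp[tindex] = ' ';
            tindex++;
         }
      }
      else
      {
         /* any other character (digit, comma, newline, ...) ends the scan */
         break;
      }
   }
   temp[tindex] = '\0';
   printf("%s", temp);
}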
Summer. ? 凉城
#4 · 2019-06-14 21:12

Your fgets() call is reading the entire line. Then you call getfield() and print the result. You then discard the rest of what you read, try to read more, there isn't any more, and you exit your loop. You need to keep looping as long as you have unprocessed data in line.

Edit: Here's some sample code to get you started:

int main()
{
    char line[500000];
    while (fgets(line, 500000, stdin))
    {
        char *arg = line;
        const char *tok;
        while ((tok = getfield(arg, 2)) != NULL) {
            printf("%s\n", tok);
            arg = NULL;
        }
    }
}

But note that this isn't a real solution. For one, it will give you the text outside of the tags as well as text inside the tags, so you will need to skip that. For another, it won't really work properly if your input file contains more than one line.
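To illustrate both caveats, here is a rough sketch of a different technique (a small character-level state machine, not the strtok approach above). It keeps only the text that lies inside an open tag and carries its state from one line to the next, so a tag may open on one line and close on a later one. The struct tag_filter and the function tag_filter_feed are names invented for this example, and it assumes "<" and ">" never appear in regular text.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical state carried across successive lines of input. */
struct tag_filter {
    int depth;    /* number of tags currently open */
    int in_tag;   /* nonzero while between '<' and '>' */
    int closing;  /* nonzero if the current tag started with "</" */
    int first;    /* nonzero if the next character directly follows '<' */
};

/* Copies into out only those characters of chunk that lie inside at least
 * one open tag (extra characters are dropped if out is too small).
 * Returns the number of characters written, not counting the final '\0'. */
size_t tag_filter_feed(struct tag_filter *st, const char *chunk,
                       char *out, size_t outsz)
{
    size_t n = 0;
    if (outsz == 0)
        return 0;
    for (const char *p = chunk; *p != '\0'; p++) {
        if (st->in_tag) {
            if (st->first) {
                st->closing = (*p == '/');
                st->first = 0;
            }
            if (*p == '>') {
                /* End of a tag: a closing tag lowers the depth,
                 * an opening tag raises it. */
                st->in_tag = 0;
                if (st->closing) {
                    if (st->depth > 0)
                        st->depth--;
                } else {
                    st->depth++;
                }
            }
        } else if (*p == '<') {
            st->in_tag = 1;
            st->first = 1;
            st->closing = 0;
        } else if (st->depth > 0 && n + 1 < outsz) {
            out[n++] = *p;
        }
    }
    out[n] = '\0';
    return n;
}
```

In main, you would zero-initialize one struct tag_filter before the fgets loop, then call tag_filter_feed on each line and print out after every call.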
