I want to print the data between tags from a text file using C.
Input statement: (PERSON) Mark Zuckerberg (/PERSON) is a entrepreneur from (LOCATION) USA (/LOCATION). He is also the CEO of (ORGANIZATION) Facebook (/ORGANIZATION).
Output: Mark Zuckerberg USA Facebook.
My program code is:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    const char* getfield(char* line, int num)
    {
        const char* tok;
        for (tok = strtok(line, "/>");
             tok && *tok;
             tok = strtok(NULL, "<\n"))
        {
            if (!--num)
                return tok;
        }
        return NULL;
    }

    int main()
    {
        char line[500000];
        while (fgets(line, 500000, stdin))
        {
            char* tmp = strdup(line);
            printf(" %s\n", getfield(tmp, 2));
            free(tmp);
        }
    }
It only prints Mark Zuckerberg; the data between the other tags is not shown. Can someone please point out where I went wrong? I have just started learning file processing in C, so guidance is highly appreciated. Thanks.
EDIT: Please replace "(" by "<" and ")" by "/>".
Your getfield doesn't do what you want, I guess. On the example string (replacing the parentheses), when your for loop starts, strtok will cut at the first ">" (strtok uses any of the characters as a delimiter), so at the one after the first "PERSON". After that you only cut on the delimiters of the second call ("<\n" in your code), so each later token runs up to the next "<". With a big enough num, the tokens returned inside the loop would mix tag remnants with ordinary text.
You should alternate searches: search for the end of the opening tag (>), then search for the start of the next tag (<): in between is the content of the first tag. Then skip the closing tag and start again the same way until the end of the string.
Note that this kind of code is not very robust. If you have ">" or "<" inside regular text it will go wrong (as does your own code, BTW). And it doesn't stop properly if the string doesn't end with a \n.
Second note: in any case, if you need a robust approach you will have to parse the tags. I mean: find a tag (the stuff between "<" and ">"), then find the corresponding closing tag (same, but with the / and the same name), and only then take the text inside, or report an error.
EDIT: I changed the function so that it returns the num-th element. You will now have to write a main() function able to call this function several times, with increasing values of num, storing (or printing) the result until it gets a NULL answer. As homework you will have to find how to manage the main string (line) in main so that successive calls are possible (else you will really only get the first tag) :)
EDIT: Changed ( and ) to < and > respectively in the code and in the explanation. Thanks Tom.
Here is the solution. Try to understand the code. Also, feel free to modify it to your needs. Basically, what you need to do is scan for the tags <> and/or </> by scanning for the characters < and >. When you encounter the character <, increment your index until you encounter the character >. Once you encounter the > character, start copying the characters following it until you encounter another < character, then repeat the process until you reach the null terminating character '\0'.
Your fgets() call is reading the entire line. Then you call getfield() and print the result. You then discard the rest of what you read, try to read more, there isn't any more, and you exit your loop. You need to keep looping as long as you have unprocessed data in line.
Edit: Here's some sample code to get you started:
But note that this isn't a real solution. For one, it will give you the text outside of the tags as well as text inside the tags, so you will need to skip that. For another, it won't really work properly if your input file contains more than one line.