I am supposed to get an input line that can be in any of the following formats:
- There must be space between word 1 and word 2.
- There must be a comma between word 2 and word 3.
- Spaces are not a must between word 2 and word 3 — but any number of spaces is possible.
How can I separate 1, 2 and 3 word cases and put the data into the correct variables?
word1
word1 word2
word1 word2 , word3
word1 word2,word3
I thought about something like:
sscanf("string", "%s %s,%s", word1, word2, word3);
but it doesn't seem to work.
I use strict C89.
int n = sscanf("string", "%s %[^, ]%*[, ]%s", word1, word2, word3);
The return value in n tells you how many assignments were made successfully. The %[^, ] is a negated character-class match that finds a word not including either commas or blanks (add tabs if you like). The %*[, ] is a match that finds a comma or space but suppresses the assignment.
I'm not sure I'd use this in practice, but it should work. It is, however, untested.
Maybe a tighter specification is:
int n = sscanf("string", "%s %[^, ]%*[,]%s", word1, word2, word3);
The difference is that the non-assigning character class only accepts a comma. sscanf() stops at any space (or EOS, end of string) after word2, and skips spaces before assigning to word3. The previous edition allowed a space between the second and third words in lieu of a comma, which the question does not strictly allow.
As pmg suggests in a comment, the assigning conversion specifications should be given a length to prevent buffer overflow. Note that the length does not include the null terminator, so the value in the format string must be one less than the size of the arrays in bytes. Also note that whereas printf() allows you to specify sizes dynamically with *, sscanf() et al use * to suppress assignment. That means you have to create the string specifically for the task at hand:
char word1[20], word2[32], word3[64];
int n = sscanf("string", "%19s %31[^, ]%*[,]%63s", word1, word2, word3);
(Kernighan & Pike suggest formatting the format string dynamically in their (excellent) book 'The Practice of Programming', 1999.)
Just found a problem: given "word1 word2 ,word3", it doesn't read word3. Is there a cure?
Yes, there's a cure, and it is actually trivial, too. Add a space in the format string before the non-assigning, comma-matching conversion specification. Thus:
#include <stdio.h>

static void tester(const char *data)
{
    char word1[20], word2[32], word3[64];
    int n = sscanf(data, "%19s %31[^, ] %*[,]%63s", word1, word2, word3);
    printf("Test data: <<%s>>\n", data);
    printf("n = %d; w1 = <<%s>>, w2 = <<%s>>, w3 = <<%s>>\n", n, word1, word2, word3);
}

int main(void)
{
    const char *data[] =
    {
        "word1 word2 , word3",
        "word1 word2 ,word3",
        "word1 word2, word3",
        "word1 word2,word3",
        "word1 word2 , word3",
    };
    enum { DATA_SIZE = sizeof(data)/sizeof(data[0]) };
    size_t i;
    for (i = 0; i < DATA_SIZE; i++)
        tester(data[i]);
    return(0);
}
Example output:
Test data: <<word1 word2 , word3>>
n = 3; w1 = <<word1>>, w2 = <<word2>>, w3 = <<word3>>
Test data: <<word1 word2 ,word3>>
n = 3; w1 = <<word1>>, w2 = <<word2>>, w3 = <<word3>>
Test data: <<word1 word2, word3>>
n = 3; w1 = <<word1>>, w2 = <<word2>>, w3 = <<word3>>
Test data: <<word1 word2,word3>>
n = 3; w1 = <<word1>>, w2 = <<word2>>, w3 = <<word3>>
Test data: <<word1 word2 , word3>>
n = 3; w1 = <<word1>>, w2 = <<word2>>, w3 = <<word3>>
Once the 'non-assigning character class' only accepts a comma, you can abbreviate that to a literal comma in the format string:
int n = sscanf(data, "%19s %31[^, ] , %63s", word1, word2, word3);
Plugging that into the test harness produces the same result as before. Note that all code benefits from review; it can often (essentially always) be improved even after it is working.
#include <stdio.h>
#include <string.h>

int main(void)
{
    char str[] = "word1 word2,word3";   /* strtok modifies the string, so use an array */
    char *pch;
    printf("Splitting string \"%s\" into tokens:\n", str);
    pch = strtok(str, " ,");            /* delimiters: space and comma */
    while (pch != NULL)
    {
        printf("%s\n", pch);
        pch = strtok(NULL, " ,");       /* same delimiters on every call */
    }
    return 0;
}
Abstract:
This answer is divided into three parts. The first part answers the general question of "properly using sscanf" by describing the benefits of sscanf and when it is preferable to use it. The second part answers the specific question. The third part is crucial to both the general and the specific parts of the question; it describes, as fully and as simply as I could, how sscanf works internally.
Part 1, the advantage of using sscanf: sscanf divides a big problem (the original input line) into smaller problems (the output tokens) in one step.
If the line rules are well defined (for example, the line rules in the question are well defined: there must be a space between word 1 and word 2; there must be a comma between word 2 and word 3; spaces are not a must between word 2 and word 3, but any number of spaces is possible), then sscanf can give a Yes/No answer to the question "does the current input line conform to the line rules?" (without trying to analyze and understand what is typed in the input file, or what was intended to be typed there), and it can also give the output tokens of the line; both immediately.
For this purpose, separating the input string into tokens, it is convenient to use %c. Remember that by default sscanf skips over whitespace characters (spaces, tabs, and newlines), but not in the case of %c (or %[...]), where sscanf reads the whitespace character and assigns it to the corresponding variable.
Using strtok instead is more general and flexible, but it lacks the advantages of processing a whole line at once and of sscanf's rich lexical vocabulary (%d, %f, %c, *, ^ and so on). When the line rules are well defined and a Yes/No answer to the question "does the current input line conform to the line rules?" is enough, these advantages can be used.
Part 2, answering the specific question: here is an sscanf call that seems to work, followed by an explanation of it. (The number 100 is assumed to be at least the maximum input line length; each string buffer must therefore hold at least 101 bytes.)
The call:
n = sscanf("sssfdf wret , 123 fdsgs fdgsdfg",
"%100[^ ]%c%100[^,] %c %100[^\n]", s1, &ch1, s2, &ch2, s3);
will result in:
s1 = "sssfdf";
ch1 = ' ';
s2 = "wret ";
ch2 = ',';
s3 = "123 fdsgs fdgsdfg";
Read up to 100 characters, or all the characters before the first space, into s1. (Remember the condition that there should be exactly one space between the first word and the second word.)
Read the next character into ch1 (later we can check that ch1 is a space).
Read up to 100 characters, or all the characters before the first comma, into s2; s2 may contain spaces, which are removed later. (There should be a comma between the second word and the third word, with optional spaces before and after the comma.)
Note that %100[^ ]%c%100[^,] is written with no spaces: a space before the first %c would cause the character after the space to be read into ch1, and a space before %100[^,] would allow more than one space between the first word and the second word.
Read the next character into ch2 (later we can check that ch2 is a comma).
Read the rest of the input string into s3 (from the first non-whitespace character to the end of the string).
What is left is to check the validity of s1, s2 and s3 (and to test that ch1 and ch2 are a space and a comma).
Part 3, the internal work of sscanf: the sscanf() function reads its format string one character at a time. Each character is one of three kinds: whitespace, '%', or anything else.
1. If the next character in the format string is neither whitespace nor '%', sscanf compares it with the next character of the input string.
1.1 If the next character in the input string is not the character in the
format string, sscanf stops and returns to the caller with the
number of arguments it has assigned so far.
Example:
n = sscanf(" 2 22.456","2%f",&FloatArg); /* n is 0 */
1.2 If the next character in the input string is the character in the format
string, then sscanf continues with the next character of the format
string.
Example:
n = sscanf("2 22.456","2%f",&FloatArg); /* n is 1, FloatArg is 22.456 */
2. If the next character in the format string is '%', then sscanf (for most
conversions) skips over whitespace and tries to read input that matches the
conversion. For example, for %f it expects an optional sign followed by
digits, optionally containing a decimal point.
Examples: 31.25, 32., 3
2.1 If sscanf does not find that format, it returns with the number of
arguments it has assigned so far.
Example:
n = sscanf("aaa","%f",&FloatArg); /* n is 0 */
2.2 If sscanf has read at least one digit, or a series of digits followed by a
'.', then when it encounters a non-digit it concludes that it has
reached the end of the float. sscanf() puts the non-digit back in the
input and assigns the value read to the floating-point variable.
Example 1:
n = sscanf("2 22.456","2%f",&FloatArg); /* FloatArg is 22.456 */
Example 2:
n = sscanf("22.456","2%f",&FloatArg); /* FloatArg is 2.456 */
3. If the next character in the format string is whitespace, sscanf skips
over any whitespace before the next input character.
A. Reading characters (%c): if the next input character is whitespace (for example, a space), that whitespace character is assigned to the indicated variable.
B. Reading strings (%s): any character other than whitespace is acceptable,
so sscanf() skips whitespace up to the first non-whitespace character and then saves non-whitespace characters until it hits whitespace again. sscanf appends '\0', the string terminator, to the assigned string.
C. This answer does not go into the variations of the % conversion specification, %[*][width][modifiers]type. A good description of that part is at http://docs.roxen.com/(en)/pike/7.0/tutorial/strings/sscanf.xml
Note that the %[characters] conversion described at the link above is used in the answer to the specific question, and allows flexible string manipulation.
D. The above is what I found by searching the internet and testing various strings in Dev-C++ 5.11. It is not promised to be complete; constructive comments will be accepted with thanks and will help me improve the answer.
This is beyond the scope of scanf and friends, to be perfectly honest. In addition to the answers suggesting "write your own simple parser", you could invest in yacc to parse the grammar (the lexer is left as an exercise for the reader):
line: oneword | twowords | threewords;
oneword: word;
twowords: word word;
threewords: word word ',' word;
word: STRING;
This may be overkill for you here, but if you ever need to parse even more than marginally complex formats, it's a lifesaver.