I am trying to read UTF-8 text from a text file and then print some of it to another file. I am using Linux and the gcc compiler. This is the code I am using:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fin;
    FILE *fout;
    int character;

    fin = fopen("in.txt", "r");
    fout = fopen("out.txt", "w");

    while ((character = fgetc(fin)) != EOF) {
        putchar(character);               // It displays the right character (UTF-8) in the terminal
        fprintf(fout, "%c ", character);  // It displays weird characters in the file
    }

    fclose(fin);
    fclose(fout);
    printf("\nFile has been created...\n");
    return 0;
}
It works for English characters for now.
If you do not wish to use the wide-character functions, experiment with the following:
Read and write bytes, not characters. In other words, use binary I/O, not text.
fgetc() effectively gets a byte from a file, but it returns that byte as an int. If you store the result in a plain char, a byte greater than 127 becomes negative and can be confused with EOF, so keep the value in an int. fputc() likewise takes an int; both work fine as long as you never truncate the value to a char along the way.
Also, in the open mode, use binary: "rb" and "wb" rather than "r" and "w" (see the sketch below).
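A minimal sketch combining these points, not the original answer's code; the file names in.txt and out.txt are taken from the question:

#include <stdio.h>

int main(void)
{
    FILE *fin = fopen("in.txt", "rb");    /* binary mode */
    FILE *fout = fopen("out.txt", "wb");
    int byte;                             /* int, not char, so EOF stays distinguishable */

    if (fin == NULL || fout == NULL) {
        perror("fopen");
        return 1;
    }

    while ((byte = fgetc(fin)) != EOF)
        fputc(byte, fout);                /* copies every byte unchanged, UTF-8 included */

    fclose(fin);
    fclose(fout);
    return 0;
}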
Instead of

fprintf(fout, "%c ", character);

use

fprintf(fout, "%c", character);
The second fprintf() does not contain a space after %c, which is what was causing out.txt to display weird characters. The reason is that fgetc() retrieves a single byte (the same size as an ASCII character), not a whole UTF-8 character. Since UTF-8 is ASCII compatible, English characters get written to the file just fine. putchar(character) outputs the bytes sequentially, without the extra space between every byte, so the original UTF-8 sequence remains intact. To see what I'm talking about, try adding the same space after every putchar() call and the terminal output will break in the same way. If you want to write UTF-8 characters with a space between them to out.txt, you would need to handle the variable-length encoding of a UTF-8 character yourself (see the sketch below).
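One way to do that, sketched below, is to inspect the lead byte of each UTF-8 sequence: its high bits encode how many bytes the character occupies, so you can copy whole characters and put the space between characters instead of between bytes. (The helper utf8_len is my own name for this, not something from the original answer.)

#include <stdio.h>

/* Length in bytes of a UTF-8 sequence, judged from its lead byte. */
static int utf8_len(int lead)
{
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;                             /* 0xxxxxxx (ASCII) or invalid lead byte */
}

int main(void)
{
    FILE *fin = fopen("in.txt", "rb");
    FILE *fout = fopen("out.txt", "wb");
    int c;

    if (fin == NULL || fout == NULL) {
        perror("fopen");
        return 1;
    }

    while ((c = fgetc(fin)) != EOF) {
        int len = utf8_len(c);
        fputc(c, fout);                   /* lead byte */
        while (--len > 0 && (c = fgetc(fin)) != EOF)
            fputc(c, fout);               /* continuation bytes of this character */
        fputc(' ', fout);                 /* one space per character, not per byte */
    }

    fclose(fin);
    fclose(fout);
    return 0;
}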
The C-style solutions are very insightful, but if you would consider using C++, the task becomes much more high level and does not require so much knowledge about UTF-8 encoding. Consider the following:
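A minimal sketch of that idea (my reconstruction, since the original snippet is not shown): UTF-8 is byte-oriented, so plain file streams opened in binary mode pass it through untouched, and copying the stream buffer needs no decoding at all.

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream fin("in.txt", std::ios::binary);
    std::ofstream fout("out.txt", std::ios::binary);

    if (!fin || !fout) {
        std::cerr << "Could not open the files\n";
        return 1;
    }

    fout << fin.rdbuf();  // copy the whole stream; UTF-8 bytes pass through untouched
    std::cout << "File has been created...\n";
    return 0;
}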
This code worked for me.
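What follows is a minimal sketch in that spirit, assuming the fixes discussed above (the exact snippet may have differed): the loop from the question with the extra space removed, binary file modes, and basic error checking.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fin = fopen("in.txt", "rb");
    FILE *fout = fopen("out.txt", "wb");
    int character;

    if (fin == NULL || fout == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    while ((character = fgetc(fin)) != EOF) {
        putchar(character);      /* terminal shows the UTF-8 text correctly */
        fputc(character, fout);  /* no extra space, so multi-byte sequences stay intact */
    }

    fclose(fin);
    fclose(fout);
    printf("\nFile has been created...\n");
    return 0;
}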