I have the following C function in a Windows API project. It reads a file and, based on the line endings it finds (UNIX, Mac, DOS), replaces them with the correct line endings for Windows (\r\n):
// Standard C header needed for string functions
#include <string.h>
// Defines for line-ending conversion function
#define LESTATUS INT
#define LE_NO_CHANGES_NEEDED (0)
#define LE_CHANGES_SUCCEEDED (1)
#define LE_CHANGES_FAILED (-1)
/// <summary>
/// If a block of data loaded from a file contains UNIX (\n) or MAC (\r) line endings, this function replaces them with DOS (\r\n) endings.
/// </summary>
/// <param name="inData">An array of bytes of input data.</param>
/// <param name="inLen">The size, in bytes, of inData.</param>
/// <param name="outData">An array of bytes to be populated with output data. This array must already be allocated</param>
/// <param name="outLen">The maximum number of bytes that can be stored in outData.</param>
/// <param name="bytesWritten">A pointer to an integer that receives the number of bytes written into outData.</param>
/// <returns>
/// If no changes were necessary (the file already contains \r\n line endings), then the return value is LE_NO_CHANGES_NEEDED.<br/>
/// If changes were necessary, and it was possible to store the entire output buffer, the return value is LE_CHANGES_SUCCEEDED.<br/>
/// If changes were necessary but the output buffer was too small, the return value is LE_CHANGES_FAILED.<br/>
/// </returns>
LESTATUS ConvertLineEndings(BYTE* inData, INT inLen, BYTE* outData, INT outLen, INT* bytesWritten)
{
    char *posR = strstr(inData, "\r");
    char *posN = strstr(inData, "\n");
    // Case 1: the file already contains DOS/Windows line endings.
    // So, copy the input array into the output array as-is (if we can)
    // Report an error if the output array is too small to hold the input array; report success otherwise.
    if (posN != NULL && posR != NULL)
    {
        if (outLen >= inLen)
        {
            strcpy(outData, inData);
            return LE_NO_CHANGES_NEEDED;
        }
        return LE_CHANGES_FAILED;
    }
    // Case 2: the file contains UNIX line endings.
    else if (posN != NULL && posR == NULL)
    {
        int i = 0;
        int track = 0;
        for (i = 0; i < inLen; i++)
        {
            if (inData[i] != '\n')
            {
                outData[track] = inData[i];
                track++;
                if (track > outLen) return LE_CHANGES_FAILED;
            }
            else
            {
                outData[track] = '\r';
                track++;
                if (track > outLen) return LE_CHANGES_FAILED;
                outData[track] = '\n';
                track++;
                if (track > outLen) return LE_CHANGES_FAILED;
            }
            *bytesWritten = track;
        }
    }
    // Case 3: the file contains Mac-style line endings.
    else if (posN == NULL && posR != NULL)
    {
        int i = 0;
        int track = 0;
        for (i = 0; i < inLen; i++)
        {
            if (inData[i] != '\r')
            {
                outData[track] = inData[i];
                track++;
                if (track > outLen) return LE_CHANGES_FAILED;
            }
            else
            {
                outData[track] = '\r';
                track++;
                if (track > outLen) return LE_CHANGES_FAILED;
                outData[track] = '\n';
                track++;
                if (track > outLen) return LE_CHANGES_FAILED;
            }
            *bytesWritten = track;
        }
    }
    return LE_CHANGES_SUCCEEDED;
}
However, I feel like this function is very long (almost 70 lines) and could be reduced somehow. I've searched on Google but couldn't find anything useful; is there any function in either the C library or the Windows API that will allow me to perform a string-replace rather than manually searching the string byte-by-byte in O(n) time?
Every character needs to be looked at precisely once, no more and no less. The very first lines of your code already make repeated comparisons, because both strstr calls start at the same position. You could have used something like a single search for the first \r, followed by a check of the character after it, and if this fails, continue from where you ended if you did find an \r or, if posR == NULL, start from the top again. But in that last case strstr has already "looked at" every character until the end!
Two additional notes:
1. There was no need for strstr, because you are looking for a single character; use strchr next time.
2. The strXXX functions all assume your input is a properly formed C string: it should end with a terminating 0. However, you already provide the length in inLen, so you don't have to check for zeroes. If there may or may not be a 0 in your input before inLen bytes, you need to take appropriate action. Based on the purpose of this function, I'm assuming you don't need to check for zeroes at all.
My proposal: look at every character from the start once, and only take action when it is either an \r or an \n. If the first of these you encounter is an \r and the next one is an \n, you're done. (This assumes the line endings are not "mixed".)
If you do not return in this first loop, there is something other than \r\n, and you can continue from that point on. But you still only have to act on either an \r or an \n! So I propose shorter code along those lines (and an enum instead of your defines).
There are a few old and rare 'plain text' formats that use other constructions; from memory, something like \r\n\n. If you want to be able to sanitize anything, you can add a skip for all \r characters after a single \n, and the same for the opposite case. This will also clean up any "mixed" line endings, as it will correctly treat \r\n as well.
Here's what I would consider somewhat simpler code, with half as many lines. Of course, as Ben Voigt pointed out, you can't beat O(n) time, so I made no attempt to do so. I didn't use any library functions, because it seems simpler this way, and I doubt that extra function calls could make the code faster.
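As an illustration only (this is not the answerer's original listing), a single-pass converter of this kind might look like the sketch below. It assumes plain C types (unsigned char, size_t) in place of BYTE and INT, and the names NormalizeLineEndings and LeStatus are hypothetical. Each input byte is read once; \r\n, a lone \r, and a lone \n are each treated as one line break and written as \r\n, and the bounds check happens before every write rather than after it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes; an enum stands in for the original #defines. */
typedef enum {
    LE_FAILED    = -1,  /* output buffer too small */
    LE_UNCHANGED =  0,  /* every line break was already \r\n */
    LE_CONVERTED =  1   /* at least one lone \r or \n was rewritten */
} LeStatus;

/* Hypothetical helper: not part of the C library or the Windows API. */
LeStatus NormalizeLineEndings(const unsigned char *in, size_t inLen,
                              unsigned char *out, size_t outLen,
                              size_t *bytesWritten)
{
    size_t i = 0;   /* read position in the input */
    size_t o = 0;   /* write position in the output */
    LeStatus status = LE_UNCHANGED;

    assert(in != NULL && out != NULL && bytesWritten != NULL);
    *bytesWritten = 0;

    while (i < inLen) {
        unsigned char c = in[i++];

        if (c == '\r' || c == '\n') {
            /* \r directly followed by \n is one break, already in DOS form. */
            if (c == '\r' && i < inLen && in[i] == '\n')
                i++;
            else
                status = LE_CONVERTED;  /* lone \r or lone \n */

            if (outLen - o < 2)         /* need room for both bytes */
                return LE_FAILED;
            out[o++] = '\r';
            out[o++] = '\n';
        } else {
            if (o >= outLen)            /* need room for one byte */
                return LE_FAILED;
            out[o++] = c;
        }
    }

    *bytesWritten = o;
    return status;
}
```

Because the input length is passed explicitly, the sketch never relies on a terminating 0. In the worst case (input made only of lone \r or \n bytes) the output is twice the input size, so a caller that allocates 2 * inLen bytes for the output should never see LE_FAILED.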
The biggest differences in my code are that