C++ Get HTML Source

I would like to know how I can download a website's HTML source into a string, without using LibCurl. I have searched online for examples on using Wininet.

Below is an example code I used for Wininet. How would I do the same using Winsock?

#include "stdafx.h"
#include <windows.h>
#include <wininet.h>
#include <iostream>
#include <string>
#include <stdio.h>
#include <stdlib.h>
using namespace std;

#pragma comment ( lib, "Wininet.lib" )

int main()
{
    HINTERNET hInternet = InternetOpenA("InetURL/1.0", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);

    HINTERNET hConnection = InternetConnectA(hInternet, "google.com", 80, " ", " ", INTERNET_SERVICE_HTTP, 0, 0);

    HINTERNET hData = HttpOpenRequestA(hConnection, "GET", "/", NULL, NULL, NULL, INTERNET_FLAG_KEEP_CONNECTION, 0);

    char buf[2048];
    string lol;
    HttpSendRequestA(hData, NULL, 0, NULL, 0);

    DWORD bytesRead = 0;
    DWORD totalBytesRead = 0;
    // http://msdn.microsoft.com/en-us/library/aa385103(VS.85).aspx
    // To ensure all data is retrieved, an application must continue to call the
    // InternetReadFile function until the function returns TRUE and the
    // lpdwNumberOfBytesRead parameter equals zero. 
    while (InternetReadFile(hData, buf, sizeof(buf) - 1, &bytesRead) && bytesRead != 0)
    {
        buf[bytesRead] = 0; // insert the null terminator.

        puts(buf);          // print it to the screen.
        lol.append(buf, bytesRead); // append exactly the bytes that were read.

        printf("%lu bytes read\n", bytesRead); // DWORD is an unsigned long.

        totalBytesRead += bytesRead;
    }

    printf("\n\n END -- %lu bytes read\n", bytesRead);
    printf("\n\n END -- %lu TOTAL bytes read\n", totalBytesRead);

    InternetCloseHandle(hData);
    InternetCloseHandle(hConnection);
    InternetCloseHandle(hInternet);

    cout << "\nThe beginning." << endl << endl << endl;

    cout << lol << endl;


    system("PAUSE");
}

This example of WinSock works for sites without additional paths. How would I grab the HTML of a page like this: (www.website.com/page)

#include "stdafx.h"
#include <iostream>
#include <winsock2.h>
#include <cstring>
#include <string>
#include <fstream>
using namespace std;

#pragma comment ( lib, "ws2_32.lib" )


string get_source()
{
    WSADATA WSAData;
    WSAStartup(MAKEWORD(2, 0), &WSAData);

    SOCKET sock;
    SOCKADDR_IN sin;

    char buffer[1024];

    ////////////////This is portion that is confusing me//////////////////////////////////////////////////
    string srequete = "GET /id/AeroNX/ HTTP/1.1\r\n";
    srequete += "Host: steamcommunity.com\r\n";
    srequete += "Connection: close\r\n";
    srequete += "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n";
    srequete += "Accept-Language: fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3\r\n";
    srequete += "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n";
    srequete += "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3\r\n";
    srequete += "Referer: http://pozzyx.net/\r\n";
    srequete += "\r\n";
    ///////////////////////////////////////////////////////////////////////////////////////////////////////

    size_t requete_taille = srequete.size() + 1;

    char crequete[5000];
    strncpy(crequete, srequete.c_str(), requete_taille);

    int i = 0;
    string source = "";

    sock = socket(AF_INET, SOCK_STREAM, 0);

    sin.sin_addr.s_addr = inet_addr("63.228.223.103"); // epguides.com //why wont it work for 72.233.89.200 (whatismyip.com)
    sin.sin_family = AF_INET;
    sin.sin_port = htons(80); // HTTP port.

    connect(sock, (SOCKADDR *)&sin, sizeof(sin)); // connect to the web site.
    send(sock, crequete, strlen(crequete), 0); // why do we send the string??


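    do
    {
        i = recv(sock, buffer, sizeof(buffer), 0); // the buffer receives the incoming data.
        if (i > 0)
            source.append(buffer, i); // append only the bytes actually received; buffer is not null-terminated.
    } while (i > 0); // 0 means the server closed the connection, a negative value means an error.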


    closesocket(sock); // close the socket.
    WSACleanup();

    return source;
}

int main()
{
    ofstream fout;
    fout.open("Buffer.txt");
    fout << get_source(); // the string url doesnt matter
    fout.close();
    system("PAUSE");
}

Okay, I see you just need help on one little bit of HTTP, not a breakdown on the whole thing. I'm going to leave my full description for future readers, though, after I give the short answer for you.

Short answer:

In the first line, where you say GET /foo/bar.html HTTP/1.1, the middle part (/foo/bar.html) is the path to the resource. So, for example, if you want to get http://www.myserver.com/foo/bar.html then you put /foo/bar.html there. If you wanted to get http://www.myserver.com/get/my/file.html then the first line of your request would be GET /get/my/file.html HTTP/1.1. The remaining lines of your request do not need to change to get a different resource (although you'll want to change Host: if you get something from a different server entirely, e.g., Host: www.myserver.com).

Full description of HTTP:

Are you trying to get it without using any libraries, just plain sockets? If so, you'll have to implement HTTP yourself (the client side of it, anyway), but the good news is HTTP is really easy to learn and almost as easy to implement. :)

To send a request for a page, open a connection to port 80 on the web server. Then send it this:

GET <resource> HTTP/1.1\r\n
Host: <web_server_name>\r\n
Connection: close\r\n
\r\n

Note that I have explicitly put in the \r\n line breaks to show you. There are two important things about them: 1) you must use \r\n and not just \n in the protocol, and 2) the end of the HTTP header must have a double \r\n\r\n. (For your request, there is no data section, so the end of the header is also the end of your entire request message.)
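As a minimal sketch of the request format above, the function below assembles the four lines into one string. The name build_get_request and its parameters are made up for illustration; they are not part of Winsock or any library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds a minimal HTTP/1.1 GET request for
// `resource` on the server named `host`.
std::string build_get_request(const std::string& host, const std::string& resource)
{
    assert(!host.empty());  // the Host field is mandatory in HTTP/1.1

    // An empty path is treated as the root of the site.
    const std::string path = resource.empty() ? "/" : resource;

    std::string request;
    request += "GET " + path + " HTTP/1.1\r\n";  // request line
    request += "Host: " + host + "\r\n";         // DNS name of the web server
    request += "Connection: close\r\n";          // server may close after responding
    request += "\r\n";                           // blank line ends the header
    return request;
}
```

With Winsock, the result could then be sent with something like send(sock, request.c_str(), (int)request.size(), 0), which avoids copying it into a fixed-size char array first.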

Replace <resource> with the path to the file you want to get, and <web_server_name> with the DNS name of the web server. For example, if you wanted to retrieve http://www.cc.gatech.edu/~davel/classes/cs3251/summer2011/test/hypertext.html then the <web_server_name> (Host field) is www.cc.gatech.edu and the <resource> is /~davel/classes/cs3251/summer2011/test/hypertext.html.
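If you start from a full URL, the split into Host and resource can be sketched roughly as follows. UrlParts and split_http_url are hypothetical names, and the sketch assumes a plain http:// URL with no port number, user info, or https.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type for illustration.
struct UrlParts {
    std::string host;      // goes in the Host: field
    std::string resource;  // goes in the GET line
};

// Splits "http://host/path" (or "host/path") into its two parts.
// Sketch only: ports, https and query handling are left out.
UrlParts split_http_url(const std::string& url)
{
    assert(!url.empty());  // precondition for this sketch

    const std::string scheme = "http://";
    std::size_t start = 0;
    if (url.compare(0, scheme.size(), scheme) == 0)
        start = scheme.size();  // skip the scheme if present

    UrlParts parts;
    const std::size_t slash = url.find('/', start);
    if (slash == std::string::npos) {
        parts.host = url.substr(start);  // no path given
        parts.resource = "/";            // so request the root
    } else {
        parts.host = url.substr(start, slash - start);
        parts.resource = url.substr(slash);  // keeps the leading '/'
    }
    return parts;
}
```

For the question's example, www.website.com/page would give the host www.website.com and the resource /page.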

The web server will send back an HTTP response message on the same socket. If all goes well, you will get a message back that looks something like this:

HTTP/1.1 200 OK\r\n
Date: Mon, 23 May 2005 22:38:34 GMT\r\n
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)\r\n
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT\r\n
ETag: "3f80f-1b6-3e1cb03b"\r\n
Content-Type: text/html; charset=UTF-8\r\n
Content-Length: 131\r\n
Connection: close\r\n
\r\n
<html>
<head>
  <title>An Example Page</title>
</head>
<body>
  Hello World, this is a very simple HTML document.
</body>
</html>

Note the double \r\n\r\n again, which denotes the end of the HTTP header. After that is the data section, which contains the HTML source of the page. I have omitted explicitly showing the line breaks for the data section because they are part of the data itself, not the HTTP protocol (so they don't have to be \r\n). Also note the Content-Length field. It tells you how many bytes long the data section is (the HTML source, in this case), so you can read the correct length from the socket. There is no \r\n at the end of the data section. (The data itself may or may not include a line break at the end. If it does, it will be included in the Content-Length bytes.)
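To make the header/data split concrete, here is a rough sketch that works on a response already read completely into a string. extract_body is a hypothetical name, and as a simplification it matches the field name "Content-Length:" with exactly that capitalization, although real servers may use different letter case.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the data section of a complete HTTP response,
// or an empty string if the end of the header or Content-Length is missing.
std::string extract_body(const std::string& response)
{
    const std::string header_end = "\r\n\r\n";
    const std::size_t split = response.find(header_end);
    if (split == std::string::npos)
        return "";  // no blank line, so no complete header

    const std::string header = response.substr(0, split);
    const std::size_t body_start = split + header_end.size();
    assert(body_start <= response.size());

    const std::string field = "Content-Length:";
    const std::size_t pos = header.find(field);
    if (pos == std::string::npos)
        return "";  // this sketch requires Content-Length

    // strtoul skips the space after the colon and stops at the '\r'.
    const unsigned long length =
        std::strtoul(header.c_str() + pos + field.size(), nullptr, 10);

    // Take at most `length` bytes after the blank line.
    return response.substr(body_start, length);
}
```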

The only mildly hard part is receiving and parsing the HTTP messages. I find the easiest way to receive HTTP is to read one line at a time from the socket, parsing each header field as you see it (you don't have to handle every field; you can probably ignore many of them). Once you get the blank line, you know the header is done. Then just read the correct number of bytes from the socket for your data payload, as specified by Content-Length. (It's probably a good idea to error check before reading the data section by verifying 1) that you got 200 OK in the first line of the response, since anything else means the request did not simply succeed (an error or a redirect, for example), and 2) that you actually got a Content-Length field somewhere in the header. An HTTP/1.1 server is allowed to omit Content-Length, for instance when it sends Transfer-Encoding: chunked, so that second check matters.)
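The line-at-a-time approach could look something like the sketch below, which parses header text that has already been received rather than reading from a socket. ResponseHeader, parse_header, and ready_to_read_body are hypothetical names, and the field lookup is case-sensitive as a simplification.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical result type for illustration.
struct ResponseHeader {
    int status = 0;                             // e.g. 200
    std::map<std::string, std::string> fields;  // field name -> value
};

// Parses the header one line at a time and stops at the blank line.
ResponseHeader parse_header(const std::string& header_text)
{
    ResponseHeader result;
    std::istringstream lines(header_text);
    std::string line;
    bool first = true;
    while (std::getline(lines, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();  // getline removed '\n'; drop the '\r' too
        if (line.empty())
            break;            // blank line: the header is done
        if (first) {
            // Status line, e.g. "HTTP/1.1 200 OK"
            std::istringstream status_line(line);
            std::string version;
            status_line >> version >> result.status;
            first = false;
            continue;
        }
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos)
            continue;         // not a "Name: value" line; ignore it
        std::string value = line.substr(colon + 1);
        value.erase(0, value.find_first_not_of(" \t"));  // trim leading blanks
        result.fields[line.substr(0, colon)] = value;
    }
    return result;
}

// The two checks suggested above: a 200 status and a Content-Length field.
bool ready_to_read_body(const ResponseHeader& h)
{
    return h.status == 200 && h.fields.count("Content-Length") == 1;
}
```

Fields you don't care about simply sit unused in the map, which matches the advice that you can ignore many of them.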

Also, the Connection: close field in the request, which is echoed back in the response, says that the server can close the TCP connection after it has sent you the response. If you want to make many requests, you might use Connection: keep-alive instead, but it gets a little more complicated because you then have to pay attention to the Connection field in the response. Technically, the server is allowed to send back Connection: close and close the socket even if you requested a keep-alive. So just going with Connection: close produces simpler code, and is perfectly adequate if you only want one page anyway.

The Wikipedia page for HTTP is of some help, but lacks detail. (I did shamelessly rip my HTTP response example from there, though.) https://en.wikipedia.org/wiki/Http

If someone has a link for a better online reference for HTTP (that's easier to follow than reading the standards document), please feel free to add it / edit this post, or put it in a comment.