Varnish C VRT variables/functions

Posted 2019-07-07 02:19

Question:

I'm starting to pick up Varnish and have come across references to VRT functions in C code in our configuration (and in examples on the net) that I can't find documentation for (at least none that I understand; my C knowledge is non-existent). This is the best I can find, but it's just the prototypes: http://fossies.org/dox/varnish-4.0.2/vrt__obj_8h.html#a7b48e87e48beb191015eedf37489a290

So here's an example we use (and which seems to be copypasta from the net as I've found it plenty of times):

C{
  #include <ctype.h>
  static void strtolower(char *c) {
    for (; *c; c++) {
      if (isupper(*c)) {
        *c = tolower(*c);
      }
    }
  }
}C

sub vcl_recv {
...stuff....
if (req.url ~ "<condition>" && (<another if condition>)) {
  C{
    strtolower((char *)VRT_r_req_url(sp));
  }C
}
}

So my questions are:

  1. What is sp here? Where does it come from? It's not defined anywhere nor can I find anything about it
  2. What does VRT_r_req_url do? Why is it VRT_ prefixed and what is the r (I see there are VRT_l_ functions too). What is this struct it gets data from?
  3. Are all of these VRT functions parallels to get variables equivalent to say req.url outside of a C block?
  4. Is there documentation somewhere that says what all of these do? For example I've seen this one a few times as well:

    sub detectmobile {
      C{
        VRT_SetHdr(sp, HDR_BEREQ, "\020X-Varnish-TeraWurfl:", "no1", vrt_magic_string_end);
      }C
     }
    

    So what is HDR_BEREQ and vrt_magic_string_end here?

Answer 1:

This is going to be a pretty long answer, because there's a fair bit to say regarding your question. First, some nits about the C code in your VCL:

  1. Implementing strtolower is probably unnecessary; the standard vmod has a std.tolower function. If you are running Varnish 3, you should use this instead. (That said, the existence of this seems to imply you might be using Varnish 2, so who knows?)
  2. Your call to VRT_SetHdr seems unnecessary. I don't see any difference between that and set bereq.http.X-Varnish-TeraWurfl = "no1";

Some of my answers may not be precisely accurate because it's unclear which version of Varnish you're using, so I'm going to have to guess.

Now, to get at your questions:

  1. What is sp here? Where does it come from? It's not defined anywhere nor can I find anything about it

sp is idiomatic in Varnish to mean session pointer. It is of type struct sess and contains some context about an in-progress request. Depending on what version of Varnish you're using, this may have more or less context, so it's hard to really define the scope. In Varnish 2, the session contains everything from workspace to request state (and much in between). Varnish 4 has split this out considerably.

I'm guessing that you're using Varnish 2 or Varnish 3. In Varnish 4, you would be passing around something called ctx.

In any event, from the perspective of the configuration, the only thing you really need to know about sp is that it is always the first argument to any VRT function.

  2. What does VRT_r_req_url do? Why is it VRT_ prefixed and what is the r (I see there are VRT_l_ functions too). What is this struct it gets data from?

VRT stands for VCL RunTime. It is a set of functions that are implemented inside the Varnish binary itself. The function signatures and some opaque structures are exposed to VCL through a header file. The VCL compiler uses this header file along with the output of the C code it generates from your VCL to create a shared object that is loadable into Varnish. In addition, there is a TCL script (it's Python in Varnish 4) that associates different VCL built-ins and variables with VRT functions.

The r and l stand for right and left and this has to do with where a variable is evaluated in an expression. Because VCL doesn't allow any kind of "complex" expressions (like addition or subtraction; it's certainly nowhere close to Turing complete unless you set max_restarts to an unbounded value), there are really only two places variables can be accessed: on the right-hand side, or the left-hand side. For instance:

set req.url = req.url + "/"

will compile to

VRT_l_req_url(sp, VRT_r_req_url(sp), "/", vrt_magic_string_end);

The access to req.url on the left-hand side causes the compiler to call VRT_l_req_url, and the access on the right-hand side causes it to use VRT_r_req_url.

An easier way to think about it might be l means "set" and r means "get" (or "read", if you prefer). But it really means left and right.
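To make the get/set reading concrete, here is a minimal standalone sketch. It is not Varnish source: struct fake_sess, fake_r_req_url, fake_l_req_url and append_slash are hypothetical stand-ins, and the real VRT_l_req_url is variadic and works on the opaque struct sess rather than on a fixed buffer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the session; not Varnish's struct sess. */
struct fake_sess {
    char url[256];
};

/* "r" accessor: used when req.url appears on the right-hand side (a read). */
static const char *
fake_r_req_url(const struct fake_sess *sp)
{
    return sp->url;
}

/* "l" accessor: used when req.url appears on the left-hand side (a write). */
static void
fake_l_req_url(struct fake_sess *sp, const char *value)
{
    snprintf(sp->url, sizeof sp->url, "%s", value);
}

/* Rough equivalent of: set req.url = req.url + "/" */
static void
append_slash(struct fake_sess *sp)
{
    char tmp[256];

    /* Build the new value in a temporary so source and target never overlap. */
    snprintf(tmp, sizeof tmp, "%s/", fake_r_req_url(sp));
    fake_l_req_url(sp, tmp);
}
```

The point is only the shape: the read goes through one function, the write through another, and the caller never touches the stored string directly.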

To tie this into your code snippet:

strtolower((char *)VRT_r_req_url(sp));

VRT_r_req_url returns a const char * representing the value of req.url. This pointer is being cast to char * to remove the const qualifier. (This cast is a bug in your configuration.) The cast pointer is sent to strtolower, which then lowercases the string.

This is buggy for a few reasons. VRT_r_req_url gave you a const char * back, so you really aren't supposed to modify it. I don't think this will break anything, but it is a violation of the API contract you are given. Furthermore, the supported way to write to req.url is through the VRT_l_req_url interface, not by modifying the string in place inside your strtolower implementation. Therefore, the correct way to do this would be either to use std.tolower from the std vmod, or to make a copy of the URL in the session workspace, modify that copy, and then store it back with VRT_l_req_url.

As an aside, the strtolower implementation does not need the if (isupper(*c)) check. This check only serves to confuse the processor's branch predictor. tolower(3) in basically every implementation uses a branchless lookup table, and characters (like numbers) without a lowercase equivalent will not be converted.
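As a small illustration of that aside, here is a sketch of the loop with the isupper check dropped. The name lower_in_place is made up for this example, and the unsigned char cast is a defensive habit of mine rather than something taken from your configuration.

```c
#include <ctype.h>

/* Lowercase a NUL-terminated string in place, with no isupper() guard. */
static void
lower_in_place(char *s)
{
    for (; *s != '\0'; s++) {
        /* tolower() returns characters without a lowercase form unchanged. */
        *s = (char)tolower((unsigned char)*s);
    }
}
```

Digits, slashes and already-lowercase letters pass through untouched, so the guard adds nothing.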

  3. Are all of these VRT functions parallels to get variables equivalent to say req.url outside of a C block?

Yes. All VRT functions implement either function calls or variable lookups. But I think you mean "inside of a C block".

  4. Is there documentation somewhere that says what all of these do? For example I've seen this one a few times as well:

sub detectmobile {
  C{
    VRT_SetHdr(sp, HDR_BEREQ, "\020X-Varnish-TeraWurfl:", "no1", vrt_magic_string_end);
  }C
}

So what is HDR_BEREQ and vrt_magic_string_end here?

There is some documentation, but a fair bit of it requires source diving. If you can say what version of Varnish you're using, I can point you to some files that might be helpful for understanding what's going on.

HDR_BEREQ tells VRT_SetHdr to use a particular workspace that contains the request that will be sent to the backend.

vrt_magic_string_end is a sentinel. Basically all of the functions that can take a string argument can also take a bunch of strings concatenated together. Varnish solves this problem by using varargs for these functions, passing multiple char * arguments to the function. Typically, if you have a function with a variable number of arguments that are all pointers, you'd just use a NULL pointer to signify that no more arguments are available. However, it is perfectly valid for a NULL value to be passed in to many of these functions. vrt_magic_string_end is a constant pointer value that cannot be confused for any other pointer, and therefore is a safe method for determining when no more arguments were passed to the function.

Consider a log call like:

log req.url + " " + req.http.Wookies + "ha!"

This call would be converted to:

VRT_log(sp, VRT_r_req_url(sp), " ", VRT_GetHdr(sp, HDR_REQ, "\10Wookies:"), "ha!", vrt_magic_string_end);

If we did not use vrt_magic_string_end and instead relied on NULL, then whenever req.http.Wookies was unset (and VRT_GetHdr returned NULL), the function would stop at that argument and never figure out that "ha!" also needed printing.
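The sentinel idea can be sketched in plain C outside of Varnish. Everything here is hypothetical: magic_string_end and concat_strings are names invented for the example, and skipping NULL arguments is a choice made for this sketch, not a claim about what Varnish does with them.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sentinel: its address is unique, so no real string equals it. */
static const char sentinel_storage = 0;
static const char *const magic_string_end = &sentinel_storage;

/*
 * Concatenate string arguments into out until the sentinel is seen.
 * NULL arguments are legal and are skipped instead of ending the list.
 * Returns the length written (excluding the NUL), truncating if needed.
 */
static size_t
concat_strings(char *out, size_t outlen, const char *first, ...)
{
    va_list ap;
    const char *p;
    size_t used = 0;

    if (outlen == 0)
        return 0;
    out[0] = '\0';
    va_start(ap, first);
    for (p = first; p != magic_string_end; p = va_arg(ap, const char *)) {
        size_t n;

        if (p == NULL)
            continue;               /* e.g. a header that is not set */
        n = strlen(p);
        if (n > outlen - 1 - used)
            n = outlen - 1 - used;  /* truncate rather than overflow */
        memcpy(out + used, p, n);
        used += n;
        out[used] = '\0';
    }
    va_end(ap);
    return used;
}
```

A call such as concat_strings(buf, sizeof buf, "/url", " ", (const char *)NULL, "ha!", magic_string_end) still reaches "ha!", which is exactly what a NULL terminator could not guarantee.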

Anyway, there's a lot of response here. I hope it's useful; please feel free to ask questions if you have more.

Edit: Follow-up Questions

  1. So are all operations outside a C block actually just calling the C functions under the covers, and thus are all the functions and variables in VCL matched by a VRT function?

Yes, effectively. From a technical perspective, VCL doesn't really have variables (or arguably functions either). It's not really a programming language in a strict sense. It's simply a language for tweaking the Varnish HTTP state machine.

  2. In VRT_SetHdr why do you specify a workspace but in VRT_r_req_url you don't? As in do I run VRT_r_bereq_url to get a backend url or do I need to call it with a workspace as well to get that, something like VRT_r_req_url(sp, BEREQ) (or is this just not a valid operation because you never look up a backend URL)?
  3. How do I know when I need to pass a workspace or not and what they all are (i.e. HDR_BEREQ is obviously back end request headers, but what other workspaces are there)?

The answers to these are related, so I'll answer them both in one place.

This is because the place to resolve req.url from is embedded in the function name, and this is due to some general weirdness in how the VCL compiler does things. In HTTP, the URL isn't really part of headers, but Varnish sort of treats it like it is. Similarly, things like a beresp.ttl or req.hash_always_miss are not headers. When the bits we're looking at aren't headers, we need to implement them specially.

Indeed, finding where req.url is implemented is hard because of some rather unfortunate macro use without any comments. You're interested in cache_vrt_var.c:64-95.

Anyway, headers are dynamic, and you don't know where they'll be (if they exist at all) until you get a request. When accessing headers through any of the interfaces for various states (req.http.*, bereq.http.*, beresp.http.*, and resp.http.*), you need to resolve them for that specific state. To reduce code duplication, any header read or set via these methods goes through VRT_GetHdr or VRT_SetHdr, respectively. Because these functions are shared for all VCL states, you pass a hint to them to tell them whether you're talking about req, bereq, beresp, or resp headers. So as you can probably imagine, you have HDR_REQ, HDR_BEREQ, HDR_BERESP, and HDR_RESP.
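Here is a rough sketch of why one shared getter and setter plus a hint is enough. It is a toy model, not Varnish's implementation: the demo_ names and the DEMO_HDR_ values are invented for the example, headers are stored as plain name/value pointers, and names are compared case-sensitively for brevity even though real HTTP header names are case-insensitive.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the HDR_* hints; values are illustrative only. */
enum demo_hdr_where {
    DEMO_HDR_REQ, DEMO_HDR_BEREQ, DEMO_HDR_BERESP, DEMO_HDR_RESP,
    DEMO_HDR_COUNT
};

#define DEMO_MAX_HDRS 8

struct demo_hdr  { const char *name; const char *value; };
struct demo_http { struct demo_hdr hdr[DEMO_MAX_HDRS]; size_t n; };

/* One header set per state, selected by the hint. */
struct demo_sess { struct demo_http http[DEMO_HDR_COUNT]; };

/* Shared setter: the hint picks which header set is modified. */
static void
demo_set_hdr(struct demo_sess *sp, enum demo_hdr_where where,
    const char *name, const char *value)
{
    struct demo_http *h = &sp->http[where];
    size_t i;

    for (i = 0; i < h->n; i++) {
        if (strcmp(h->hdr[i].name, name) == 0) {
            h->hdr[i].value = value;    /* overwrite an existing header */
            return;
        }
    }
    if (h->n < DEMO_MAX_HDRS) {
        h->hdr[h->n].name = name;
        h->hdr[h->n].value = value;
        h->n++;
    }
}

/* Shared getter: returns NULL when the header is absent in that state. */
static const char *
demo_get_hdr(const struct demo_sess *sp, enum demo_hdr_where where,
    const char *name)
{
    const struct demo_http *h = &sp->http[where];
    size_t i;

    for (i = 0; i < h->n; i++) {
        if (strcmp(h->hdr[i].name, name) == 0)
            return h->hdr[i].value;
    }
    return NULL;
}
```

Setting a header with DEMO_HDR_BEREQ leaves the DEMO_HDR_REQ set untouched, which is the behaviour the hint exists to provide.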

  4. For the sake of learning (ignoring that there is a vmod for this) would you mind updating your post to show the best way to implement the strtolower function, avoiding modifying a const via a dodgy cast and passing an incorrect type to the tolower function?

Honestly, you can't really do it safely because the VCL compiler is given an opaque type for struct sess. Without making a VMOD, the best you can do is:

#include <ctype.h>
static void
strtolower(char *c)
{
  /* Lowercase the string in place, one character at a time. */
  while (*c != '\0') {
    *c = tolower(*c);
    c++;
  }
}

If you compile with C99 support, you could possibly do this:

C{
  #include <ctype.h>
  #include <string.h>
  /* Copy c into obuf in lowercase; obuf must hold strlen(c) + 1 bytes. */
  static void
  strtolower(const char *c, char *obuf)
  {
    while (*c != '\0') {
      *obuf++ = tolower(*c++);
    }
    *obuf = '\0';
  }
}C

...

if (req.url ~ "[A-Z]") {
  C{
    const char *url = VRT_r_req_url(sp);
    size_t urllen = strlen(url) + 1;
    char obuf[urllen];  /* C99 variable-length array on the stack */

    strtolower(url, obuf);
    VRT_l_req_url(sp, obuf, vrt_magic_string_end);
  }C
}

Honestly, this implementation isn't great either. You risk blowing out the stack doing this when you get a long URL, and you don't want to malloc inside of VCL. The actual strtolower implementation doesn't do any bounds checking; it just requires you to have a buffer large enough to hold the string. These are all solvable problems, but I really don't want to spend a ton of time on it precisely because it's the wrong way to do it. This is the exact reason why VMODs were created.
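For what it's worth, the missing bounds check is the easy part to fix. A sketch, assuming a caller-supplied buffer and a made-up name (strtolower_bounded); it does nothing about the stack-size concern, which depends on where the buffer comes from.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Lowercase src into dst, writing at most dstlen bytes including the NUL.
 * Returns 0 on success, or -1 if dst is too small (dst is then left empty
 * when dstlen > 0).
 */
static int
strtolower_bounded(const char *src, char *dst, size_t dstlen)
{
    size_t i;

    for (i = 0; src[i] != '\0'; i++) {
        if (i + 1 >= dstlen) {      /* no room for this char plus the NUL */
            if (dstlen > 0)
                dst[0] = '\0';
            return -1;
        }
        dst[i] = (char)tolower((unsigned char)src[i]);
    }
    if (dstlen == 0)
        return -1;
    dst[i] = '\0';
    return 0;
}
```

Failing outright instead of truncating is deliberate here: a silently shortened URL would be a different URL.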

You can see the standard strtoupper/strtolower implementation is significantly different: it reserves space from the workspace, copies to the workspace buffer, and then releases the space it didn't use.
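The reserve, copy, release pattern looks roughly like this. This is a toy bump allocator written for illustration, not Varnish's workspace API: struct demo_ws and the demo_ws_ functions are hypothetical, and the real vmod code differs in its details.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical bump allocator standing in for a per-request workspace. */
struct demo_ws {
    char   *base;   /* start of the arena */
    size_t  size;   /* total bytes in the arena */
    size_t  used;   /* bytes already handed out */
};

/* Reserve everything that is left; returns the number of bytes available. */
static size_t
demo_ws_reserve(struct demo_ws *ws, char **front)
{
    *front = ws->base + ws->used;
    return ws->size - ws->used;
}

/* Commit only the bytes actually used; the remainder stays free. */
static void
demo_ws_release(struct demo_ws *ws, size_t bytes)
{
    ws->used += bytes;
}

/* Copy s into the workspace in lowercase; returns NULL if it does not fit. */
static const char *
demo_ws_tolower(struct demo_ws *ws, const char *s)
{
    char *dst;
    size_t avail = demo_ws_reserve(ws, &dst);
    size_t i;

    for (i = 0; s[i] != '\0'; i++) {
        if (i + 1 >= avail) {
            demo_ws_release(ws, 0);     /* give the whole reservation back */
            return NULL;
        }
        dst[i] = (char)tolower((unsigned char)s[i]);
    }
    if (avail == 0) {
        demo_ws_release(ws, 0);
        return NULL;
    }
    dst[i] = '\0';
    demo_ws_release(ws, i + 1);         /* keep only string plus NUL */
    return dst;
}
```

Because the memory comes from an arena that lives as long as the request, there is no large stack array and no malloc, and an oversized URL fails cleanly instead of overflowing anything.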

(P.S. I got rid of the undefined behavior comments because I realized that the tolower(3) manpage I was quoting meant that the input must be representable in an unsigned char. This is because tolower(3) takes an integer argument; the value you pass could fall out of range. So that was bad information, and I've retracted that.)