Django QuerySet vs Raw Query performance

2019-04-12 21:49发布

问题:

I have noticed a huge timing difference between using django connection.cursor vs using the model interface, even with small querysets. I have made the model interface as efficient as possible, with values_list so no objects are constructed and such. Below are the two functions tested, don't mind the spanish names.

def t3():
    q = "select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000"
    with connection.cursor() as c:
        c.execute(q)
        return list(c)

def t4():
    return list(EventoAgendado.objects.all().values_list('id','numerosDisponibles')[:1000])

Then using a function to time (self made with time.clock())

r1 = timeme(t3); r2 = timeme(t4)

The results are as follows: 0.00180384529631 and 0.00493390727024 for t3 and t4

And just to make sure the queries are and take the same to execute:

connection.queries[-2::]

Yields:

[
    {u'sql': u'select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000',  u'time': u'0.002'},
    {u'sql': u'SELECT `samiBackend_eventoagendado`.`id`, `samiBackend_eventoagendado`.`numerosDisponibles` FROM `samiBackend_eventoagendado` LIMIT 1000', u'time': u'0.002'}
]

As you can see, two exact queries, returning two exact lists (performing r1 == r2 returns True), takes totally different timings (difference gets bigger with a bigger query set), I know python is slow, but is django doing so much work behind the scenes to make the query that slower? Also, just to make sure, I have tried building the queryset object first (outside the timer) but results are the same, so I'm 100% sure the extra time comes from fetching and building the result structure. I have also tried using the iterator() function at the end of the query but that doesn't help neither. I know the difference is minimal, both execute blazingly fast, but this is being bencharked with apache ab, and this minimal difference, when having 1k concurrent requests, makes day and light.

By the way, I'm using django 1.7.10 with mysqlclient as the db connector.

EDIT: For the sake of comparison, the same test with a 11k result query set, the difference gets even bigger (3x slower, compared to the first one where it is around 2.6x slower)

r1 = timeme(t3); r2 = timeme(t4)
0.0149241530889
0.0437563529558

EDIT2: Another funny test, if I actually convert the queryset object to it's actual string query (with str(queryset.query)), and use it inside a raw query instead, I get the same good performance as the raw query, by the execption that using the queryset.query string sometimes gives me an actual invalid SQL query (ie, if the queryset has a filter on a date value, the date value is not escaped with '' on the string query, giving an sql error when executing it with a raw query, this is another mystery)

-- EDIT3: Going through the code, seems like the difference is made by how the result data is retrieved, for a raw query set, it simply calls iter(self.cursor) which I believe when using a C implemented connector will run all in C code (as iter is also a built in), while the ValuesListQuerySet is actually a python level for loop with a yield tuple(row) statement, which will be quite slow. I guess there's nothing to be done in this matter to have the same performance as the raw query set :'(. If anyone is interested, the slow loop is this one:

for row in self.query.get_compiler(self.db).results_iter():
    yield tuple(row)

-- EDIT 4: I have come with a very hacky code to convert a values list query set into usable data to be sent to a raw query, having the same performance as running a raw query, I guess this is very bad and will only work with mysql, but, the speed up is very nice while allowing me to keep the model api filtering and such. What do you think? Here's the code.

def querysetAsRaw(qs):
    q = qs.query.get_compiler(qs.db).as_sql()
    with connection.cursor() as c:
        c.execute(q[0], q[1])
        return c

回答1:

The answer was simple, update to django 1.8 or above, which changed some code that no longer has this issue in performance.