Python redis and celery too many clients, different errors on every execution

Posted 2019-07-28 21:00

I am currently working on an app which has to process several long-running tasks. I am using Python 3, Flask, Celery, and Redis.

I have a working solution on localhost, but on Heroku there are many errors, and every execution of the app triggers a different set of errors. I know it can't be random, so I am trying to figure out where to start looking.

I have a feeling something must be wrong with Redis, and I am trying to understand what clients are and where they come from, but I am not able to find official documentation or an explanation on this topic.

Question:

If the Redis server is started (even on localhost), many clients are already connected, even though I haven't done anything yet. On Heroku (I am using heroku-redis) I always have 6 clients; on localhost, 11 clients.

I have done some research and I am able to display them with:

import os
import redis

if 'DYNO' in os.environ:
    # on Heroku: connect to the Heroku Redis add-on
    redis_db = redis.StrictRedis(host='HOST', port=15249, password='REDISDBPW')
else:
    # locally: connect to the default localhost instance
    redis_db = redis.StrictRedis()

# see what keys are in Redis
all_keys = redis_db.keys()
print(all_keys)

# list every client currently connected to the server
all_clients = redis_db.client_list()
print(all_clients)

I see all these clients but the information there doesn't help me at all. What are they? Why are they there? Where are they coming from?

All the Heroku Redis add-ons have a client limit, so I need to understand and optimize this. At first I thought the number of clients would equal the number of tasks, but that's not it.
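
To at least see which connections come from my own code, I can give the connection I open an explicit name (just a sketch on top of the snippet above; the label 'stats-app' is arbitrary):

import redis

# Same local connection as above, but with an explicit client name so it
# can be recognized in the CLIENT LIST output.
redis_db = redis.StrictRedis()
redis_db.client_setname('stats-app')

for client in redis_db.client_list():
    # Connections that never set a name (e.g. the ones Celery/kombu opens
    # internally) show up with an empty 'name' field.
    print(client['name'], client['addr'], client['cmd'], client['idle'])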

In total I have 12 tasks defined, but I am testing now with 2 tasks (both finish in less than 30 sec.).

When I execute the tasks on localhost, the number of clients increases from 11 to 16. If I execute them once again, it goes from 16 to 18, and after that it always stays at 18, no matter how often I execute the tasks.

So what is going on here? I have 2 tasks, so why do the clients increase from 11 to 16 and then from 16 to 18? And why are they not closed after the tasks are finished?

I have been struggling with this issue for a few days now (even though it always works perfectly on localhost), so any help or ideas are welcome. I need to start looking somewhere, so currently I am trying to understand the clients.

EDIT:

I installed Flower and tried to monitor the 2 tasks on localhost; everything looks good. It processes the two tasks and both succeed in a few seconds. The return value is correct (but then it always worked great on localhost).

Still, the issue is that after I started Flower the number of clients jumped to 30. I still have no clue: what are these clients? With the number of clients I generate, I would need a $100 add-on just to process two tasks that take a few seconds to finish. That can't be right, so I still think something is wrong with Redis, even on localhost.

My Redis setup is pretty simple:

if 'DYNO' in os.environ:
    app.config['CELERY_BROKER_URL'] = 'redis://[the full URL from the redis add-on]'
    app.config['CELERY_RESULT_BACKEND'] = 'redis://[the full URL from the redis add-on]'
else:
    app.config['CELERY_BROKER_URL'] = 'redis://localhost:6379/0'
    app.config['CELERY_RESULT_BACKEND'] = 'redis://localhost'

celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'], backend=app.config['CELERY_RESULT_BACKEND'])

Here is an example of a task:

@celery.task(bind=True)
def get_users_deregistrations_task(self, g_start_date, g_end_date):

    # parse the date range passed in as 'dd-mm-YYYY' strings
    start_date = datetime.strptime(g_start_date, '%d-%m-%Y')
    end_date = datetime.strptime(g_end_date, '%d-%m-%Y')

    # sum the transaction amounts from the three sources within the date range
    a1 = db_session.query(func.sum(UsersTransactionsVK.amount)).filter(UsersTransactionsVK.date_added >= start_date, UsersTransactionsVK.date_added <= end_date, UsersTransactionsVK.payed == 'Yes').scalar()
    a2 = db_session.query(func.sum(UsersTransactionsStripe.amount)).filter(UsersTransactionsStripe.date_added >= start_date, UsersTransactionsStripe.date_added <= end_date, UsersTransactionsStripe.payed == 'Yes').scalar()
    a3 = db_session.query(func.sum(UsersTransactions.amount)).filter(UsersTransactions.date_added >= start_date, UsersTransactions.date_added <= end_date, UsersTransactions.on_hold == 'No').scalar()

    # treat missing sums as 0
    if a1 is None:
        a1 = 0

    if a2 is None:
        a2 = 0

    if a3 is None:
        a3 = 0

    amount = a1 + a2 + a3

    return {'some_value': amount}

# Selects user deregistrations between selected dates
@app.route('/get-users-deregistration', methods=["POST"])
@basic_auth.required
@check_verified
def get_users_deregistrations():
    if request.method == "POST":

        # init task
        task = get_users_deregistrations_task.apply_async([session['g_start_date'], session['g_end_date']])
        return json.dumps({}), 202, {'Location': url_for('taskstatus_get_users_deregistrations', task_id=task.id)}

@app.route('/status/<task_id>')
def taskstatus_get_users_deregistrations(task_id):
    task = get_users_deregistrations_task.AsyncResult(task_id)
    if task.state == 'PENDING':
        response = {
            'state': task.state,
            'current': 0,
            'total': 1,
            'status': 'Pending...'
        }
    elif task.state != 'FAILURE':
        response = {
            'state': task.state,
            'current': task.info['current'],
            'total': task.info['total'],
            'status': 'Finished',
            'statistic': task.info['statistic'],
            'final_dataset': task.info     
        }
        if 'result' in task.info:
            response['result'] = task.info['result']
    else:
        print ('in else')
        # something went wrong in the background job
        response = {
            'state': task.state,
            'current': 1,
            'total': 1,
            'status': str(task.info),  # this is the exception raised
        }
    return json.dumps(response)

EDIT:

Here is my Procfile for Heroku:

web: gunicorn stats_main:app
worker: celery worker -A stats_main.celery --loglevel=info

EDIT

I am thinking the issue might be the connection pool (on the Redis side), which I am not using properly.
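
A minimal sketch of what I mean (not my current code): create one shared connection pool with an explicit max_connections and hand it to every StrictRedis instance, instead of letting each call open its own connections:

import redis

# One pool for the whole process; max_connections caps how many
# connections this process can hold open at the same time.
pool = redis.ConnectionPool(host='localhost', port=6379, db=0, max_connections=10)

def get_redis():
    # Every caller shares the same pool instead of opening new connections.
    return redis.StrictRedis(connection_pool=pool)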

I have also found some configuration options for Celery and added them:

celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'], backend=app.config['CELERY_RESULT_BACKEND'], redis_max_connections=20, BROKER_TRANSPORT_OPTIONS = {
    'max_connections': 20,
}, broker_pool_limit=None)

I uploaded everything to Heroku again with these configurations. I am still testing with only 2 tasks, both of which are fast.

I executed the tasks on Heroku 10 times in a row; 7 times they worked. 3 times it looked like they finished too early: the returned result was wrong (the correct result is e.g. 30000, and those 3 times it returned 18000).

The client count quickly jumped to 20 but never went above it, so at least the max-clients error and the lost-connection-to-Redis error are resolved.

The big issue now is that the tasks can finish too early. It is very important that the returned results are correct; performance is not important at all.

EDIT

Never mind, nothing is solved; everything seems random. I added two print() calls to one of the tasks for further debugging and uploaded to Heroku. After 2 executions I again see "connection to Redis lost" and "maximum number of clients reached" (even though my Redis monitor add-on shows that the clients never went above 20).

EDIT

The high number of clients might be caused by idle clients, which for some reason are never closed (found in a Heroku blog post):

By default, Redis will never close idle connections, which means that if you don't close your Redis connections explicitly, you will lock yourself out of your instance.

To ensure this doesn't happen, Heroku Redis sets a default connection timeout of 300 seconds. This timeout doesn’t apply to non-publish/subscribe clients, and other blocking operations.

I have now added a function that kills idle clients, which runs right before EVERY one of my tasks:

import os
import redis

def kill_idle_clients():
    if 'DYNO' in os.environ:
        redis_db = redis.StrictRedis(host='HOST', port=15249, password='REDISDBPW')
    else:
        redis_db = redis.StrictRedis()

    all_clients = redis_db.client_list()
    counter = 0
    for client in all_clients:
        # kill every client that has been idle for 15 seconds or more
        if int(client['idle']) >= 15:
            redis_db.client_kill(client['addr'])
            counter += 1

    print('killed idle clients:', counter)
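
And this is how it is wired in (a sketch; some_long_running_task is a placeholder for any of my real tasks):

@celery.task(bind=True)
def some_long_running_task(self, g_start_date, g_end_date):
    # Clean up idle connections before the actual work starts.
    kill_idle_clients()
    # ... real task logic would follow here ...
    return {'some_value': 0}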

Before a task starts, it closes all clients that have been idle for more than 15 seconds. It works again on localhost (no surprise, it always worked on localhost) and I have fewer clients, but on Heroku it now worked only 2 times out of 10; the other 8 times the tasks finished too early again. Maybe the idle clients were not really idle, I have no clue.

It's also almost impossible to test, as every execution of the tasks has a different outcome (loses connection to Redis, reaches the client limit, finishes too early, works perfectly).

EDIT

It seems the Celery settings were being ignored all along. I was suspicious about this the whole time and decided to test it by adding some random arguments and changing values to nonsense. I restarted the Celery worker, of course.

I expected to see some errors, but it works as if nothing happened.

Everything works like before with these nonsense configurations:

celery = Celery(app.name, broker=app.config['REDIS_URL'], backend=app.config['REDIS_URL'], redis_max_connections='pups', BROKER_TRANSPORT_OPTIONS = {
    'max_connections': 20,
}, broker_pool_limit=None, broker_connection_timeout='pups', pups="pups")
celery.conf.broker_transport_options = {'visibility_timeout': 'pups'}

EDIT

I changed the way I load the configuration for Celery (from a separate config file). It seems to work now, but the issues remain the same.

celery_task = Celery(broker=app.config['REDIS_URL'], backend=app.config['REDIS_URL'])
celery_task.config_from_object('celeryconfig')
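
To make sure the settings are really picked up this time (the inline keyword arguments earlier were silently ignored), a quick sanity check I can run after creating the app (a sketch, assuming Celery 4's lowercase setting names):

# If config_from_object() worked, these should print the values
# from celeryconfig.py instead of the defaults.
print(celery_task.conf.broker_pool_limit)
print(celery_task.conf.broker_transport_options)
print(celery_task.conf.worker_max_tasks_per_child)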

EDIT

With these configurations I managed to cap the number of clients on localhost at 18 for all tasks (I tried all 12 tasks). However, on Heroku it only "somehow" works. There are fewer clients, but the count reached 20 once, even though I thought I could not exceed 18. (I tested on Heroku with 4 tasks.)

Testing on Heroku with all 12 tasks triggers many different SQL errors. I am now more confused than before. It seems the same task is executed multiple times, but I only see 12 task URLs.

I think that because the SQL errors are, for example:

sqlalchemy.exc.InternalError: (pymysql.err.InternalError) Packet sequence number wrong - got 117 expected 1

or

sqlalchemy.exc.InterfaceError: (pymysql.err.InterfaceError) (0, '')

or

Multiple rows were found for one()

I tested a few times on Heroku with 4 tasks, and there were times the task results were returned, but the results were very strange.

This time the tasks did not finish too early but returned inflated values; it looked like task A returned its value twice and the results were summed.

Example: task A must return 10k, but it returned 20k, so the task was executed twice and the results were summed.

Here are my current configurations. I still don't understand the math 100%, but I think (for the number of clients) it is:

max-concurrency * CELERYD_MAX_TASKS_PER_CHILD

On localhost I found a new CLI command to inspect the worker stats, and I had max-concurrency=3 and CELERYD_MAX_TASKS_PER_CHILD=6, which gives 3 * 6 = 18 and matches the cap of 18 clients I saw on localhost.

CLI command:

celery -A stats_main.celery_task inspect stats

My current configurations:

worker start:

celery worker -A stats_main.celery_task --loglevel=info --autoscale=10,3

config:

CELERY_REDIS_MAX_CONNECTIONS=20
BROKER_POOL_LIMIT=None
CELERYD_WORKER_LOST_WAIT=20
CELERYD_MAX_TASKS_PER_CHILD=6
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 18000} # 5 hours
CELERY_RESULT_DB_SHORT_LIVED_SESSIONS = True  # helps with intermittent errors like (OperationalError) (2006, 'MySQL server has gone away')
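
Since one of my problems was that settings were silently ignored, it may be worth noting (my own addition, not something I have verified on Heroku) that Celery 4 prefers lowercase setting names; the same configuration would look roughly like this:

# celeryconfig.py, Celery 4 style (rough equivalents of the settings above)
redis_max_connections = 20
broker_pool_limit = None
worker_lost_wait = 20
worker_max_tasks_per_child = 6
broker_transport_options = {'visibility_timeout': 18000}  # 5 hours
database_short_lived_sessions = True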

EDIT

Seeing all these SQL errors, I decided to research in a completely different direction. My new theory is that it could be a MySQL issue.

I adjusted my connection to the MySQL server as described in the answer to this question.

I also found out that pymysql has threadsafety=1. I don't know yet whether this could be an issue, but it seems MySQL has something to do with connections and connection pools.
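
One direction I want to look into (my own assumption, not verified): Celery's prefork workers could be sharing the SQLAlchemy engine that was created in the parent process, and MySQL connections shared across forks are known to produce exactly this kind of garbled-protocol error. A sketch of resetting the engine in every worker process, assuming a hypothetical db module that exposes the engine:

from celery.signals import worker_process_init

from db import engine  # hypothetical module holding the SQLAlchemy engine


@worker_process_init.connect
def reset_db_connections(**kwargs):
    # Throw away connections inherited from the parent process; each
    # worker process then lazily opens its own fresh connections.
    engine.dispose()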

At the moment I can also say that memory cannot be the issue, because if the packets were too big it wouldn't work on localhost either; I left max_allowed_packet at its default value, which is around 4 MB.

I have also created 3 dummy tasks, which do some simple calculations without connecting to the external MySQL DB. I have now executed them 5 times on Heroku and there were no errors; the results were always correct. So I assume the issue is not Celery or Redis but MySQL, even though I have no clue why it would then work on localhost. Maybe it is a combination of all 3 that leads to the issues on Heroku.
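
The dummy tasks look roughly like this (a sketch; the calculation itself does not matter, only that no MySQL connection is opened):

@celery.task(bind=True)
def dummy_task(self, g_start_date, g_end_date):
    # Pure CPU work, no database access, to isolate Celery/Redis from MySQL.
    amount = sum(range(100000))
    return {'some_value': amount}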

EDIT

I adjusted my JS file. Now every task is called one after another, which means they are no longer async (I still use Celery's apply_async because apply did not work).

So it is a crude workaround. I simply created a variable for each task, e.g. var task_1_rdy = false;

I also created a function which runs every 2 seconds and checks whether a task is ready; if it is, it starts the next task. I think it is easy to understand what I did here.

I tested this on Heroku and had no errors at all, even with multiple tasks, so the issue might be solved. I need to run more tests, but it looks very promising. Of course, I am not using the async functionality, and running task after task will probably give the worst performance, but hey, it works now. I will benchmark the performance difference and update the question on Monday.
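
A cleaner way to get the same one-after-another behaviour on the server side would probably be Celery's chain primitive (just a sketch, not what I am running; the second task name is a placeholder):

from celery import chain

# .si() creates immutable signatures, so the result of one task is not
# passed as an argument to the next; the tasks simply run in sequence.
workflow = chain(
    get_users_deregistrations_task.si(g_start_date, g_end_date),
    get_users_registrations_task.si(g_start_date, g_end_date),  # placeholder task
)
result = workflow.apply_async()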

EDIT

I have done a lot of testing today. The time it takes until the tasks complete is the same (sync vs. async); I don't know why, but it is the same.

Working with all 12 tasks on Heroku and selecting a huge time range (huge time range = tasks take longer, because there is more data to process):

Again the task results are not precise; the returned values are wrong, only slightly wrong, but wrong and therefore not reliable. For example, task A must return 20k, and on Heroku it returned 19500. I don't know how it is possible that data is lost or a task returns too early, but after 2 weeks I will give up and try to use a completely different system.

2 Answers

走好不送 · 2019-07-28 21:26

Now I am 60% sure that your task is taking too long and the server cannot respond within the default web request timeout. The 70% / 30% estimate applies when you are on a local machine, where the network is very fast. On a cloud platform, latency is the problem, and sometimes it affects your program. On top of that, if a Celery worker fails, Gunicorn and Celery will automatically create another worker to finish the unfinished job, which increases the number of connections.

So the solution is:

  • Option 1: make your task finish faster

  • Option 2: return an acknowledgement first, do the calculation in the background, and make another API call to fetch the results

在下西门庆 · 2019-07-28 21:29

It sounds like you have a REST API using a Celery worker with Redis as the message queue. Here is a checklist:

1. In your client, did you close the connection after the logic finishes?

2. Celery will spawn new workers; the workers may cause trouble. Try monitoring Celery with Flower.

3. Make sure your client finishes the task. Try debugging by printing something; sometimes staging and local environments have network issues which stop the Celery task from ending.

4. If you are using Redis as the Celery message queue, try monitoring the queues; maybe they keep growing? (See the sketch below.)
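
A minimal sketch of checking the queue length (assuming the default queue, which Celery stores as a Redis list named "celery"):

import redis

redis_db = redis.StrictRedis(host='localhost', port=6379, db=0)
# Number of task messages waiting in the default Celery queue.
print('pending task messages:', redis_db.llen('celery'))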
