Multi-tenancy with SQLAlchemy

Posted 2020-02-23 04:31

Question:

I've got a web application built with Pyramid/SQLAlchemy/PostgreSQL that allows users to manage some data, and that data is almost completely independent for different users. Say, Alice visits alice.domain.com and is able to upload pictures and documents, and Bob visits bob.domain.com and is also able to upload pictures and documents. Alice never sees anything created by Bob and vice versa (this is a simplified example; in reality there may be a lot of data in multiple tables, but the idea is the same).

Now, the most straightforward way to organize the data in the DB backend is to use a single database, where each table (pictures and documents) has a user_id field. To get all of Alice's pictures, I can do something like

user_id = _figure_out_user_id_from_domain_name(request)
pictures = session.query(Picture).filter(Picture.user_id==user_id).all()

This is all easy and simple; however, there are some disadvantages:

  • I need to remember to always add the extra filter condition when making queries, otherwise Alice may see Bob's pictures;
  • If there are many users, the tables may grow huge;
  • It may be tricky to split the web application between multiple machines.

So I'm thinking it would be really nice to somehow split the data per-user. I can think of two approaches:

  1. Have separate tables for Alice's and Bob's pictures and documents within the same database (PostgreSQL schemas seem to be the right tool for this):

    documents_alice
    documents_bob
    pictures_alice
    pictures_bob
    

    and then, using some dark magic, "route" all queries to one or to the other table according to the current request's domain:

    _use_dark_magic_to_configure_sqlalchemy('alice.domain.com')
    pictures = session.query(Picture).all()  # selects all Alice's pictures from "pictures_alice" table
    ...
    _use_dark_magic_to_configure_sqlalchemy('bob.domain.com')
    pictures = session.query(Picture).all()  # selects all Bob's pictures from "pictures_bob" table
    
  2. Use a separate database for each user:

    - database_alice
       - pictures
       - documents
    - database_bob
       - pictures
       - documents 
    

    which seems like the cleanest solution, but I'm not sure if multiple database connections would require much more RAM and other resources, limiting the number of possible "tenants".

So, the question is, does it all make sense? If yes, how do I configure SQLAlchemy to either modify the table names dynamically on each HTTP request (for option 1) or to maintain a pool of connections to different databases and use the correct connection for each request (for option 2)?
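
For option 2, the "pool of connections to different databases" could be sketched as a small registry that lazily creates one engine per tenant and evicts the least recently used one once a limit is reached. This is only a sketch under stated assumptions: `TenantEngineRegistry`, `engine_factory`, `url_template`, and `max_engines` are hypothetical names, and in a real application `engine_factory` would presumably be `sqlalchemy.create_engine`. Since every engine keeps its own connection pool, capping the number of live engines is one way to bound the resources used per process.

```python
from collections import OrderedDict
from threading import Lock


class TenantEngineRegistry:
    """Lazily creates and caches one engine per tenant (LRU eviction)."""

    def __init__(self, engine_factory, url_template, max_engines=50):
        # engine_factory: callable taking a database URL and returning an
        # engine-like object (assumption: sqlalchemy.create_engine in practice).
        self._engine_factory = engine_factory
        self._url_template = url_template
        self._max_engines = max_engines
        self._engines = OrderedDict()
        self._lock = Lock()

    def get(self, tenant):
        with self._lock:
            engine = self._engines.get(tenant)
            if engine is not None:
                # Mark this tenant as most recently used.
                self._engines.move_to_end(tenant)
                return engine
            engine = self._engine_factory(self._url_template.format(tenant=tenant))
            self._engines[tenant] = engine
            if len(self._engines) > self._max_engines:
                # Drop the least recently used engine and release its pool
                # if the object offers a dispose() method.
                _, evicted = self._engines.popitem(last=False)
                dispose = getattr(evicted, "dispose", None)
                if dispose is not None:
                    dispose()
            return engine
```

A request handler would then call something like `registry.get("alice")` and bind its session to the returned engine. The tenant name is substituted into the URL as-is here, so it has to be validated before it reaches the registry.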

Answer 1:

After pondering jd's answer, I was able to achieve the same result with PostgreSQL 9.2, SQLAlchemy 0.8, and Flask 0.9:

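from flask import session  # assumed: Flask's request-bound session holding tenant_id
from sqlalchemy import event
from sqlalchemy.pool import Pool

@event.listens_for(Pool, 'checkout')
def on_pool_checkout(dbapi_conn, connection_rec, connection_proxy):
    # Called every time a connection is checked out of the pool.
    tenant_id = session.get('tenant_id')
    cursor = dbapi_conn.cursor()
    if tenant_id is None:
        # No tenant known: fall back to the public and shared schemas.
        cursor.execute("SET search_path TO public, shared;")
    else:
        # tenant_id is interpolated directly, so it must come from a trusted source.
        cursor.execute("SET search_path TO t" + str(tenant_id) + ", shared;")
    dbapi_conn.commit()
    cursor.close()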
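
Because the tenant id ends up inside a SQL string, it may be worth building the statement in one place and refusing anything that is not a number. The following is a minimal sketch, assuming tenant ids are integers and schemas are named `t<id>` as in the listener above; `search_path_statement` and `shared_schemas` are hypothetical names, not part of SQLAlchemy or Flask.

```python
def search_path_statement(tenant_id, shared_schemas=("shared",)):
    """Build the SET search_path statement for a numeric tenant id."""
    tail = ", ".join(shared_schemas)
    if tenant_id is None:
        # No tenant: same fallback as in the listener above.
        return "SET search_path TO public, " + tail
    # int() raises ValueError for anything that is not a plain integer,
    # so arbitrary text can never reach the SQL string.
    return "SET search_path TO t%d, %s" % (int(tenant_id), tail)
```

The checkout listener could then run `cursor.execute(search_path_statement(tenant_id))` instead of concatenating strings inline.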


Answer 2:

What works very well for me is to set the search path at the connection pool level, rather than in the session. This example uses Flask and its thread-local proxies to pass the schema name, so if you use something else you'll have to change schema = current_schema._get_current_object() and the try block around it.

from sqlalchemy.interfaces import PoolListener
class SearchPathSetter(PoolListener):
    '''
    Dynamically sets the search path on connections checked out from a pool.
    '''
    def __init__(self, search_path_tail='shared, public'):
        self.search_path_tail = search_path_tail

    @staticmethod
    def quote_schema(dialect, schema):
        return dialect.identifier_preparer.quote_schema(schema, False)

    def checkout(self, dbapi_con, con_record, con_proxy):
        try:
            schema = current_schema._get_current_object()
        except RuntimeError:
            search_path = self.search_path_tail
        else:
            if schema:
                search_path = self.quote_schema(con_proxy._pool._dialect, schema) + ', ' + self.search_path_tail
            else:
                search_path = self.search_path_tail
        cursor = dbapi_con.cursor()
        cursor.execute("SET search_path TO %s;" % search_path)
        dbapi_con.commit()
        cursor.close()

At engine creation time:

engine = create_engine(dsn, listeners=[SearchPathSetter()])
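
Outside Flask, the role of the `current_schema` proxy could be played by a `contextvars.ContextVar`. The sketch below is an assumption-laden illustration, not the answer's code: `use_schema`, `quote_ident`, and `current_search_path` are hypothetical helpers, and the quoting is done by hand (double quotes, with embedded double quotes doubled) instead of through the dialect's identifier preparer.

```python
import contextvars
from contextlib import contextmanager

# Holds the schema of the tenant being served; None means "no tenant".
_current_schema = contextvars.ContextVar("current_schema", default=None)


@contextmanager
def use_schema(schema):
    """Set the current schema for the duration of a block."""
    token = _current_schema.set(schema)
    try:
        yield
    finally:
        # Restore whatever value was active before entering the block.
        _current_schema.reset(token)


def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def current_search_path(tail="shared, public"):
    """Return the search path for the current context."""
    schema = _current_schema.get()
    if not schema:
        return tail
    return quote_ident(schema) + ", " + tail
```

A checkout listener would call `current_search_path()` where the class above computes `search_path`, and request-handling code would wrap its work in `with use_schema("alice"):`. Note that quoted identifiers are case-sensitive in PostgreSQL, so the schema name has to match the stored name exactly.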


Answer 3:

OK, I ended up modifying search_path at the beginning of every request, using Pyramid's NewRequest event:

from pyramid import events

def on_new_request(event):
    # Point the session's connection at the tenant's schema for this request.
    schema_name = _figure_out_schema_name_from_request(event.request)
    DBSession.execute("SET search_path TO %s" % schema_name)


def app(global_config, **settings):
    """ This function returns a WSGI application.

    It is usually called by the PasteDeploy framework during
    ``paster serve``.
    """

    ....

    config.add_subscriber(on_new_request, events.NewRequest)
    return config.make_wsgi_app()

This works really well, as long as you leave transaction management to Pyramid (i.e. do not commit or roll back transactions manually, and let Pyramid do that at the end of the request). That is fine, since committing transactions manually is not a good approach anyway.
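
The helper that maps a request to a schema name is not shown above. One possible shape for it, as a sketch only, is a function that takes the Host header value (for example `request.host` in Pyramid), strips the port, and accepts a single well-formed subdomain label. The name `schema_name_from_host`, the `base_domain` default, and the label pattern are all assumptions made for illustration.

```python
import re

# Conservative pattern: lowercase letter first, then letters, digits or "_".
# 63 characters is PostgreSQL's default identifier length limit.
_LABEL_RE = re.compile(r"[a-z][a-z0-9_]{0,62}")


def schema_name_from_host(host, base_domain="domain.com"):
    """Return the tenant schema for a Host header value, or None."""
    # Drop an optional port and normalise case and a trailing dot.
    hostname = host.split(":", 1)[0].strip().lower().rstrip(".")
    suffix = "." + base_domain
    if not hostname.endswith(suffix):
        return None
    label = hostname[: -len(suffix)]
    # Reject nested subdomains and anything that is not a plain identifier,
    # since the result is interpolated into SET search_path.
    if not _LABEL_RE.fullmatch(label):
        return None
    return label
```

When this returns None, the subscriber could fall back to a default schema or reject the request rather than executing SET search_path with an unchecked value.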