by Alexey Korepanov
2 min read

One of our Flask-Admin pages recently became very slow, and I had to investigate why.

The page contains a custom Flask-Admin ModelView which shows some data filtered by certain criteria. The underlying table is quite large, so that could be the reason the page is slow. First, use psql to connect to the database: psql <connection-url>. Then refresh the page and run the following query to see what is currently executing:

SELECT query FROM pg_stat_activity;

In my case it gave me the following query:

SELECT count('*') FILTER (WHERE log_user_status.user_id = 123456) AS anon_1 FROM log_user_status

Analyzing this query with EXPLAIN ANALYZE gives the following output:

explain analyze SELECT count('*') FILTER (WHERE log_user_status.user_id = 123456) AS anon_1 FROM log_user_status;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1534349.14..1534349.14 rows=1 width=8) (actual time=90392.827..90392.827 rows=1 loops=1)
   ->  Gather  (cost=1534349.03..1534349.14 rows=1 width=8) (actual time=90392.756..90403.544 rows=2 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Partial Aggregate  (cost=1533349.03..1533349.04 rows=1 width=8) (actual time=90390.780..90390.780 rows=1 loops=2)
               ->  Parallel Seq Scan on log_user_status  (cost=0.00..1509920.52 rows=23428508 width=4) (actual time=0.116..88705.311 rows=20713255 loops=2)
 Planning time: 0.080 ms
 Execution time: 90403.621 ms
(8 rows)

Execution time is more than 90 seconds because Postgres performs a sequential scan over the entire table and only then filters the counted rows by user_id: the FILTER clause restricts which rows the aggregate counts, not which rows get scanned, so without a WHERE clause Postgres still has to read every row.

This query definitely can be optimized. The proper query should look like this: SELECT count(*) FROM log_user_status WHERE user_id = 123456;. With the condition in a WHERE clause, Postgres can use an index on user_id instead of scanning the whole table.

Analyzing this optimized query gives the following result:

explain analyze select count(*) from log_user_status where user_id = 123456;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=215.93..215.94 rows=1 width=8) (actual time=0.364..0.364 rows=1 loops=1)
   ->  Index Only Scan using ix_log_user_status_user_id on log_user_status  (cost=0.11..215.28 rows=1314 width=0) (actual time=0.023..0.302 rows=963 loops=1)
         Index Cond: (user_id = 123456)
         Heap Fetches: 213
 Planning time: 0.084 ms
 Execution time: 0.395 ms
(6 rows)

Execution time dropped to 0.4 ms! Please note that instead of a parallel sequential scan we now have an index-only scan (https://wiki.postgresql.org/wiki/Index-only_scans). This type of scan is answered from the index alone, almost without touching the actual table (the 213 heap fetches are visibility checks), which gives us a huge performance boost.
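The plan uses an index named ix_log_user_status_user_id, which is exactly the name SQLAlchemy generates for a column declared with index=True. Here is a minimal sketch of what the model presumably looks like (the column types and the primary key are my assumptions):

from sqlalchemy import BigInteger, Column, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class LogUserStatus(Base):
    __tablename__ = 'log_user_status'

    id = Column(BigInteger, primary_key=True)  # assumed primary key
    # index=True makes SQLAlchemy create ix_log_user_status_user_id,
    # the index the planner picks for the index-only scan above
    user_id = Column(Integer, index=True)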

So it’s time to look at our application and figure out why the query is so inefficient.

Here is the Python code which builds the SQLAlchemy query:

query = self.session.query(func.count('*').filter(
    self.model.user_id == user.id))

Please note two problems: the query never tells SQLAlchemy which model to pull the data from, and .filter() is called on func.count('*') itself, so it renders as the aggregate FILTER clause we saw above instead of a WHERE clause. Let's fix it by moving .filter() onto the query and adding a select_from() call:

query = self.session.query(func.count('*')).select_from(self.model).filter(
    self.model.user_id == user.id)
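To see the difference without touching the database, we can compile both expressions to SQL directly. A small sketch, reusing the LogUserStatus model from above and a literal 123456 in place of self.model and user.id:

from sqlalchemy import func
from sqlalchemy.orm import Session

session = Session()  # no engine needed just to render SQL

# Broken: .filter() on the aggregate renders as a FILTER clause
broken = session.query(func.count('*').filter(
    LogUserStatus.user_id == 123456))

# Fixed: .filter() on the query renders as a WHERE clause
fixed = session.query(func.count('*')).select_from(LogUserStatus).filter(
    LogUserStatus.user_id == 123456)

for q in (broken, fixed):
    print(q.statement.compile(compile_kwargs={'literal_binds': True}))

# Prints roughly:
#   SELECT count('*') FILTER (WHERE log_user_status.user_id = 123456) AS anon_1 FROM log_user_status
#   SELECT count('*') AS count_1 FROM log_user_status WHERE log_user_status.user_id = 123456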

Now let’s confirm the resulting SQL in the running application. You can do that by adding SQLALCHEMY_ECHO = True to your Flask configuration file, which makes SQLAlchemy log every statement it emits.
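For example, in a typical Flask configuration module (the file name and database URI below are placeholders):

# config.py -- hypothetical Flask configuration module
SQLALCHEMY_DATABASE_URI = 'postgresql://user:password@localhost/mydb'  # placeholder
SQLALCHEMY_ECHO = True  # log every SQL statement SQLAlchemy emits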

The resulting query looks exactly as we expected:

SELECT count("*") AS count_1
FROM log_user_status
WHERE log_user_status.user_id = 123456

The page load time dropped from 1.5 minutes to less than a second 🕶️.