2023-10-12

Troubleshooting: SQL-Modes (and my sanity)

This week, I got confronted with the "ONLY_FULL_GROUP_BY" SQL-Mode, mostly because I wrote a faulty query and introduced configuration errors in our local development stack some weeks prior. Both errors combined lead to some confusion, so I wanted to write down the aftermath to have the ability to come back every couple months to do a facepalm.

The Problem

I refactored parts of our business logic and slightly altered a jooq-query, after some manual testing everything seemed fine, so I opened a merge-request and ran our unit/integration-tests: 40 Tests failed, all with:

ERROR 1055 (42000): Expression #2 of SELECT [x] is not in GROUP BY clause and contains nonaggregated column 'xxxx' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

I had no idea what only_full_group_by was, and the error message was kinda cryptic for me. Why does it only fail inside our integration tests? A quick google search revealed, that this sql-mode is switched on per default on mysql version >5.7, but we already use version >8.0, so that should not be the issue.

What was the problem in the query

A simplified example: We have a "car-parts" table and a translation table to localize part names:

Parts:

IDName
1Reifen
2Tür

Translation Table:

IDPartIDLangTranslation
11deReifen
21enTyre
32deTür
42enDoor

Part of my refactoring was to only return one localized string from the translation table, instead of returning all available translations in a map, as we did before with json_merge_preserve, concat and group_concat

The new (really simplified) query looked like this:

1select parts.id,
2 translation.name
3from parts
4 join translation on (translation.PartID = parts.id and
5 (translation.language = 'de' or translation.language = 'en' ))
6group by parts.id
7having parts.id = 1021;

I made 2 mistakes:

My goal with this query was to fall back to the default translation in case if there is no translation for the item being fetched, that's completely wrong, and I would need a completely different query to project this logic in code.

Furthermore, the default translation is already inside the parts table, a simple coalesce on the parts.name and translation.translation columns would do the job, it's a little harder to spot inside the real codebase, and by having one translation twice in different tables our database is not normalized, but it makes fixing this query easier, and I still feel ashamed not noticing the presence of default translation in the first place.

The SQL Error

By introducing the or logic inside the join, multiple translations exist for a single part-ID (the id that we group), the group by function cannot decide what translated string to choose from. Mysql versions prior to 5.7 operated in a so called "forgiving mode" and would just select one random translation, the application would receive data from the database, but it would not be consistent and undeterministic.

One solution could be to concatenate multiple results, but as we need only one translation anyway, the mentioned coalesce trick is enough for us. After getting rid of the or statement, only one translation gets returned, and our query works again.

Why did the problem not occur on our dev db?

I set up a local docker container to run mysql on my machine, and we have a docker run command in our README.MD because we made some modifications:

1docker run --name x -p 33201:3306 -e MYSQL_ROOT_PASSWORD=x -e MYSQL_USER=x -e MYSQL_PASSWORD=x --sql-mode= -e MYSQL_DATABASE=x -d mysql:8.0.30 --character-set-server=utf8mb4 --collation-server=utf8mb4_general_ci --max-allowed-packet=67108864 --lower_case_table_names=1
2

Spot it?

I screwed up while copy-pasting some commands and --sql-mode= sneaked into the docker command, resulting in that our local dev database started without a single active sql-mode 🫠. A small, hard to spot change with consequences.

I have some serious doubts in my sanity now and will triple check every change I do from now on ... especially when copy-pasting something. At least I learned something about sql-modes ... the hard way. I'm glad that this project has a lot of integration tests.

Example

https://onecompiler.com/mysql/3zqk8rjmd

1
2-- create
3CREATE TABLE Parts (
4 id INTEGER PRIMARY KEY,
5 name TEXT NOT NULL
6);
7
8-- create
9CREATE TABLE translation (
10 id INTEGER PRIMARY KEY,
11 partID INTEGER,
12 lang varchar(21) NOT NULL,
13 name TEXT NOT NULL,
14 constraint uc_valueclass_translation
15 unique (partID, lang)
16);
17
18
19
20-- insert
21INSERT INTO Parts VALUES (0001, 'A');
22INSERT INTO Parts VALUES (0002, 'B');
23INSERT INTO Parts VALUES (0003, 'C');
24
25INSERT INTO translation VALUES (0001,0001, 'DE', 'de');
26INSERT INTO translation VALUES (0002,0001, 'EN', 'en');
27
28select
29 Parts.id,
30 translation.name
31from Parts
32 join translation on (translation.PartID = Parts.id and translation.lang = 'EN' )
33group by Parts.id;

Additional note: My assumption why this query still works: Even though translation.name is not dependent on the GROUP BY clause, the query still runs because of our unique constraint. A specific partID-lang combination can only occur once, so the result is always unique per ID. If you remove the constraint the error reappears.