Introduction
Suppose that we have a MySQL table with the following fields:
SHOW COLUMNS FROM survey_responses;
---------------------------------------------------------------------------------------------------------
| id (PK) | survey_id (FK) | page_id (FK) | question_id (FK) | option_id (FK) | session_id (FK) | value |
---------------------------------------------------------------------------------------------------------
You don’t need a lot of context, you just need to know thtat this table is used in a survey application, where users can create their surveys and submit them to other users.
The Problem
Suppose that, because a bad decision in the past (we have always been there, didn’t we?), there is no unique constraint on the foreign keys, so it is possible to have duplicate rows (and therefore duplicate responses for each survey and question in the same interview session). That would be unfortunate!
So how do we clean the table for duplicate rows, before applying the unique constraint to the required columns? In this case we can leverage the id
column, which is an auto_increment
, to distinguish which row we need to keep from the duplicate ones.
With the following query we can check the IDs of the rows that are duplicate. We use the GROUP BY
clause for the columns that should be unique:
SELECT GROUP_CONCAT(id), survey_id, page_id, question_id, option_id, session_id, COUNT(value) AS c
FROM table
GROUP BY survey_id, page_id, question_id, option_id, session_id
HAVING c > 1
The Solution
The query to delete the duplicate rows is straightforward. We just need to keep the row with the lower id
(which is the first response to the survey) from the ones that are duplicated.
DELETE a
FROM survey_responses AS a, survey_responses AS b
WHERE a.id < b.id
AND a.survey_id = b.survey_id
AND a.page_id = b.page_id
AND a.question_id, = b.question_id
AND a.option_id = b.option_id
AND a.session_id = b.session_id
In this case we are sure that all the columns submitted for comparison (survey_id
, page_id
, question_id
, option_id
, session_id
) contain values, so we can safely use the =
operator.
However if one of our columns contains NULL
values we also have to check that each column is NULL
.
DELETE a
FROM survey_responses AS a, survey_responses AS b
WHERE a.id < b.id
AND (a.survey_id = b.survey_id OR a.survey_id IS NULL AND b.survey_id IS NULL)
AND (a.page_id = b.page_id OR a.page_id IS NULL AND b.page_id IS NULL)
AND (a.question_id, = b.question_id OR a.question_id IS NULL AND b.question_id IS NULL)
AND (a.option_id = b.option_id OR a.option_id IS NULL AND b.option_id IS NULL)
AND (a.session_id = b.session_id OR a.session_id IS NULL AND b.session_id IS NULL)
But we can simplify this query using the spaceship operator, which allows us to compare columns without worrying for NULL
values, in fact it never yields NULL
as a possible outcome, as opposed to the =
operator.
DELETE a
FROM survey_responses AS a, survey_responses AS b
WHERE a.id < b.id
AND a.survey_id <=> b.survey_id
AND a.page_id <=> b.page_id
AND a.question_id, <=> b.question_id
AND a.option_id <=> b.option_id
AND a.session_id <=> b.session_id
Here a quick reminder of the difference between the spaceship operator and the regular equal operator. For the columns a
and b
we use 1
to indicate that the column contains a value. For the other columns we use the actual outcome of the operators.
----------------------------
| a | b | = | <=> |
----------------------------
| 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 1 | NULL | NULL | 0 |
| NULL | 1 | NULL | 0 |
| NULL | NULL | NULL | 1 |