Introduction
Recently I needed to extract a list of URLs from a MySQL table. All those URLs were inserted in a single TEXT
column, however they were all separated by semicolons. The situation was more or less like the following:
SELECT * FROM links
---------------------------------------------------------------
| id | urls |
---------------------------------------------------------------
| 1 | http://first.link;http://second.link |
| 2 | https://third.link |
| 3 | |
| 4 | http://fourth.link;http://fifth.link;https://sixth.link|
| ...| ... |
I wanted to get those URLs with a single query, without using any programming language. Luckily for me I could make a few assumptions to make my work a lot easier:
- The column contained only urls or empty strings
- There were a maximum of 5 urls for each row
Diggin in
MySQL does not include a function to split a delimited string. You can define a function yourself for this purpose. However I didn’t have the required privileges on the database to create such function, so I just sticked with subqueries.
The first thing that I tried was a query to count the number of occurrences of semicolons (and therefore of urls) in the urls
column:
SELECT id, ROUND((LENGTH(urls) - LENGTH(REPLACE (urls, ";", ""))) / LENGTH(";")) AS count
FROM links
--------------
| id | count |
-------------
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
| 4 | 2 |
| ...| ... |
As you can see, there is a subtle little problem here. The query counts the semicolon occurrences, therefore the row with an empty string and one with a single url have the same value. We need to introduce a condition, for example we can fix the query as follows:
SELECT id, ROUND((LENGTH(urls) - LENGTH(REPLACE (urls, ";", ""))) / LENGTH(";")) AS count
FROM links
WHERE urls <> ""
--------------
| id | count |
-------------
| 1 | 1 |
| 2 | 0 |
| 4 | 2 |
| ...| ... |
Much better! Now we know that the first row contains one semicolon (two urls), the second row contains zero semicolons (one url), and so on.
Now we need to get the actual data. The general idea is to use the count
value to extract, in a separate column, only the portion of string that we are interested in. Remember that we have a maximum of 5 urls? That helps us tremendously. Let’s visualize the idea:
-----------------------------------------------------------------------------
| id | count | first url | second url | third url | fourth url | fifth url |
-----------------------------------------------------------------------------
| 1 | 1 | | | | | |
| 2 | 0 | | | | | |
| 4 | 2 | | | | | |
| ... | ... | | | | | |
We can compute url columns in this way:
- first url: from the start of the string until the first
;
occurrence - second url: from the the first
;
occurrence to the second - third url: from the the second
;
occurrence to the third - …
For this purpose we have at our disposal the SUBSTRING_INDEX(str,delim,count)
function
Returns the substring from string
str
beforecount
occurrences of the delimiterdelim
. Ifcount
is positive, everything to the left of the final delimiter (counting from the left) is returned. Ifcount
is negative, everything to the right of the final delimiter (counting from the right) is returned.
The final query is the following:
SELECT id, intermediate.count,
IF(intermediate.count >= 0, SUBSTRING_INDEX(urls, ';', 1), NULL) AS first,
IF(intermediate.count >= 1, SUBSTRING_INDEX(SUBSTRING_INDEX(urls, ';', 2), ';', -1), NULL) AS second,
IF(intermediate.count >= 2, SUBSTRING_INDEX(SUBSTRING_INDEX(urls, ';', 3), ';', -1), NULL) AS third,
IF(intermediate.count >= 3, SUBSTRING_INDEX(SUBSTRING_INDEX(urls, ';', 4), ';', -1), NULL) AS fourth,
IF(intermediate.count >= 4, SUBSTRING_INDEX(SUBSTRING_INDEX(urls, ';', 5), ';', -1), NULL) AS fifth
FROM
links
JOIN (
SELECT id,
ROUND ((LENGTH(urls) - LENGTH( REPLACE ( urls, ";", "") ) ) / LENGTH(";")) AS count
FROM links
WHERE urls <> ""
) AS intermediate
ON links.id = intermediate.id
-----------------------------------------------------------------------------------------------
| id | count | first | second | third | fourth | fifth |
-----------------------------------------------------------------------------------------------
| 1 | 1 | http://first.link | http://second.link | NULL | NULL | NULL |
| 2 | 0 | https://third.link | NULL | NULL | NULL | NULL |
| 4 | 2 | http://fourth.link | http://fifth.link | https://sixth.link | NULL | NULL |
| ...| ... | | | | | |
The icing on the cake would be pivoting the table from columns to rows, filtering the NULL ones. Unfortunately MySQL does not have a specific statement allowing to do this, for example as SQLServer does. So the easiest solution is just to export the resulting data in Atom, SublimeText or in your favorite editor, and later with a few keyboard tricks you should finally have your list of urls.