Storing barcodes
In a previous blogpost, we talked about barcodes and EAN-numbers, and how in some cases barcodes with different numbers still reference the same product. The solution for searching there was not very performant, and wouldn’t scale to more than a handful of barcodes to check.
This post describes how to store barcodes in the PostgreSQL database in a normalized way, so that searching can be done without slowing down queries. As a bonus, we get a stronger uniqueness constraint.
Partial barcodes
Let’s quickly summarize why this is important again. Barcodes, we’ll focus on EAN-numbers, consist of
a 3-digit country code, and a product identifier. Some of these country codes are reserved, however,
and each country can use them as they wish. In The Netherlands, the country codes 21
, 22
, 23
and 28
are reserved for codes that include a short product indentifier as well as a price or weight.
This allows for example supermarkets to weigh apples, and print a barcode with product identifier
and weight.
Since we are just interested in the product identifier, and want to find any matching product (regardless its price or weight); and we want to be sure that we only have one product in our database, we’d like to store barcodes without price or weight.
Normalized barcodes
First we need a set of functions to normalize a barcode, so that for partial barcodes any price or weight is cleared:
--- number of digits of the product identification part of the barcode
CREATE FUNCTION ean13_nl_id_length(text) RETURNS integer
LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE
AS $_$
BEGIN
RETURN CASE
WHEN LENGTH($1) = 13 THEN
CASE LEFT($1, 2)
WHEN '21' THEN 6
WHEN '22' THEN 6
WHEN '23' THEN 7
WHEN '28' THEN 6
ELSE 13
END
ELSE LENGTH($1)
END;
END;
$_$;
--- barcode with weight or price removed
CREATE FUNCTION ean13_nl_id(text) RETURNS text
LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE
AS $_$
BEGIN
RETURN LEFT($1, ean13_nl_id_length($1));
END;
$_$;
--- normalized barcode
CREATE FUNCTION ean13_nl_normalize(text) RETURNS text
LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE
AS $_$
BEGIN
RETURN CASE
WHEN LENGTH($1) = 13 THEN
RPAD(ean13_nl_id($1), 13, '0')
ELSE $1
END;
END;
$_$;
What do these functions do? Let’s find out:
SELECT
ean,
ean13_nl_id_length(ean) AS id_length,
ean13_nl_id(ean) AS id,
ean13_nl_normalize(ean) AS normalized
FROM UNNEST(ARRAY['1234', '12345', '0000000000000', '2201234567890', '2301234567890']) ean;
--- => ean id_length id normalized
--- => '1234' 4 '1234' '1234'
--- => '12345' 5 '12345' '12345'
--- => '0000000000000' 13 '0000000000000' '0000000000000'
--- => '2201234567890' 6 '220123' '2201230000000'
--- => '2301234567890' 7 '2301234' '2301234000000'
The ean13_nl
data type
In PostgreSQL, one can define a new data type (which is called a DOMAIN
), and place restrictions on
its contents. This references the last function we just created:
CREATE DOMAIN ean13_nl AS text
CONSTRAINT ean13_nl_check CHECK ((VALUE = ean13_nl_normalize(VALUE)));
Setup a sample database
Let’s put it to use in a table:
CREATE TABLE barcodes (barcode ean13_nl PRIMARY KEY, name VARCHAR);
INSERT INTO barcodes VALUES
(ean13_nl_normalize('8712439020109'), 'Jori Pindakaas 350g'),
(ean13_nl_normalize('8713576100129'), 'TerraSana Pindakaas 250g'),
(ean13_nl_normalize('2164941000227'), 'EkoPlaza Appel Elstar'),
(ean13_nl_normalize('2232072000000'), 'Jumbo Jonagold'),
(ean13_nl_normalize('2232140001472'), 'Jumbo Elstar Appels'),
(ean13_nl_normalize('125645'), 'Jumbo Appel Jonagold');
Let’s have a look at what’s in the table now:
SELECT * FROM barcodes`;
--- => '8712439020109' 'Jori Pindakaas 350g'
--- => '8713576100129' 'TerraSana Pindakaas 250g'
--- => '2164940000000' 'EkoPlaza Appel Elstar'
--- => '2232070000000' 'Jumbo Jonagold'
--- => '2232140000000' 'Jumbo Elstar Appels'
--- => '125645' 'Jumbo Appel Jonagold'
Note that it isn’t possible to insert a non-normalized barcode, which helps to maintain consistency:
INSERT INTO barcodes VALUES ('2164941000238', 'Foo');
--- ERROR: value for domain ean13_nl violates check constraint "ean13_nl_check"
Find by barcode
Just make sure to feed each barcode through the ean13_nl_normalize
function:
SELECT name FROM barcodes WHERE barcode = ean13_nl_normalize('2232140001481');
--- => 'Jumbo Elstar Appels'
One can also join by barcode:
SELECT a.* FROM barcodes a
INNER JOIN
UNNEST(ARRAY['8712439020109', '2232140001481', '21987']) b
ON a.barcode = ean13_nl_normalize(b);
--- => '8712439020109' 'Jori Pindakaas 350g'
--- => '2232140001472' 'Jumbo Elstar Appels'
Finding multiple barcodes
To make it just a bit easier to find multiple barcodes, one can create another function:
--- normalized barcodes
CREATE FUNCTION ean13_nl_normalize(text[]) RETURNS text[]
LANGUAGE plpgsql IMMUTABLE
AS $_$
BEGIN
RETURN ARRAY(SELECT DISTINCT ean13_nl_normalize(g.i) FROM unnest($1) g(i));
END;
$_$;
This can be used to search for multiple barcodes:
SELECT * FROM barcodes
WHERE barcode = ANY(ean13_nl_normalize(ARRAY['8712439020109', '2232140001481', '21987']));
--- => '8712439020109' 'Jori Pindakaas 350g'
--- => '2232140001472' 'Jumbo Elstar Appels'
This may be slightly easier in complex queries.