Finding partial barcodes in SQL
With the Questionmark App you can scan the barcode of a supermarket product, and see how sustainable it is compared to other products. This barcode is generally a 13 or 8 digit international article number (EAN/UPC), which identifies the product. These numbers are handed out by the international GS1 organisation.
Barcode: a unique identifier?
This number is like a unique identifier for a (consumer) product. Pretty uniquely. But not completely. There are rules for this (also NL-specific ones). These aren’t always followed perfectly, however. And for us, the precise composition and sourcing of the product is important. So a small change in sugar or salt content, for example (or even a gradual change over time), may not warrant a new EAN-code. Bigger changes, e.g. when the brand name changes, would generally result in a new code.
So our Questionmark products generally have one barcode, sometimes more. We can’t really do much about different products with the same barcode - and thus we take the latest information we have. That fits with our goal of improving competition on sustainability and health.
GS1 prefixes
But there’s more. Each EAN-number starts with a GS1 Prefix,
which is a 3-digit country code, handed out by the country’s GS1 member organisation. Some prefixes, however,
are meant to be used internally by companies. These identifiers are not global. This is the range 020
-029
.
Store-packaged products
In some supermarkets fruits and vegetables can be weighed by the consumer, where a barcode is printed on
the spot. This barcode is a special one, where part is product number, and part is weight, quantity or
price. These prefixes all start with 2
. How these are handled exactly depends on the GS1 member organisation.
Sweden has a good explanation, and the
Dutch situation is explained in section 2.2.14 and 3.3.1 of their handbook
(which is in Dutch).
The bad news is that the same number is often used by different companies. We have found many store-packaged EAN-codes (we call them partial barcodes) that are completely different products in different supermarkets.
Finding barcodes
Now to the more technical part of this post. Say a user of our app scans a barcode, and we’d like to locate the product associated to it. For ‘regular’ barcodes, we can just search by number, but for store-packaged barcodes, the match is slightly more complicated.
Let’s see how we can search using the PostgreSQL database.
Dutch rules for weighted products
Dutch EAN-codes that consist of a product code (i
) and a price (e
) or weight (w
), together
with check digits (c
), have the following format:
21
iiii
c
eeeee
c
: price in euro cents22
iiii
c
eeeee
c
: price in euro cents23
iiiii
c
eeee
c
: price in euro cents (nationally unique)28
iiii
c
wwwww
c
: weight in grams
For example, the EAN-code 2164941000227
would have product code 6494
and cost 0.22 euro’s.
Setup a sample database
--- some barcodes with their product name
CREATE TABLE barcodes (barcode VARCHAR PRIMARY KEY, name VARCHAR);
INSERT INTO barcodes VALUES
('8712439020109', 'Jori Pindakaas 350g'),
('8713576100129', 'TerraSana Pindakaas 250g'),
('2164941000227', 'EkoPlaza Appel Elstar'),
('2232072000000', 'Jumbo Jonagold'),
('2232140001472', 'Jumbo Elstar Appels'),
('125645', 'Jumbo Appel Jonagold');
Find by ‘regular’ barcode
Finding the ‘regular’ barcode 8712439020109
is easy:
SELECT name FROM barcodes WHERE barcode = '8712439020109';
--- => 'Jori Pindakaas 350g'
Find by ‘partial’ barcode
But for the weighted product 2232073001136
, we need something else. In this case, the
identifying product number is only the first 6 digits, so we can query using:
SELECT name FROM barcodes WHERE
barcode LIKE '223207%' AND LENGTH(barcode) = 13;
--- => 'Jumbo Jonagold'
But say we have a list of barcodes we want to find, that needs a bit more intelligence. Let’s
first construct a list of barcodes (8712439020109
, 2232140001481
, 21987
), with the number of digits to compare:
SELECT
barcode,
CASE WHEN LENGTH(barcode) = 13 THEN
CASE LEFT(barcode, 2)
WHEN '21' THEN 6
WHEN '22' THEN 6
WHEN '23' THEN 7
WHEN '28' THEN 6
ELSE 13
END
ELSE LENGTH(barcode)
END AS id_len
FROM (
VALUES ('8712439020109'), ('2232140001481'), ('21987')
) search (barcode);
--- => '8712439020109' 13
--- => '2232140001481' 6
--- => '21987' 5
Applying this search to the barcodes table, we get:
SELECT barcodes.barcode, barcodes.name FROM barcodes
INNER JOIN (
SELECT
barcode,
CASE WHEN LENGTH(barcode) = 13 THEN
CASE LEFT(barcode, 2)
WHEN '21' THEN 6
WHEN '22' THEN 6
WHEN '23' THEN 7
WHEN '28' THEN 6
ELSE 13
END
ELSE LENGTH(barcode)
END AS id_len
FROM (
VALUES ('8712439020109'), ('2232140001481'), ('21987')
) s (barcode)
) s ON
LENGTH(barcodes.barcode) = LENGTH(s.barcode) AND
LEFT(barcodes.barcode, s.id_len) = LEFT(s.barcode, s.id_len);
--- => '8712439020109' 'Jori Pindakaas 350g'
--- => '2232140001472' 'Jumbo Elstar Appels'
This is even applicable when JOIN
ing another table with barcodes instead of the list used here.
update While this approach works, it is not very performant (as you may have guessed). A cleaner solution may be to normalize barcodes on ingress. This is discussed in another post.