Finding partial barcodes in SQL
With the Questionmark App you can scan the barcode of a supermarket product, and see how sustainable it is compared to other products. This barcode is generally a 13 or 8 digit international article number (EAN/UPC), which identifies the product. These numbers are handed out by the international GS1 organisation.
Barcode: a unique identifier?
This number is like a unique identifier for a (consumer) product. Pretty uniquely. But not completely. There are rules for this (also NL-specific ones). These aren’t always followed perfectly, however. And for us, the precise composition and sourcing of the product is important. So a small change in sugar or salt content, for example (or even a gradual change over time), may not warrant a new EAN-code. Bigger changes, e.g. when the brand name changes, would generally result in a new code.
So our Questionmark products generally have one barcode, sometimes more. We can’t really do much about different products with the same barcode - and thus we take the latest information we have. That fits with our goal of improving competition on sustainability and health.
But there’s more. Each EAN-number starts with a GS1 Prefix,
which is a 3-digit country code, handed out by the country’s GS1 member organisation. Some prefixes, however,
are meant to be used internally by companies. These identifiers are not global. This is the range
In some supermarkets fruits and vegetables can be weighed by the consumer, where a barcode is printed on
the spot. This barcode is a special one, where part is product number, and part is weight, quantity or
price. These prefixes all start with
2. How these are handled exactly depends on the GS1 member organisation.
Sweden has a good explanation, and the
Dutch situation is explained in section 2.2.14 and 3.3.1 of their handbook
(which is in Dutch).
The bad news is that the same number is often used by different companies. We have found many store-packaged EAN-codes (we call them partial barcodes) that are completely different products in different supermarkets.
Now to the more technical part of this post. Say a user of our app scans a barcode, and we’d like to locate the product associated to it. For ‘regular’ barcodes, we can just search by number, but for store-packaged barcodes, the match is slightly more complicated.
Let’s see how we can search using the PostgreSQL database.
Dutch rules for weighted products
Dutch EAN-codes that consist of a product code (
i) and a price (
e) or weight (
with check digits (
c), have the following format:
c: price in euro cents
c: price in euro cents
c: price in euro cents (nationally unique)
c: weight in grams
For example, the EAN-code
2164941000227 would have product code
6494 and cost 0.22 euro’s.
Setup a sample database
--- some barcodes with their product name CREATE TABLE barcodes (barcode VARCHAR PRIMARY KEY, name VARCHAR); INSERT INTO barcodes VALUES ('8712439020109', 'Jori Pindakaas 350g'), ('8713576100129', 'TerraSana Pindakaas 250g'), ('2164941000227', 'EkoPlaza Appel Elstar'), ('2232072000000', 'Jumbo Jonagold'), ('2232140001472', 'Jumbo Elstar Appels'), ('125645', 'Jumbo Appel Jonagold');
Find by ‘regular’ barcode
Finding the ‘regular’ barcode
8712439020109 is easy:
SELECT name FROM barcodes WHERE barcode = '8712439020109'; --- => 'Jori Pindakaas 350g'
Find by ‘partial’ barcode
But for the weighted product
2232073001136, we need something else. In this case, the
identifying product number is only the first 6 digits, so we can query using:
SELECT name FROM barcodes WHERE barcode LIKE '223207%' AND LENGTH(barcode) = 13; --- => 'Jumbo Jonagold'
But say we have a list of barcodes we want to find, that needs a bit more intelligence. Let’s
first construct a list of barcodes (
21987), with the number of digits to compare:
SELECT barcode, CASE WHEN LENGTH(barcode) = 13 THEN CASE LEFT(barcode, 2) WHEN '21' THEN 6 WHEN '22' THEN 6 WHEN '23' THEN 7 WHEN '28' THEN 6 ELSE 13 END ELSE LENGTH(barcode) END AS id_len FROM ( VALUES ('8712439020109'), ('2232140001481'), ('21987') ) search (barcode); --- => '8712439020109' 13 --- => '2232140001481' 6 --- => '21987' 5
Applying this search to the barcodes table, we get:
SELECT barcodes.barcode, barcodes.name FROM barcodes INNER JOIN ( SELECT barcode, CASE WHEN LENGTH(barcode) = 13 THEN CASE LEFT(barcode, 2) WHEN '21' THEN 6 WHEN '22' THEN 6 WHEN '23' THEN 7 WHEN '28' THEN 6 ELSE 13 END ELSE LENGTH(barcode) END AS id_len FROM ( VALUES ('8712439020109'), ('2232140001481'), ('21987') ) s (barcode) ) s ON LENGTH(barcodes.barcode) = LENGTH(s.barcode) AND LEFT(barcodes.barcode, s.id_len) = LEFT(s.barcode, s.id_len); --- => '8712439020109' 'Jori Pindakaas 350g' --- => '2232140001472' 'Jumbo Elstar Appels'
This is even applicable when
JOINing another table with barcodes instead of the list used here.
update While this approach works, it is not very performant (as you may have guessed). A cleaner solution may be to normalize barcodes on ingress. This is discussed in another post.