19 February 2021

We are using PostgreSQL as a database, with much success. Two of our main datastores are one containing raw product data and another processed product data. Usually this separation holds well, but there are times we’d like to lay them side by side.

PostgreSQL FDW allows querying external data sources from within the database. We use it to connect from the processed to the raw database. In these cases there is a limited list of products we’d like to inspect, augmenting it with raw data.

A basic example

This worked fairly well in many cases. To illustrate, I’ll give a small example. We assume that you have created the databases sources and products and have superuser privileges. In this example we’ll have one database server with two databases, in reality these would be on different physical servers.

-- populate the sources database
\c sources
CREATE TABLE sources (id INTEGER, source VARCHAR, barcode VARCHAR, name VARCHAR);
INSERT INTO sources VALUES
  (1002, 'Retailer A', '2162130000000', 'Pumpkin (whole)'),
  (1003, 'Retailer B', '2162130000000', 'Pumpkin'),
  (1004, 'Retailer A', '8718452095551', 'Cake'),
  (1005, 'Retailer C', '8718452095551', 'Farmer cake'),
  (1006, 'Retailer B', '8719326012759', 'Coffee'),
  (1007, 'Retailer C', '8719326012759', 'Holy beans');

-- populate the products database
\c products
CREATE TABLE products (id INTEGER, barcode VARCHAR, name VARCHAR);
INSERT INTO products VALUES
  (1, '2162130000000', 'Pumpkin'),
  (2, '8718452095551', 'Farmer''s Cake'),
  (3, '8719326012759', 'Holy beans coffee');

Now that the tables are set up and have some data, we can query them separately.

\c sources
SELECT * FROM sources WHERE barcode = '8718452095551';
┌──────┬────────────┬───────────────┬─────────────┐
│  id  │   source   │    barcode    │    name     │
├──────┼────────────┼───────────────┼─────────────┤
│ 1004 │ Retailer A │ 8718452095551 │ Cake        │
│ 1005 │ Retailer C │ 8718452095551 │ Farmer cake │
└──────┴────────────┴───────────────┴─────────────┘
\c products
SELECT * FROM products WHERE barcode = '8718452095551';
┌────┬───────────────┬───────────────┐
│ id │    barcode    │     name      │
├────┼───────────────┼───────────────┤
│  2 │ 8718452095551 │ Farmer's Cake │
└────┴───────────────┴───────────────┘

A federated database connection

At this moment, the two databases can be queried separately. But we also want to put data of both side by side. For that, we setup a federated database connection.

\c sources
CREATE USER sources_user PASSWORD 'test321';
GRANT ALL ON TABLE sources TO sources_user;

\c products
CREATE EXTENSION postgres_fdw;
CREATE SERVER sources_fdw FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname 'sources', host 'localhost', port '5432');
CREATE USER MAPPING FOR USER SERVER sources_fdw OPTIONS (user 'sources_user', password 'test321');

CREATE SCHEMA sources_fdw;
IMPORT FOREIGN SCHEMA public FROM SERVER sources_fdw INTO sources_fdw;

Let’s see if we can query the sources from the products database. Note that we do the queries on the sources_fdw schema, because we’ve IMPORTed the foreign tables in that schema above.

\c products
SELECT * FROM sources_fdw.sources WHERE barcode = '8718452095551';
┌──────┬────────────┬───────────────┬─────────────┐
│  id  │   source   │    barcode    │    name     │
├──────┼────────────┼───────────────┼─────────────┤
│ 1004 │ Retailer A │ 8718452095551 │ Cake        │
│ 1005 │ Retailer C │ 8718452095551 │ Farmer cake │
└──────┴────────────┴───────────────┴─────────────┘

That looks good. Let’s check out the query plan for this:

\c products
EXPLAIN VERBOSE SELECT * FROM sources_fdw.sources WHERE barcode = '8718452095551';
Foreign Scan on sources_fdw.sources  (cost=100.00..118.31 rows=3 width=100)                                  
  Output: id, source, barcode, name                                                                          
  Remote SQL: SELECT id, source, barcode, name FROM public.sources WHERE ((barcode = '8718452095551'::text)) 

You can see that the query is executed completely on the sources database (remote).

The best part is that we can now JOIN data between tables in the different databases.

\c products
SELECT p.id, p.barcode, p.name, s.source, s.name AS source_name
FROM products p
LEFT JOIN sources_fdw.sources s ON s.barcode = p.barcode
WHERE p.id = 2;
┌────┬───────────────┬───────────────┬────────────┬─────────────┐
│ id │    barcode    │     name      │   source   │ source_name │
├────┼───────────────┼───────────────┼────────────┼─────────────┤
│  2 │ 8718452095551 │ Farmer's Cake │ Retailer A │ Cake        │
│  2 │ 8718452095551 │ Farmer's Cake │ Retailer C │ Farmer cake │
└────┴───────────────┴───────────────┴────────────┴─────────────┘

Looking again at the query plan, you can see that data is combined from both databases.

\c products
EXPLAIN VERBOSE SELECT p.id, p.barcode, p.name, s.source, s.name AS source_name
FROM products p
LEFT JOIN sources_fdw.sources s ON s.barcode = p.barcode
WHERE p.barcode = '8718452095551';
Nested Loop Left Join  (cost=100.00..139.40 rows=12 width=132)                                                        
  Output: p.id, p.barcode, p.name, s.source, s.name                                                                   
  Join Filter: ((s.barcode)::text = (p.barcode)::text)                                                                
  ->  Seq Scan on public.products p  (cost=0.00..20.62 rows=4 width=68)                                               
        Output: p.id, p.barcode, p.name                                                                               
        Filter: ((p.barcode)::text = '8718452095551 '::text)                                                          
  ->  Materialize  (cost=100.00..118.60 rows=3 width=96)                                                              
        Output: s.source, s.name, s.barcode                                                                           
        ->  Foreign Scan on sources_fdw.sources s  (cost=100.00..118.59 rows=3 width=96)                              
              Output: s.source, s.name, s.barcode                                                                     
              Remote SQL: SELECT source, barcode, name FROM public.sources WHERE ((barcode = '8718452095551 '::text)) 

Entering a custom domain

So far, so good. Now we’d like to add a check to the barcode, so that we can only store valid barcodes in the database. Domain types are great for that. For this example, we’ll require the length to be 8 or longer (in reality one may want to verify the check digit, and normalize it a bit). We’ll add the check both to the sources and the products database.

\c sources
CREATE DOMAIN barcode AS VARCHAR CHECK (CHAR_LENGTH(TRIM (LEADING '0' FROM VALUE)) >= 8);
ALTER TABLE sources ALTER COLUMN barcode TYPE barcode;

\c products
CREATE DOMAIN barcode AS VARCHAR CHECK (CHAR_LENGTH(TRIM (LEADING '0' FROM VALUE)) >= 8);
ALTER TABLE products ALTER COLUMN barcode TYPE barcode;

-- we also need to update the remote table definition
DROP FOREIGN TABLE sources_fdw.sources;
IMPORT FOREIGN SCHEMA public FROM SERVER sources_fdw INTO sources_fdw;

Now this doesn’t really change anything, except that short barcodes cannot be inserted anymore. Let’s see …

\c sources
INSERT INTO sources VALUES (1008, 'Retailer A', '234', 'Dummy');
ERROR:  value for domain barcode violates check constraint "barcode_check"

The same would happen in products. Now let’s see how remote queries are doing.

\c products
SELECT * FROM sources_fdw.sources WHERE barcode = '8718452095551';
┌──────┬────────────┬───────────────┬─────────────┐
│  id  │   source   │    barcode    │    name     │
├──────┼────────────┼───────────────┼─────────────┤
│ 1004 │ Retailer A │ 8718452095551 │ Cake        │
│ 1005 │ Retailer C │ 8718452095551 │ Farmer cake │
└──────┴────────────┴───────────────┴─────────────┘

Same results, nothing changed. But how does the database get this data?

\c products
EXPLAIN VERBOSE SELECT * FROM sources_fdw.sources WHERE barcode = '8718452095551';
Foreign Scan on sources_fdw.sources  (cost=100.00..130.25 rows=3 width=100)
  Output: id, source, barcode, name                                        
  Filter: ((sources.barcode)::text = '8718452095551'::text)                
  Remote SQL: SELECT id, source, barcode, name FROM public.sources         

As you see, there is now an explicit filtering step. And in Remote SQL you can see that all data in the sources table is requested. With just a couple of records, this isn’t noticable, but on a real-world dataset, this can mean millions of records. And in the end only a handful of rows remain after the filter. This is very inefficient, and most queries will timeout before any result is returned.

So even though the barcode domain type is really nothing more than a VARCHAR, PostgreSQL doesn’t know, and takes the safe approach of doing the filtering locally.

Possible solutions

The ultimate solution would be to have PostgreSQL recognize that filtering on the remote domain can well be done. But that would take a while to implement. And perhaps there are cases that I haven’t considered that make it undesirable after all.

One peculiar thing is that when you don’t re-IMPORT FOREIGN SCHEMA after changing the data type, PostgreSQL still thinks it’s an ordinary VARCHAR column, and will do the right thing. So one solution could be to manually CREATE FOREIGN TABLE and use VARCHAR instead of barcode for the data type. This requires keeping the definitions manually in sync.

Another solution would be to create a VIEW on the remote database (sources) that exposes the table without the custom domain. That’s the approach we’re currently using. Since we need to run a script to update federated connection details now and then anyway, we can just as well create a remote schema with un-domain-type-ed views of the relevant tables right away, and use these as remote tables.

Since we want all columns of the sources table except barcodes, we first run a query resulting a single row with a SQL-statement, then execute it with \gexec (using the PostgreSQL CLI tool).

\c sources
CREATE SCHEMA IF NOT EXISTS expose_fdw;
SELECT
  'CREATE OR REPLACE VIEW expose_fdw.sources AS (SELECT '
  || (
       SELECT STRING_AGG(column_name, ',' ORDER BY ordinal_position)
       FROM information_schema.columns
       WHERE table_schema = 'public' AND table_name = 'sources' AND column_name != 'barcode'
     )
  || ', barcode::varchar FROM public.sources)';
\gexec

And then in the products database, we import from the expose_fdw schema instead of public.

\c products
DROP FOREIGN TABLE sources_fdw.sources;
IMPORT FOREIGN SCHEMA expose_fdw FROM SERVER sources_fdw INTO sources_fdw;

EXPLAIN VERBOSE SELECT * FROM sources_fdw.sources WHERE barcode = '8718452095551';
Foreign Scan on sources_fdw.sources  (cost=100.00..118.31 rows=3 width=100)                                     
  Output: id, source, name, barcode                                                                             
  Remote SQL: SELECT id, source, name, barcode FROM expose_fdw.sources WHERE ((barcode = '8718452095551'::text))

Indeed, the barcode filtering happens in the sources database, which is where it is done much more efficiently.