Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
We built our customer data warehouse all on Postgres (tembo.io)
234 points by samaysharma on Feb 2, 2024 | hide | past | favorite | 126 comments


Postgres is great but using pg_cron as your system to pull and clean data is a recipe for disaster in the long run. I've always regretted using complicated Postgres queries or functions instead of doing it in server. Sometimes I do it because I think "wow, I didn't know pg can do this. So cool, let's just use it!". Regret. Every time. They're too finicky, hard to test, and doesn't have the ecosystem like git and IDEs to make changes easily. And the next employee won't know about some random but crucial logic that lives inside Postgres instead of the code.

You're also relying on your Postgres instance's precious CPU/RAM/network to do logic work which can be unpredictable. Let's be honest, most managed cloud Postgres providers charge an arm and a leg for big Postgres instances already. Why use those expensive resources when you can get a cheap EC2 instance or a serverless function to do the same and have the ability to scale efficiently?

My general rule is, use Postgres to store as much of my data as possible, but never use Postgres to do anything other than be a data store.

This rule of mine is why I never fell in love with Supabase even though many developers have. Use Postgres as my permissions layer? As my API server? You have to be kidding. If you're building a quick throwaway prototype, maybe. (The rest of Supabase is really nice though).

PS. To the team at tembo, no one is going to sign up for a free trial without a pricing page.


> PS. To the team at tembo, no one is going to sign up for a free trial without a pricing page.

This, times 1000.

I know setting a price is scary, but it is one of the critical pieces of information for anyone stumbling on your product.

If you are looking for a way to take advantage of traffic but aren't ready for pricing, ask for my email address for your newsletter. That's easier for me to give than my time to "kick the tires" with a free trial.


> ask for my email address for your newsletter

So you’re the one who actually wants that popup? I guess the world really is coming to an end, I couldn’t imagine a single person who actually thinks to themselves “I wish a popup would happen asking for my email address right now.”


Not a popup, please! Just a from please. I want a way to keep on top of tembo that doesn't involve investing my time. Not saying I'd read every newsletter, though.


This 100%. A polite signup form at the bottom of a page (or an RSS feed) is totally fine. At this point I understand what you are about and will consider subscribing to your news.

But those popups that show up after I have read a few paragraphs of into can go to hell. I just leave the page when I see this at this point.


*"form" not "from"


Ugly pricing info here: https://tembo.io/docs/tembo-cloud/billing

We're working this sprint on getting up our "marketing" pricing page.


Awesome, thanks for sharing! Don't think it is that ugly, and definitely shares what folks need to know.

Would add that to the front page as a stopgap :) if I were you.


everyone does it

must work some of the time


Some of the time, companies think it works. I have no faith in how much/whether companies track the number of times a user simply closes the tab because they got a shitty popup while they were trying to _read about the product_. (It happens to me all the time: Hit a product page, semi-curious, I'm reading about it, then BAM, a popup asking me to sign up for a mailing list, so I close the tab and move on.)

Companies are far more likely to say "We got X people signing up for a mailing list, it must be working", and don't bother whatsoever to measure (a) how many of those signups are just bots and (b) how many potential customers they just turned away by nagging them. I would bet any money that these metrics are either not measured or simply ignored.


This rationale right here is why we are stuck with so much bloatware. "People enter the tarpits all the time and never leave. There must be something there, I'm going in"


> no unit testing

I also prefer to do stuff on the server because doing things in the database does invariably seem to drift towards a confusing, brittle mess. However, I do wonder if that's necessarily the case or just a limitation of our frameworks. There's no principle that would prevent someone from setting up a simple test suite for database code. It would be a lot of work, but maybe there's an architecture nirvana right in front of our eyes if someone just were to properly engineer a good framework around Postgres functions.


There's no IDE that can detect Postgres coding errors, no refactoring support, no Git integration.

A lot of the ecosystem has to be built out for me to want to use Postgres functions. The benefit is not there.

Edit: I didn't mean writing simple SQL queries. I meant writing your business/app logic in Postgres functions.


I wrote something [0]. It lets you write plpgsql functions, views, triggers and casts in a way that can be re-created without migration scripts, like editing source code. Just a quick “deploy” and the logic is updated.

It also manages migrations for tables, types, and other stuff in a really simple way. Upgrades are fully atomic, and it lets you write unit tests in SQL - which are run after every upgrade, run inside save points so they don’t affect the database, and can run during production deployments.

It’s sort of my own personal (open source) Swiss Army knife of plpgsql development. It’s a complete work in progress, not production ready, probably has bugs, and needs more and better documentation - but I use it daily. It lets me use Postgres as my main development environment.

[0] https://github.com/pgpkg/pgpkg

(It also lets you import packages from other sources so you can create libraries of reusable code, within some limits)


> without migration scripts, like editing source code. Just a quick “deploy” and the logic is updated.

This is 100% the key to sanity with managing database stored procedures and functions -- ability to manage them in Git like normal code and deploy them like code.

In contrast, the workflow from traditional imperative database "migration" tools is just super awkward for developing and maintaining any non-trivial number of SQL stored programs (procs, funcs, triggers, views, etc).

I wrote a blog post about this a few months ago, and although my product is aimed at MySQL and MariaDB, many of the concepts discussed apply to any relational DB: https://www.skeema.io/blog/2023/10/24/stored-proc-deployment...


Hey - Skeema looks very cool. I haven’t used MySQL for decades but I can see it’s solving much the same problems as pgpkg.

I did decide that declarative tables were too hard when I wrote the first predecessor of pgpkg in bash 10+ years ago - but maybe I should reconsider now that I’m working in Go!

The thing is, it’s a bit of a rabbit hole, there isn’t much tooling around this stuff despite stored functions being so, so much easier for writing database logic than anything else.

I feel the industry has wasted an enormous amount of time on ORMs and other nonsense when stored procedures have been under our noses the whole time.


Thanks! Yeah definitely agree that building out declarative table management for Postgres would be a major effort. A few open source projects I've seen in that area include:

https://github.com/sqldef/sqldef (Go)

https://github.com/bikeshedder/tusker (Python but being ported to Rust)

https://github.com/tyrchen/renovate (Rust)

https://github.com/blainehansen/postgres_migrator (Rust)

Some of these are based on parsing SQL, and others are based on running the CREATEs in a temporary location and introspecting the result.

The schema export side can be especially tricky for Postgres, since it lacks a built-in equivalent to MySQL's SHOW CREATE TABLE. So most of these declarative pg tools shell out to pg_dump, or require the user to do so. But sqldef actually implements CREATE TABLE dumping in pure Golang if I recall correctly, which is pretty cool.

There's also the question of implementing the table diff logic from scratch, vs shelling out to another tool or using a library. For the latter path, there's a nice blog post from Supabase about how they evaluated the various options: https://supabase.com/blog/supabase-cli#choosing-the-best-dif...


I looked at that supabase blog but the problem in my experience is that a refactor is not just about changing the database tables, but also rearranging the data within those tables - the data in a production database is obviously the most important thing. So I think the schema diff is actually the easy part. And for that, I wouldn’t use text tools like “show create tables”, I’d use the PG catalog tables directly.

The problem I had was that I could never come up with a declarative scheme that would allow reliable data transformation for all databases over the long term.

For example - if I have a table “lookup” with two columns “key” and “name”, over a period of years we might see a transformation like this:

    alter table lookup rename column name to description;
      […later…]
    alter table lookup add column name text default description;
      […later…]
    alter table lookup drop column description;
Assuming “name” and “description” are modified between upgrades, then running these updates over time would result in a different transformation than if you just apply the latest definitions to an old database table.

I couldn’t ever come up with a solution to this that’s simpler than a sequence of migration scripts, which is always repeatable. I haven’t had a look at Skeema yet but am curious how you deal with this?


> So I think the schema diff is actually the easy part. And for that, I wouldn’t use text tools like “show create tables”, I’d use the PG catalog tables directly.

re: SHOW CREATE TABLE, I was referring to the export logic, not diff logic. In other words, when a new user adopts a declarative schema management tool, they need some way of dumping their existing database schema to the filesystem as a set of CREATE statements. Most pg tools seem to just leverage pg_dump for this, but in my opinion that's not great since it's an external dependency.

And then ideally there's also a way to re-sync the filesystem in the future as needed, pulling the latest definitions from a given DB server. This is also an export, but it should be smart enough to only overwrite CREATE statements where the corresponding object has actually changed, to avoid stomping on formatting or inline comments.

The ability to do a "pull" operation is very useful in development workflows: engineers can make DDL changes to a dev DB directly while developing a feature, and then pull those changes into the filesystem to turn them into a git commit / pull request. It's also sometimes useful in production workflows, in case someone had to make an "out of band" emergency hotfix directly to the prod DB, outside of the schema management tool.

> alter table lookup rename column name to description

> I haven’t had a look at Skeema yet but am curious how you deal with this?

Skeema just doesn't support renames directly at all. In practice this actually works out fine, since renames are hugely problematic in production databases anyway due to deploy-order concerns: there's no way to deploy an application change at the same exact moment as the RENAME is executed in SQL, so it cannot be performed "online". Best practice with schema changes is for applications to be able to work fine with both the old and new schema, and renames typically break this rule. (Well, unless you do a convoluted multi-step dance with view-swapping, for DBMS that support transactional DDL... this is doable in pg, but not in all other DBs.)

In Skeema, if you really need to do a rename, you can do it out-of-band (outside Skeema) on all environments (prod/stage/dev/etc), and then use `skeema pull` to update the filesystem definition to match.

Or for new tables that aren't populated in prod yet, happily a rename is entirely equivalent to drop-then-re-add. So this case is trivial, and Skeema can be configured to allow destructive changes only on empty tables.


> when a new user adopts a declarative schema management tool, they need some way of dumping their existing database schema to the filesystem as a set of CREATE statements.

Got it. Yeah - tbh I hadn't thought about that, but declarative tables never been on my radar (for the reasons mentioned).

> Skeema just doesn't support renames directly at all

Right! that's one way to deal with it! :)

> there's no way to deploy an application change at the same exact moment as the RENAME is executed in SQL

I guess it depends on your ops environment, but I don't think renames are special; any schema update can be incompatible with client code if you're not careful. That said, in the systems I've worked with (telco and saas), the schema upgrade has always run with applications stopped - the migration code is shipped with the server application, and run as part of its startup sequence. Even if there are multiple clients, the only requirement is that they are all stopped before the upgraded processes start. (This is something you can delegate to K8s for example).

I think the interesting thing is that there are such very different approaches to schema upgrades. Skeema and pgpkg obviously have different views about how migrations should be managed, despite the fact that you (presumably) and I have both done many tens of thousands of them over the years.

One important difference, as you mention, is that PG has transactional DDL - which really changes the game in terms of how you can think about upgrades. pgpkg is deliberately PG specific; it will never support any other kind of database, because the assumptions it makes are invalid in most other systems.


That looks like an awesome tool. I am going to try it out. Gave you a star!


Awesome, thanks! I update it whenever I work with it, which has ramped up in the last few weeks (added cast support a few days ago). It uses pg_analyse to parse the SQL and uses a few tricks to get everything updated.

The best thing (IMO) is that there is basically no funny stuff, no filename conventions, no funny delimiters. It’s just regular Postgres SQL, and a couple of very small config files. Works perfectly with git.


I've made some automatic schema dumping scripts for supabase/postgres code to become searchable and git-diffable and PR-friendly. And I eventually was able to make unit testing to work good enough that I even wrote migrations in a test-first way, it was just quicker to iterate. But overall experience felt like I'm constantly combating problems, solved looong time ago for other languages and ecosystems. It was weirdly fun, but what killed my interest is that row level security policies kill performance of even simple queries so much, and EXPLAIN doesn't help to well with it.


Had the opposite experience. Row-level policies had minimal overhead (<5%?) but improved security at the app layer tremendously. It was intensely satisfying to have cases where UI developers were complaining that data was missing. Turned out the access tags were wrong, and the data as tagged shouldn't have been visible in the first place. App had set the wrong tags but turns out would have happily returned the data if the policies hadn't been there.

Apps forget that extra AND on the WHERE clause *all the time*. Just one ad hoc script querying the database can ruin your whole security-oriented day.

Do policies make schema design slower? Yes. Do they make queries slower? Not in my experience, but that may be due to our familiarity with Postgres and its planner. Do they basically eliminate data leaks to the end user? Absolutely.

DB policies to me are like Rust vs C++. Someone maybe able to write C++ faster and with less training, but having those extra checks at the outset can save so much time and heartache down the road. It's an investment, not a cost.


> what killed my interest is that row level security policies kill performance of even simple queries so much

That's shocking to hear

Do you feel that doing access control outside the db is faster overall? (considering it most likely involves more round trips into the db)


Is it on github?


no. it was a private project, and my solution wasn't general enough.

for functions/views/etc: I've used https://github.com/omniti-labs/pg_extractor then removed roles from dump (because the fluctuated between developers) and executed

  perl -000 -i -lne 'print if ! /^--/ && ! /^SET /' `find schema/ -name '*.sql'` || exit 1
to remove a lot of useless comments and flags from the dump (pg_dump output isn't too readable). This oneliner can strip too much, though, comments in functions shouldn't start from the 0 column.

The same script had been run on CI too, to verify that developer didn't forget to run it in the PR.


Have you tried IntelliJ? I honestly do most of my SQL in psql, out of habit. But I'm in IntelliJ all the time for Java, and from what I've seen their query editor is way impressive.


Second this but for Rider, should be the same engine I think.

Jetbrains are really kicking some goals.

Scratch file with a quick query in about 2 seconds is amazing


> There's no IDE that can detect Postgres coding errors, no refactoring support, no Git integration

JetBrains DataGrip does all of that.


This is a big problem, but I am encouraged by this project. https://github.com/supabase/postgres_lsp


Does it support multiple queries per sql file? My use case is to use this to autocomplete queries made with Cornucopia [0]

Cornucopia queries look like this (here is a something.sql file)

    --! authors
    SELECT first_name, last_name, country FROM Authors;

    --! insert_author
    INSERT INTO Authors(first_name, last_name, country) 
    VALUES (:first_name, :last_name, :country);
There are multiple queries each separated by ; and on top of each query, there's a comment giving a name to the query (it's more like a header)

I think the only thing that might require specific support in postgres_lsp is using the :parameter_name syntax for prepared statements [1] (in vanilla Postgres would be something like $1 or $2, but in Cornucopia it is named to aid readability). But, if postgres_lsp is forgiging enough to not choke on that, then it seems completely fit for this use case.

[0] https://github.com/cornucopia-rs/cornucopia

[1] https://cornucopia-rs.netlify.app/book/writing_queries/writi...


Thank you for turning me on top Cornucopia, it looks awesome. I've used the very similar aiosql in Python, but I hadn't realized there was a Rust analog.

To tell the truth I've been waiting for postgres_lsp to mature before trying it out, but based on this example [1] I think it does support multiple queries.

Since it uses a parser extracted from Postgres, the nonstandard syntax would probably trip it up, but there's probably a way to fix that.

[1] https://github.com/supabase/postgres_lsp/blob/main/example/f...


There’s at least one PL/pgSQL linter out there.

When I worked on a system that used a lot of postgres triggers and stored procedures we built a little mechanism on top of our existing database migration tool that would check a directory of plpgsql files and generate migrations when they were created or updated. It worked fine.

It wasn’t the most perfect developer workflow, and I was suspicious when I first encountered the way the software used all of the stored procedures, however I came to appreciate that we were able to be a bit freer with changes to the application code because of this semi-isolated layer that took care of some critical stuff right in the database.


There are tools for Postgres unit testing https://wiki.postgresql.org/wiki/Test_Frameworks. Which is not to say there isn’t any room for improving them.


Write more Postgres functions to unit test Postgres functions. :))


Isn't that normally the premise of testing? Eg writing Python functions to test other Python functions?


Yes but Python is nice to write. Postgres functions are not. At least not to me.


Depending on what kind of deployment you have, you could use Tcl, Rust, or even Python if you could use untrusted extensions. (Not a comment on this particularly testing framework, but Postgres server-side programming more generally.)

But I hear you, PgSQL can be very annoying and unergonomic, and it's not a language most people you're hiring will know upfront. Pushing things onto the backend isn't unreasonable. When I write tests for PgSQL, I write them in Python and run them from the client side, not on the server.


pl/pgSQL is very good (and ergonomic) for a logical extension to SQL, aka set theory programming with intermediate state. It isn't and was never targeted toward general purpose programming like Python.

That said, 100% agree that unit tests should live outside the DB. Querying for sets inside or outside makes no functional difference, and your DB doesn't need all that extra cruft.


There's maybe no standard solution which could drive this forward as a practice, the way JUnit has taken on the Java community (I'm sure there are examples like that everywhere that may resonate). But nothing is stopping you from running Postgres in a Docker container, and wrap it in a unit test with the technology of your choice. The Postgres scripts (ddl, triggers, functions) could live in the same project. Setup, run tests, teardown. Each unit test can set up the data for a particular test case. That way you also codify the Postgres scripts, for which it then becomes "just" a deployment matter to get them from Git into the target environment. I have written all sorts of test harnesses, and I wish more people would think out of the box of their standard tool sets. Sure, I'm a Java developer, or one could be a SQL coder, but nothing is stopping us from writing a darn script.

(By the way, I'm not suggesting running a large data warehouse in a unit test. But the parents said "unit testing", which suggests there exist units of smaller, isolated functionality in your project, if you're willing to find them.)


You have just described Testcontainers [1] , and if you are a Java developer you may want to look into Testcontainers for Java [2].

For example, in one of my projects I use TestNG to instantiate a MariaDB container, run the Flyway migrations on it and then populating the tables with the test data:

  mAPIDBContainer = new MariaDBContainer<>(DockerImageName.parse(MARIADB_CONTAINER_TAG))
              .withDatabaseName("apidb")
              .withPassword("password")
              .withNetwork(containerNetwork)
              .withNetworkAliases("apidb")
              .withExposedPorts(3306);
      mAPIDBContainer.start();
      Flyway flyway = Flyway.configure()
              .dataSource(mAPIDBContainer.getJdbcUrl(), mAPIDBContainer.getUsername(), mAPIDBContainer.getPassword())
              .encoding("UTF-8")
              .locations("classpath:apidb/migrations")
              .load();
      flyway.migrate();
      ScriptUtils.runInitScript(new JdbcDatabaseDelegate(mAPIDBContainer, ""), "sql/apidb/apidb-test-data.sql");
[1] https://testcontainers.com/ [2] https://java.testcontainers.org/


> But nothing is stopping you from running Postgres in a Docker container, and wrap it in a unit test with the technology of your choice

I agree. I'm in .NET land, I use EFCore as my ORM, and I use EFCore's migration features. My ORM models are in a separate project (same git repo) from both the web server project and test project, and any hand written SQL gets added to the migration scripts that EFCore generates.

I spin up a docker container for postgres, my test code clears any existing DB, creates a DB in the container, and then runs the EFCore migrations on the DB. I have simple tests that make sure my CTEs are working correctly and that things like expected unique indices are also setup.

This works both locally and in Github Actions.

I just wouldn't call any of this a "unit" test. I put all this squarely in my integration test suite. I figure if IO is happening as a consequential part of the test (i.e. not setup/teardown), it's an integration test. I wonder how much that distinction is tripping people up? A lot of people think of unit tests as small, independent, and quick, so by only thinking about unit testing, they automatically rule out tests that have app code call out to a DB.

Based on the sibling comment, I'm going to have to take a look at Testcontainers. I'm not sure how much it'll simplify at this point, but who knows!


You lost me at ORM. Once an ORM is in play, you're in integration test territory in my opinion, not unit test.


In the MS ecosystem they have the SSDT data tools and unit test projects. It uses the designer functionality of VS to allow you to write your arrange, act, and teardown in pure SQL, and gives some basic condition builders for result sets tor your assert, with an escape hatch if that's insufficient.


> Use Postgres as my permissions layer?

It's actually so good I regret not having done it sooner. We migrated all our APIs into PostgREST [1] and it reduced our development/testing/deployment time by an insane amount, decreased the number of bugs automatically by removing 1 extra layer and made all our requests faster without having to optimize anything.

I'm genuinely surprised it's not more popular. It would probably benefit like 99% of the small/medium size companies.

Writing SQL is kinda fun too.

[1] https://postgrest.org


We've integrated postgrest into Tembo Cloud already too, one click to deploy it as a sidecar.


Tembo CEO here

I would have agreed w/ you 100% 18 months ago. And presently is is messy do to this in Postgres, but it's also messy to choose lots of tools that create sprawl and data islands.

We're spending a significant part of our R&D effort in 2024 towards making it cleaner/safer to do more in Postgres. We have a lot of work to do to build a great gitops developer experience. We're also prioritizing auto-scale, auto-suspend, and auto-tune projects to help reduce the cost of doing "lumpy" things in Postgres.

Would love to have you as a design partner @aurareturn — want to eventually win you over :) hit me up if you're interested in chatting ry [at] tembo [dot] io


+1 for data in the database, code as... well, code. And also learned this the hard way. Code you can build and run test suites against, extract logs from, place debug breakpoints in, is such a quality of like enhancer.

edit: typo fix


We're working on a gitops DX for Tembo Cloud, that will allow you to ensure everything you build on Tembo is in source control, and runnable locally w/ our CLI. More info coming soon...


Oracle also lets you run Java from stored procedures which I’m sure lets you tie yourself in knots too. There could be slim use cases for doing everything in the db but there’s a reason why it’s not widespread.


Thanks for the feedback. We're actively working on implementing the pricing page for our website. It should be out in the next few days.


I hear your points, but can't git, CI tools, IDEs like DataGrip all be used to test and deploy the database code too versus letting some db admin create functions or procedures with no version control? Also, with things like RDS or readonly replicas, couldn't more analytical queries be done directly in the database versus busing data around?


I generally agree. I think one case that it weren't true was LISTEN/NOTIFY https://www.postgresql.org/docs/current/sql-notify.html - the API is actually simple and well tested.


Note that LISTEN/NOTIFY isn't compatible with 2 phase commit though, which can potentially rule out its use in some situations.


and doesn't scale well. Meaning using it can take your Postgres server down way before regular SQL workload would.


Yeah except for serious data warehouses there's so much data you have to bring the compute in because getting data out is infeasible.

If that's not the case, you aren't running a data warehouse, just a (big, maybe) database.


There will be another blog from us at some point about running the data warehouse at scale. We're already working on integrating with s3 storage, and distributed compute is in the roadmap. Both possible today with open source extensions, and our friends in industry are already doing it.


Is there any good data movement/transformation software for on-prem? Specifically integration of other databases and Excel into MSSQL (<100GB of data).

I have looked at SSDT, which looks really clunky... Was going to use a similar approach to OP, especially because it's cheap and ressource-efficient, but lack of support infrastructure might come back to bite me.


Checkout Airbyte and singer.io for data movement. They're tailored for moving offsite data local, but also have support for moving data between different database vendors. I'd recommend starting with Airbyte.

For transformations I'm a sucker for dbt-core, which executes transformations as SQL against your data warehouse vs. a separate process. This keeps everything simple and (generally) snappy. It uses the Python Jinja2 template library to compose SQL templates, and the process that parses templates to SQL also implicitly creates an execution DAG so you could have hundreds of transformations running in the exact order they need to.

There are also many commercial offerings (Fivetran, Dataform) that don't tick self-hosted, but could be easier to deploy if getting internal hardware support is an issue.

https://github.com/airbytehq/airbyte

https://airbyte.com/connectors

https://www.singer.io/#taps

https://github.com/dbt-labs/dbt-core


I like Pentaho Data Integration. The community edition does everything I need but there is also a paid version.


Pandas? Polars?


SAS.

Just kidding. Fuck SAS. I’ve use some absolutely dogshit tools in my day, and of all those, SAS is downright the most time wasting, incredibly inconsistent piles of shit ever.

Never had a worse experience. Useless. Run far away.


Well put. It's never not funny to me how one of the fundamental tenets of software engineering is 'separation of concerns' yet there's no shortage of people willing to say 'but it's so convenient to mix them at the start of a project...'.


This is the definition of vendor lock in; the more you use embedded non-standard functionality of a product the less likely it is that you'll ditch it.

As long as you're aware of what you're doing it's fine. And you're documenting your usage, right?


If you don't write your database layer with vendor-specific features and optimizations in mind, you're definitely leaving tons of performance on the floor. Multi-DB support for a largely identical schema is a fool's errand.

It's not Postgres's fault that MySQL is comparatively bereft of modern features.

    • transactional DDL
    • RETURNING clause
    • unfettered use of user-defined functions
    • a native UUID type
    • the ability to define custom data types and domains
    • DDL triggers
    • row-level security
    • actual arrays and booleans
    • MERGE
    • range types and constraints
    • IP address and CIDR types
    • partial indexes
    • LATERAL JOIN
MS SQL has temporal table support, the ability to integrate any .Net functionality directly into the database, pivot tables, etc.

It's really hard to explain to lifelong MySQL users what they've been missing in the name of database agnosticism. Using the lowest common denominator to all the engines feels like wearing a straitjacket attached to lead weights.

https://www.sql-workbench.eu/dbms_comparison.html


> vendor lock in

Not sure the word "vendor" is exactly correct when using standard PostgreSQL functionality though, as PostgreSQL has multiple vendors providing that functionality. So, not really "vendor lock in".

The term is clearly correct for databases with only one vendor though (Oracle, etc).


> And the next employee won't know about some random but crucial logic that lives inside Postgres instead of the code.

The next employee should know this, because it will have been documented somewhere.


I love Postgres and relational databases in general, but the lack of documentation generation found in general purpose programming languages for the last 20 years is a glaring omission.

There simply are no good equivalent to Javadoc for SQL databases. (Except you Microsoft. I see you and MS SQL Server tools. I just don't work on Windows.) You have ERD tools that happily make a 500-table diagram that's utterly useless to humans, are ugly as sin, are not at all interactive, etc. Seriously, Javadoc (and Doxygen, etc.) have been on point since the 1990s.

OOP and functional design patterns are common knowledge while 50-year-old relational concepts are still woefully unknown to far too many data architects.

Folks don't write docs. They just don't. They start. They agree it's important. But they either get skipped or they stagnate. The only real option is automatically generating. Reading the column and table comments, the foreign key relationships, the indexes, the unique and primary keys, the constraints, etc. and rendering it all into a cohesive and consistent user interface. Not the data. The structure.

It's the single biggest tooling failure in that sector in my opinion.


Agreed, I learned the hard way that complicated constraints, those that are really good to keep data integrity, are really hard to test and might have unintended consequences.


Hard to test?


A data store *that enforces valid data at all times*. Far too many people miss that part. They skimp on foreign keys, check constraints, and the most appropriate data types. Time is far better spent getting the schema right at the outset than trying to fix bad data already mixed in haphazardly with the good.

And then there's audit trails, which are best made centralized in the database with triggers. "This change was made by X at this time."

Row-level access policies are yet another. Doesn't matter which app or user is querying, what joins are mixed in, etc., the DB simply omits data the user shouldn't see. Period. No "oh, my script forgot to include part of the WHERE clause for that".

Totally agree about pg_cron (and MySQL scheduled events while we're on the topic). It artificially limits scheduled actions to just those that affect the DB, cannot adjust to current DB load, and are hard to find. Better to run cron jobs all accumulated in one place from the cloud provider a la EventBridge or from a dedicated schedule server. (I'm not a fan of app cluster leader elections either for complexity and opacity reasons as well. Put the schedules in one place to be audited and enumerated as needed.)


What exactly is tembo? and how does it compare to a supabase? You have me curious now.


we just launched earlier this month - manifesto: https://tembo.io/blog/manifesto


I would not recommend this approach to anyone, for these reasons:

- If your data is small enough that Postgres works as an analytics warehouse, then the cost of running a warehouse like BigQuery are going to be small too, and without the complexity of managing the infra yourself. The company here specializes in Postgres, so it makes sense for them to do this, but for everyone else it makes much more sense to use a warehouse so you can focus on building your product and not worry about scaling your analytics infra.

- Because Postgres doesn't scale for analytics, they only keep 30 days of metrics. I don't think this would be acceptable for most businesses. BigQuery costs 2 cents per GiB per month, so storage is generally a minimal expense until you get to pretty large scale.

- Tools like Fivetran and Airbyte are much simpler for maintaining these pipelines than setting up FDWs, and each support a huge number of data sources. If you're worried about cost, you can self-host Airbyte or use Stitch (which is kind of a budget player in the space). It's true that Fivetran can get very expensive as you scale, but there is enough competition now that you have options to reduce cost.

Disclaimer: I run a data consulting company that specializes in zero to one data projects for startups, including setting up analytics warehouses.


Counterpoint:

- The cost of running a warehouse like BigQuery is pulling in Google Cloud, or whatever 3rd parties like the ones you mentioned, and integrating with it somehow.

How you gonna get the data in there and back out? What should the ACL be? Sure might be a trivial amount of data, heck, could be 1GB. Depending on the type of data (and company) there's potentially a lot more busywork involved here than just shuttling files around and trivial scripting.

"Postgres doesn't scale for analytics" and "you should not spend money on an internal guy/team to run a database, you should blow it on Google and data consulting companies instead" is a bit of a blanket statement.

Seems like teams with great DBAs will go with what they are familiar and teams that are cloud-all-the-things will go with what they are familiar. But the real choice here isn't even technical in the narrow sense, it is strategic.

You wouldn't recommend anybody use Postgres, really? Man, so many teams are doing it wrong I guess.


I can qualify "Postgres doesn't scale for analytics" a bit: You cannot run large scale aggregations in a row-based database like Postgres the way you can in a columnular data warehouse.

If all you need is daily aggregations of metrics like counts of certain events, then Postgres will be fine for a long time. But as soon as you want to ask more complex questions like 'What is the average time between a user signup to the first time they do a certain event?' or 'How many distinct users did X in the past month?', Postgres is going to struggle. It's not impossible to get Postgres to do these things, but every additional question like this will require figuring out how to get Postgres to do it in a reasonable amount of time, whereas a data warehouse is designed to do these kinds of aggregations very efficiently.

In terms of the integration, I think you'd be pleasantly surprised by how easy these tools make it to pipe data into your warehouse. Outside of setting up permissions, you don't really even need engineering. A strong data analyst could do the job.

For startups, engineering time is often the most scarce resource. I strongly believe that startups should not spend time building custom analytics solutions when so many great tools exist off the shelf, and instead focus 100% on building the core product.


Totally agree with the problem, and need for a columnar store. But isn't BigQuery too slow for powering an interactive website? My goal building a dashboard type tool is to have all the queries return in <1 second. I didn't specifically try BigQuery, but after reading about it I went with Clickhouse mirroring a postgres instance. This seems to scale perfectly pleasantly up to tens of millions of rows of timeseries data.


Yeah I agree with you wouldn't want to point a dashboard directly at BigQuery. There are two cases here:

- You don't need realtime data, but you want to serve the data quickly. If you wanted to use Bigquery in this case, you'd have the query running on some schedule, and then store and serve the results from whatever data store you like. This is what most dashboarding tools are doing under the hood (e.g. Mode, Metabase, etc.)

- If you need real or near real-time data, then an OLAP system like Clickhouse makes a lot of sense!

I'm definitely not trying to say that BigQuery is right for every situation. It's more that there are very few situations that I can see Postgres being the best workhorse for your analytics.


> In terms of the integration, I think you'd be pleasantly surprised by how easy these tools make it to pipe data into your warehouse. Outside of setting up permissions, you don't really even need engineering. A strong data analyst could do the job.

Sorry I tried to say as much myself, maybe I wasn't clear. More succinctly put the busywork tradeoff is between keeping your data internally and paying down the required DBA legwork or plunking it into a third party which often has implications beyond the technical and creates work for lawyers and managerial types (if its HIPAA type stuff, sensitive or involves multiple jurisdictions) plus:

"you don't really even need engineering" - one wonders what I need the data consultant for. Another cost. Another trade off. Another point of possible failure.

Somewhere on that continuum is an infliction point and clearly this does not hold: "I would not recommend this approach to anyone" since "If all you need is.."

P.S. - I used to work in AdTech as a consultant myself and was shown early previews of things like Lambda and BigQuery. ;)


> will require figuring out how to get Postgres to do it in a reasonable amount of time, whereas a data warehouse is designed to do these kinds of aggregations very efficiently.

"reasonable amount of time" is speculative term, it could be good enough for some usecases.

Also, you can do indexes in Postgres, but can't in say BigQuery, meaning it is more likely I can build low latency analytics in Postgres but will have issues in BigQuery.


Redshift: columnar database with a Postgres query interface.


There are definitely ways to cleanly make Postgres scale for analytics. We didn't discuss in this blog, but we will be writing about them in the future. For example, check out what the folks at ParadeDB are doing. https://github.com/paradedb/paradedb. Neon is doing an awesome job separating compute from storage. Supabase contributed foreign data wrappers make it super easy to read from S3 into Postgres. Lots of great work going out there :)


I totally agree that there are great solutions out there to use Postgres for specific analytics tasks, especially realtime ones, but if you want something that can handle arbitrary aggregations and do ad hoc analytics, nothing comes close to a standard data warehouse. Lots of exciting things happening in the space so that might change!


One of the authors of ParadeDB here. While ad hoc aggregations in vanilla Postgres are slow due to lack of column-oriented storage, extensions like pg_analytics are addressing that problem. In our view, one of the main use cases of a data warehouse is when you need separation of storage/compute, which enables distributed analytics and scalable storage. With Postgres, query processing + storage are all happening on the same node. That being said we're looking at ways to separate out storage to an external data lake as part of pg_analytics.


Very cool! I’ll be super interested to see how this develops.


Very good response, can I ask you for a way to find your company? I am working in the same space.


The first data warehouse at Gitlab was postgres! It worked quite well until we started pulling Snowplow event data.

I share some of the concerns of other commenters of having postgres do everything. Having a separation of concern around orchestration in particular seems wise. That said, getting something up and running using just pg makes sense and it's always possible to refractor.


I'm always amazed (and a bit frightened) by the amount of logic that can be implemented directly in the database. How is the code debugged / managed / versioned / deployed ? I would be thankful for any pointer to books / blog posts about that.


I work with Oracle daily and implement a lot of logic directly in the database.

Debugging: Extensive logging to tables [0]. Also we have dev, test and prod databases.

Versioning: Git. It's just source code that gets compiled in the db.

Deployment: Upgrade SQL scripts. You already have to do this on any relational DB if you need to alter existing tables. We just also deploy new/updated packages. We trigger them via pipelines.

Also keep in mind that the logic you deploy in the database is generally not as complex as other software as you mostly just query, modify and write highly structured data.

But we still run plenty of tests. There is a great unit testing tool for Oracle: utPLSQL [1]. We also spin up databases and run the installation and upgrade scripts on pull-requests.

[0] https://github.com/OraOpenSource/Logger [1] https://github.com/utPLSQL/utPLSQL


In data engineering, there are frameworks like DBT that do exactly that. In fact, these are industry standards and the recommended way to do transformations and cleanups nowadays. This is essentially a mix of sql and jinja (and yaml files, for variables), you can create your own macros, it comes with it's own testing framework and also strict sql code formatters. Fits git flow quite well. The rationale is that it enables data analysts (data analytics engineers) to do quite sophisticated stuff still using sql. Also, if you are operating on datasets that are larger that a single machine can process, doing it in sql and passing to MPP engines like BigQuery and Snowflake are probably the only way to do it with relative ease.

In any case, this is for data engineering only. I wouldn't imagine doing this for live production stuff.


I personally put some logic in the database especially when I’m expressing constraints. If it’s there it means another engineer can’t go directly to the database to bypass these constraints. (By logic, I’m meaning for example a transaction can only transition between states if it’s in a required state).

When it comes to debugging, versioning, deployment all these live alongside the code and are managed via migrations. Testing it is done as an integration test with the rest of the system.

It helps that we don’t use an ORM and deal with SQL everywhere.


In my experience, chances are that the database will outlive whatever application code is layered right on top of it.

So ensuring the database itself protects the data integrity and prevents the application code (current or a future refactor or rewrite) from messing it up, sounds to me like the sane thing to do. Be it with triggers, with functions, or whatever.

Though I can understand that people usually don't like how PL/pgSQL looks like (I don't). But if you ignore the ugly language syntax, testing it is no more difficult than testing, say, an AWS Lambda function that is triggered by SQS and writes stuff to DynamoDB.


I've been designing a somewhat trivial application this week - representing time series data in a reliable manner (basically a holding pen for stuff that will land in opensearch for good visualisation tools).

At one point I was thinking "well I can put that column in the main table so long as I don't fire the 'when_changed' trigger if there's an insert/update on any other column. After about three minutes, I decided the design needed normalisation after all ...

Last year I moved a mostly small but very non trivial database from oracle to postgres. And I cursed the name of every developer who decided on non-trivial logic inside the database along the way.

A few years ago I made some expiry logic inside of some postgres triggers, and it worked really well and was rock solid. However we moved it out of the triggers into the application PDQ because it would never have been resilient changes in requirements. Nonetheless, prototyping the logic in postgres was good, but it absolutely did not belong there for the long run.


You might find some info in the docs of PostgREST [1] or in the previous discussions on HN about it [2].

For the versioning, I just have a git repo where I keep the definitions of every role, schema, table, view, function, trigger, grant, policy, etc. Every time I change something in the database I first change it in the git repo too to not lose the history. It also helps as a reference for future development, like if I need a trigger function I can just search one in the repo and copy/paste it.

[1] https://postgrest.org

[2] https://hn.algolia.com/?q=postgrest


Put your functions/procs into files. Check them into git. Use Liquibase (or flyway, etc.) to wrap the files as migrations with rerun-on-change. Deploy by running the Liquibase cli, or directly from your app on startup.

For testing, write some tests and run them. I used ut_plsql when I was working with Oracle.

For debugging, a log table is easy. Have a log() function that inserts into a log table inside a tx-new. You can also use an interactive debugger. Postgres, Oracle, and MS all provide gui debuggers. They aren't as advanced as IntelliJ, but they let you set breakpoints, inspect variables, etc.


For the versionning part there is usually tools in your language. For ex: Flyway for java, diesel_migration for rust ...


I don't know... the places I've worked where there was that much of the application shifted into the database layer, there would be no Flyway. You'd be happy to get write permissions, let alone permissions to update the structure. Perhaps correlated, those organizations managed the versioning part through a heavy-handed change management process, ie. humans. Perhaps in some places it's cool now to do this, but for me having the entire application logic modeled in the database will always be associated with painful enterprise culture.


We're already using Liquibase where I work (I don't remember exactly why it was chosen over Flyway.) I also worked at a company where every DB access had to be done with a stored proc. The stored procs would be reviewed by the DB team. They were versioned by using a version number in the name, like GetUsers_v1, and multiple versions could exist at any given time in the DB.


liquid base (or something like it) and a bit of forthought is the answer. Change your databse an order of magnitude slower than your higher level code bases. I do think the trend to nosql and document oriented db's was a results of people seeing just the sorts of messes you can get into with things like Oracle and Pg with stuff over the longer term.


Ok, on balance I do not advocate putting logic, particularly iterative or nontrivial parsing logic, in the database, but the more analytical and SQL-oriented/friendly the logic is and your infra/data is, the more tempting it can be. I have gone down this road and while I debate its merits, I also think it’s under-rated/under-tried.

The key missing tool for a conventional programmer interested in the topic to consider is liquibase/flyway. More on that in a bit.

How is db code debugged? Print statements and/or log/warn/error tables populated by a simple logging stored procedure you sprinkle in your code. Plus intermediate tables that contain intermediate state of a computation/data-wrangling. The former is crude vs IDE step-through debuggers but workable; the latter is (arguably) better than most programming languages which don’t let you retrieve intermediate RAM state or let you inspect/query them in as flexible a way. Would I rather pore through gdb dumps (or pickled serialized custom checkpoints from some language’s data structures)? Or query tables? Hmm…

You can also debug by creating TDD test frameworks for your logic-encapsulating stored procedures (“sprocs”). For each sproc, you create three small test sprocs. 1) a mock data setup sproc which idempotently inserts mock test data needed for testing different scenarios your code will encounter 2) a mock data tear down sproc which removes the test data and 3) one or more test execution sprocket which first calls sproc#2 then #1 then calls your main logic/state- changing sproc with whatever input parameters you want to check, and then inspects the resulting output values or database state changes and emits/returns testname, PASS/FAIL, and failure reason message as its return values or as its dataset it returns.

Write your test sprocs first, then run an empty test stub of your main sproc code which should fail the test, then write+edit+debug your code until it passes the tests. Presto, debugging database code TDD-style!

How is db code managed/versioned/deployed? In git, with liquibase/flyway called by your CICD process (Jenkins with maven+liquibase for Java apps, Jenkins+liquibase CLI for other types of apps.)

Liquibase lets you define+execute a series of SQL statements as a series of “change sets”. (The changeset definition and properties are configured via structured sql comment annotations before+after one or more sql statements. These statements are within an otherwise conventional “.sql” script that is then read+parsed+executed by a liquibase executable/.jar called by maven/CICD/etc. Liquibase maintains its own private state of whether a changeset has run or not, and you can annotate with each change set definition whether that change set “runs once”, “runs on change”, only if the sql statement was edited since last run (ie liquibase detects its hash of that sql statement code changed) or “run always”. If your .sql bombs out in the middle, liquibase-executed.sql (unlike a conventional .sql piped to your database) just starts off where you left off code+data deployment-wise when it runs the second time, since it knows which changesets have exited successfully and you’ve effectively annotated which should rerun or be rerunnable.

Given all that, you create a master list of .sql files, run through them all each CICD build/deployment with liquibase. Most DDL table creation sql in your .sql code should be configured to be changesets annotated to run once, inserts of reference data likewise, permissions, grants, user creation, etc. similarly. To edit those after they’ve run, just add ALTER SQL statements as a later changeset. Slightly differently, stored procedure or SQL VIEW (re-)creation would be annotated to “run on change”, so if liquibase detects (via hash) you’ve edited that sproc it redeploys it, otherwise it skips rerunning/redefining it. Thus workflow-wise you edit files with that sort of code much like you would any more conventional programming code. Your test suite sprocs should “run always” presumably.

Convention-wise, to make code manageable, I put chunks of related sql in similar files, also putting stored procedures in different files than ddl since that fit my mental model best, and put execution order number prefixes in my liquibase .sql filenames to make the mental model of required/desired execution order very explicit. 1_schema_setup.sql, 2_user_setup.sql, 3_initial_table_setup.sql, 4_initial_data_load_from_csv.sql, 4b_core_views.sql 5_config_sproc_test_suite.sql, 6_config_sprocs.sql 7_core_sproc_test_suite.sql 8_core_sprocs.sql 9_<major_v2_feature>_setup.sql, 10_<new-non-core-oriented sproc>_test_suite.sql, etc.

In theory, if the cumulative DDL gets too complex, you can just reverse engineer a clean db schema and refactor/blow away all the delta-type code.

Liquibase annotations also let you have preconditions and postconditions for each changeset that you can configure to skip execution, fail the change set/job, or execute rollback or other arbitrary sql. So before you have liquibase do some expensive or nonidempotent operation, you can pre check via your own sql if it was done already if you want to be safe or assert some precondition that must be enforced before safely proceeding. When defining an sproc in a changeset, you can configure a post-condition check if the related test sproc returned “PASS” and onFail then run the rollback sql for that changeset which could be basically a copy of the earlier sproc definition code.

Anyway that’s what I did on a team that had (relatively) high engineering standards. Never did write it up in a proper blog post so the above is not quite a cookbook but should give you a flavor of what is possible.

It does take a bit of an app developer + db developer mindset to appreciate/internalize though, and many people are one or the other.

The context of this effort was some SQL code that was the heart of an analytics signal detection engine using stored procedures running over a data warehouse coupled with a Scala app that ended up scanning over its lifetime tens of billions of dollars of big pharma orders for “unusual” orders needing human review. So it can be done in a real production app running over some years with enhancements.


This was probably my longest HN post, when the requestor only asked for a link to a blog/article. So I did create one with the above content not buried in a thread: https://medium.com/@gregwdata/treating-sql-like-code-1030328...


Awesome, thank you!

> In theory, if the cumulative DDL gets too complex, you can just reverse engineer a clean db schema and refactor/blow away all the delta-type code.

We use Liquibase at $WORK and I often end up playing the whole changelog locally to then inspect the results on the DB server. Recreating a mental model of the DB structure by reading the changesets gets really hard really fast.


Thanks!

Yeah, mental models of db/DDL structure via changesets is really only meaningful the original engineer. The difficulty of that is why I never maintained views or stored procedures or user permissions as delta-like changeset. They were runOnChange and grew within their own fixed files that I occasionally refactored so I didn’t have to think about deltas for those (that change set history could be seen in git if needed.)

(I mentioned “run once” in my original post, but liquibase actually doesn’t have that; I used runOnChange with preconditions which would MARK_RAN if a sql statement revealed logic had run before already.)

I wonder in hindsight if I just could have reverse engineered the database once a month (or per major release) and dumped that into a git folder to have a current view of data structures that could easily be seen or consulted…


> This becomes a huge cost in the form of cognitive overhead for the team in addition to the time and resources it takes to set up, manage and maintain it

The problem with using one system that you know well for everything is it becomes necessary to build (usually suboptimal) systems that emulate well known ones. In the end you trade one snowflake for another.

That being said, there really ought to be standardized approaches to auth, backup and operations as various services today do things very differently and this does indeed increase cognitive load.


This. If it weren’t for auth and backups, “docker run prometheus” could be 100 times easier than coming up with your own schema and setting up foreign data wrappers in your already running postgres. Faster too, row vs column storage, etc. Prometheus integrates with every other tool in existence so there is very little reason to wrap it and loose those integrations.

With FDW you don’t get off the hook of operating the backing data store anyway. You only get a consolidated query frontend.

Kubernetes has solved a lot of the operational hassle of running multiple services. Standardized storage is still a nirvana, if you want to care about consistency, knowing if your database has a WAL or if you need to flush before snapshots, point in time, delayed replication and all that. Getting better, with many new databases connecting straight to S3 or equivalent.

Standard auth was supposed to be solved but oauth2 is still a fragmented, difficult mess, often only part of the enterprise version.


I wish you well with FDW. I thought postgres_fdw seemed great and tried using it to construct cross-database queries. Unfortunately it performed like a complete dog for any kind of batch DML or mildly complex batch select statement, despite me trying every possible trick I could find online. Apparently it's a known drawback with postgres_fdw, supposedly fixed in later versions (which my organisation wasn't ready to upgrade to). Finding much pertinent information about that particular problem either in PG docs or online was a challenge in itself.

It turned out to be far quicker, if less than ideal, to have a program with a connection to each database act as middleman because the queries it ran were locally optimised and could be tuned properly.

All this may, of course, be a sign of a major gap in my Postgres-fu and I accept all derisory replies with appropriate humility.


I definitely agree that using a tool like Fivetran is generally needless complexity when FDWs would get the job done.


One thing that I think would be awesome is if there was a Postgres extension for streaming queries, where you could join tables with millions of rows and when any of the upstream data is changed you’ll be notified of the delta changes.

I tried Materialize but it seemed incredibly wasteful to duplicate terabytes of data and was also extremely slow compared to the same query running in Postgres.

For now I’ve rolled my own using Debezium CDC and simply reevaluating entire queries with joins when any data changes, but having something more smart would be ideal.


Check out what the folks at https://docs.epsio.io/ built. I'm not aware of any other projects doing this, but I think it's a great idea.


This looks pretty cool, thanks for the pointer! It would be nice if there was a way to create a stream using this where the materialized data doesn't actually get stored on disk in a materialized table, instead it just streams changes to e.g kafka.


RethinkDB attempts to do something like this, but I agree, an efficient SQL version of this would be nice.


The website does a terrible job of communicating what they are. Had to click to the docs to kind of get a quick summary.

In the context of their business I guess this writeup makes sense, it is afterall a marketing piece for their service. I don't agree with it though. I get the simplicity of running everything in a single db stack but I don't think any of the tools they listed were really going to be required to build something like this into another data warehouse.


thanks for feedback. website will get better soon, we're adding more marketing content around the product and pricing this sprint


I should have been more constructive. My issue was on the top fold of the entry site it did not immediately connect to me what the service solves. Why I need the service. Hope that helps!


> We were able to build a data warehouse that is easy to maintain, and easy to reason about, and quick to onboard new engineers.

A data warehouse?

Ingesting raw data != a data warehouse.

I presume their cron jobs are significantly more complicated than that, implementing their domain logic in functions, which will not scale well into an easy to maintain system over time as all database function implementations do… or, what they’ve built is something…

…but that thing is not a data warehouse. It’s just a database.


"Others might have a data warehouse, but we get by on Postgres" ;)

(PS, I love Postgres.)


I was looking for a way to run migrations across SQLite and Postgres without requiring an unrelated language ecosystem (e.g., Node, Python, Ruby, Java, etc), and discovered Shmig.

It's pretty great.

https://github.com/mbucc/shmig

Would love to see DB unit test infra with the the same values.


Fyi, The pricing page[0] is unnecessarily hidden (4-5 clicks). Please make it part of the top menu or show it with the free trial screen.

[0]: https://tembo.io/docs/tembo-cloud/billing


Data warehouse use-cases in Postgres for me generally rely on calculating counts/stats over time, and estimates are sufficient. I really like HyperLogLog for this use-case—fast performance, low storage use, and there’s a great Postgres extension that’s widely supported.


Netezza's DWH appliance was built on top of Postgres [1].

Building datawarehouse on top of postgres is not something thats new. It is definitely lost art.

[1] https://en.wikipedia.org/wiki/Netezza


I love postgres too. We went from a brittle event sourcing architecture to postgres_fdw, works really well. We do something similar with a few differences:

We swap entire schemas on every run (one schema being one upstream database or collection of analysis) atomically instead of deleting data and then inserting data. This means anyone querying the dw never sees an empty table, it's yesterdays data until instantly its todays data, consistently for the entire source. If something breaks the data is just stale, never missing.

We don't do long running selects from foreign tables. This is because a long running select of a huge table can take many minutes. If an upstream database deploys a migration which changes the database schema that migration could time-out and stop the deploy while waiting for the select to finish so it can aquire its ACCESS EXCLUSIVE lock. This means we never hold even ACCESS SHARE locks for more than a couple of seconds. Basically the agreement is "the DW should never interrupt the operation of the upstream databases". We do this by a custom batched_create_table_from procedure. For integer id tables it's trivial. For uuid tables we fetch all the uuids in one go into a temp table with an integer id and then get chunks of uuid values at a time. This works well because we're always sending indexed id values to the foreign server so the foreign server can use its primary key indexes to only give us the data we want for that chunk, without scanning the entire table. It's a bit of a hack and I wish postgres_fdw's fetch_size parameter would release locks between fetches. Of course that would break atomicity so I'm doubting it's gonna happen.

We run an external cron schedule which launches an ECS task that runs a bash script with our sql files that define the DW's schema. It's a really primitive DAG with bash & and wait. No scheduling happens in the db itself. This means we can schedule, check logs, monitor, etc like all our other ECS tasks. Some bash magic lets us define basically a DAG, though we run blocks of independent tasks (sql files) instead of building a proper dependency tree between individual tasks.

I learned about ctid recently, that could have worked for the chunking.

We're currently moving to bigquery and DBT for a few reasons. Primarily to be able to handle silly amounts of user interaction data where the experience for analysts is just much better in bigquery. DBT is also really nice, though I do miss the simplicity of just sql files. The development setup for the old DW relied on rebuilding it locally on a developer machine, which eventually ran into storage limitations and people getting lazy and not rebuilding from scratch before submitting their PRs, we just never got around to building proper CI for it. Rebuilding our postgres DW from scratch takes a few hours now, the same logic in bigquery runs in 10 minutes, which is undeniably amazing.


Forgot to add this, which IMO is the canonical "Data warehouse on postgres" talk: https://www.youtube.com/watch?v=whwNi21jAm4 Lots of really good tips and info in there.


By "customer data warehouse" I presume they mean "customer database"?


I understood that it's more like "customer usage metrics"


Sounds expensive.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: