Looking forward to part 2 / Postgres vs. Elasticsearch. One application at my company uses PG for CRUD on objects and Elastic for searching them. We completely underestimated the effort of keeping the two datastores in sync and are actually contemplating getting rid of Elasticsearch.
A technique I've used before is to treat Elasticsearch as rebuildable at any time. Consider this approach:
A cron job runs every 5 minutes that looks at your database for any objects you're indexing whose last_modified_at timestamp > last_indexing_started_timestamp.
Index those objects into Elasticsearch, then update last_indexing_started_timestamp to the time this sync run started, so that any objects modified between the start and end of the run get caught on the next run.
Then if Elasticsearch needs rebuilding you can just clear out the last indexing timestamp and resync from the beginning of time; it's self-recovering and won't get out of sync.
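The core of that loop can be sketched roughly like this. The `last_modified_at` / watermark names come from the description above; everything else (the in-memory rows and the `index_fn` stand-in for a bulk Elasticsearch call) is hypothetical:

```python
from datetime import datetime, timedelta

def sync_modified(rows, watermark, index_fn, now_fn=datetime.utcnow):
    """One cron tick: index every row modified since the last run started.

    The returned watermark is the time THIS run started, so anything
    modified while we were busy indexing gets picked up on the next tick.
    A None watermark means "resync from the beginning of time".
    """
    run_started = now_fn()
    for row in rows:
        if watermark is None or row["last_modified_at"] > watermark:
            index_fn(row)  # stand-in for a (bulk) Elasticsearch index call
    return run_started     # becomes last_indexing_started_timestamp

# Simulated usage with an in-memory "table":
t0 = datetime(2023, 1, 1, 12, 0)
rows = [
    {"id": 1, "last_modified_at": t0},
    {"id": 2, "last_modified_at": t0 + timedelta(minutes=3)},
]
indexed = []
watermark = sync_modified(
    rows,
    t0 + timedelta(minutes=1),            # last run started at t0+1min
    indexed.append,
    now_fn=lambda: t0 + timedelta(minutes=5),
)
# only row 2 is re-indexed, and the watermark advances to this run's start
```

Because the watermark is set to the run's start rather than its end, an object modified mid-run is indexed twice; that's harmless as long as indexing is idempotent, which is exactly the property the commenters below rely on.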
I've done exactly this for about a decade and it has worked supremely well. Robust and resilient because it's simple and idempotent.
In my case I'm using Solr and my last_indexed field isn't written to until the Solr index call completes without error. I have a very basic lock on the indexing process which hasn't failed me yet, and if it ever did fail the consequences would only be wasted CPU cycles. I consider that a lower risk than updating last_indexed only to have the actual indexing fail unexpectedly.
In the rare instances I've needed to re-index from scratch the process has been incredibly simple:
1. Start a new instance of Solr on a powerful AWS instance and direct index updates to it
2. Set all last_indexed fields to NULL
3. Wait for the scheduled task to complete the re-indexing
4. Reboot the new Solr instance on a right-sized AWS instance
5. Shift to the new Solr instance for search engine reads
Tried this as well, but the difficulty is that Postgres is a relational database whereas Elasticsearch stores schema-less documents.
Your record in ES might include data from many different tables, and figuring out what to (efficiently) update when there is a change in Postgres is not a simple task.
For me, a shotgun approach seemed the least likely to break.
Anything that is a dependency of the Elasticsearch index should trigger a job to export to it. And since the export is idempotent, it doesn't matter if a background job accidentally exports the same index entry two or ten times. Just make sure before writing that you do a quick check that you're not overwriting a fresher one: keep a freshness timestamp which is the latest timestamp of any record used in the indexing data.
Furthermore, you can do a daily job to just re-export a critical part of the index, whether or not it's fresh. So, let's say you query all records that were modified in the last day and trigger the export job that may include each record. Even if it causes duplicate work, idempotency saves you there.
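The freshness check described above is just a conditional write. A minimal sketch, with a plain dict standing in for the search index (all names hypothetical):

```python
def upsert_if_fresher(index, doc_id, doc, freshness):
    """Write doc only if it's at least as fresh as what's already indexed.

    `freshness` is the latest last-modified timestamp of every source
    record used to build the document. Re-running the export job with
    the same or older data is a no-op, so duplicate background jobs
    (two, ten, or a daily blanket re-export) are harmless.
    """
    existing = index.get(doc_id)
    if existing is not None and existing["freshness"] > freshness:
        return False  # a fresher export already landed; don't overwrite it
    index[doc_id] = {"doc": doc, "freshness": freshness}
    return True

index = {}
upsert_if_fresher(index, "a", {"title": "v1"}, 100)   # first write lands
upsert_if_fresher(index, "a", {"title": "v2"}, 200)   # newer wins
upsert_if_fresher(index, "a", {"title": "old"}, 150)  # stale, skipped
```

In a real setup the compare-and-write would need to be atomic on the index side (e.g. versioned writes), but the shape of the logic is the same.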
Perhaps include a "last modified" timestamp w/ timezone in tables of interest, PG can update this on a trigger so no app code has to change. Index this field. Then build a view on top of the relevant tables that assembles the document for ES. Include in the view a field which contains the most recent of all the "last modified" dates, and filter the view on that timestamp?
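A rough sketch of what that might look like in Postgres (table, column, and view names are hypothetical, and this is untested):

```sql
-- keep last_modified current without any app code changes
CREATE OR REPLACE FUNCTION touch_last_modified() RETURNS trigger AS $$
BEGIN
  NEW.last_modified := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER articles_touch
  BEFORE INSERT OR UPDATE ON articles
  FOR EACH ROW EXECUTE FUNCTION touch_last_modified();

CREATE INDEX ON articles (last_modified);

-- view assembling the ES document, exposing the most recent of all
-- contributing tables' timestamps so the sync job can filter on it
CREATE VIEW es_documents AS
SELECT a.id,
       a.title,
       u.name AS author,
       GREATEST(a.last_modified, u.last_modified) AS last_modified
FROM articles a
JOIN users u ON u.id = a.author_id;

-- the sync job then just runs:
--   SELECT * FROM es_documents WHERE last_modified > :since
```

The GREATEST() column is what makes a change in any joined table surface as a "dirty" document.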
I had a similar setup at a previous gig and didn't find it particularly challenging. Whenever an update happens to an entity in PG we send a message to async replicate that entity in ES (via ID lookup in PG). As always you need good monitoring and retries for async computation but ES is quite stable and fast so we rarely had any issues.
Now, we had pretty lax consistency requirements, as long as the "latest state" of PG ended up in ES within a reasonable timeframe everything was fine so maybe your requirements are different.
Yeah, I think more folks should learn about postgres LISTEN and NOTIFY functionality, https://www.postgresql.org/docs/current/sql-notify.html, it's great for stuff like this, and there are now various client libraries like pg-listen in NPM that make this really easy to use.
This is especially useful if you have lots of different teams/code paths that may update your DB - just set up a trigger that causes a NOTIFY message to get sent, then have a client responsible for reading from PG and populating ES. Alternatively, if you can accept a bit more latency, just have a trigger that sets a "needsESIndexing" dirty column somewhere and have a polling process that picks rows WHERE needsESIndexing = TRUE and just updates this to FALSE when the indexing is complete.
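The polling variant is easy to sketch. Here's the core pass, with an in-memory list standing in for the table (the column name comes from the comment above; everything else is hypothetical):

```python
def poll_and_index(rows, index_fn):
    """One polling pass: index every dirty row, then clear its flag.

    The flag is cleared only AFTER index_fn returns, so a crash mid-pass
    just means the row is picked up again on the next pass. That makes
    the whole scheme safe as long as indexing is idempotent.
    """
    for row in rows:
        if row["needsESIndexing"]:
            index_fn(row)                 # stand-in for the ES index call
            row["needsESIndexing"] = False

rows = [
    {"id": 1, "needsESIndexing": True},
    {"id": 2, "needsESIndexing": False},
]
indexed = []
poll_and_index(rows, lambda r: indexed.append(r["id"]))
```

In real SQL you'd want the equivalent of `UPDATE ... SET needsESIndexing = FALSE WHERE id = ...` only after a successful index call, for the same crash-safety reason.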
It has been several years since I worked on a system that implemented listen/notify, but I recall there was a significant caveat that if the listener was unavailable at the exact moment the NOTIFY executes, the message would be lost.
That’s a significant risk for things that need to be in sync between two systems, so we stuck with listen/notify for info-level ops things and used polling/queue systems that offered better guarantees for more important tasks. Don’t want to be in a position where a quiet hiccup with a deploy or something results in fun bugs like 0.5% of rows being silently out of sync between ES and Postgres.
You could put a NATS JetStream connection in line and you'd have a good buffer to catch the notifications. Not foolproof; beyond that you'd still need some kind of batch process that looks for any outstanding changes and syncs them.
We do something like this with our systems. External events get written to the event bus but all operations are idempotent on the event bus. So at night we send another round of the days events to clean up any inconsistencies.
You can also just use CDC to import from Postgres into Kafka using Debezium, and then on the other end use Kafka Connect to write into Elasticsearch; no need to write any code at all with this setup. YMMV, of course.
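"No code" here means connector config. The Postgres source side is roughly the JSON below; connection details and table names are made up, the option list is abridged, and the exact required options vary by Debezium version, so check the docs for yours:

```json
{
  "name": "pg-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.example.com",
    "database.port": "5432",
    "database.user": "debezium",
    "database.dbname": "app",
    "table.include.list": "public.articles"
  }
}
```

The Elasticsearch side is a similar JSON blob for the Confluent Elasticsearch sink connector, and Kafka Connect then handles offsets and retries for you.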
If you already use Kafka anyway it's a valid solution; I wouldn't introduce Kafka just for this, obviously. Also, Kafka Connect and Debezium are mature, well-maintained open source projects, in case that wasn't clear enough.
I went a similar route (Postgres for CRUD and Elastic for searching) and also underestimated the effort of keeping the two datastores in sync, as well as the effort of maintaining a reliable Elastic cluster with limited manpower/experience. After moving to Postgres full-text search with indexes and query boosting, I accomplished everything I needed inside Postgres, with update triggers and search queries that were incredibly performant.
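For anyone curious what that looks like, a minimal Postgres full-text setup is roughly the following (table and column names hypothetical; the generated column needs PG 12+, and on older versions you'd maintain the tsvector with an update trigger instead, as the comment above describes):

```sql
ALTER TABLE articles ADD COLUMN search tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
  ) STORED;

CREATE INDEX articles_search_idx ON articles USING GIN (search);

-- ranked search; websearch_to_tsquery accepts search-engine-style syntax
SELECT id, ts_rank(search, q) AS rank
FROM articles, websearch_to_tsquery('english', 'sync elasticsearch') AS q
WHERE search @@ q
ORDER BY rank DESC
LIMIT 20;
```

Query boosting can be layered on by weighting fields with setweight() when building the tsvector.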
I have done this. It was a challenging task, even with significant experience, and some results aren’t as good (most are as good, very few are better).
Overall it was a success - our ops work is very much reduced, enough so to have easily paid for the engineering time invested. Not to be undertaken lightly though.
Depending on your needs, you may be better served by materialised views, normal views, or triggers. The builtin text search may not suit your use cases; it’s not necessarily hard to come up with alternative schemes.
>We completely underestimated the effort of keeping the two datastores in sync and are actually contemplating getting rid of Elasticsearch.
Can you expand on this? Is it that it's tedious to write code that updates both? I've been meaning to play around with meilisearch and was trying to think about the synchronization issue.
This is exactly why I (as a solo developer at the time) just used Postgres' built-in fulltext search when that was a requirement. Worked out great, although the "simple" interface to it was a bit too simple and I ended up having to build up a DSL to translate searches into the more complex interface to the fulltext index.
And any modifications to any field that was indexed, or having to update how things were indexed, was a chore thanks to referential integrity enforcement at the DB level: I had to remove and afterwards reapply things like foreign key constraints, triggers, stored procs, etc.... for both the "up" AND the "down"! Fortunately, since Postgres lets you make schema changes in a transaction, there usually wasn't anything to worry about integrity-wise.
I think they are not comparable. For some basic queries maybe PG is enough, but I don't think the text-processing capabilities of Lucene are there, and that's OK, because one is a database and the other a search engine. Elastic has drifted a lot toward being like a database, but at least for me, what has worked very well is having a database as the source of truth and denormalizing it as needed into Solr, which is queried by the site. Just something like faceted search navigation is way easier to do well in Solr or Elastic versus PG.
Back in 2017 when Spectre and Meltdown hit, I was boarding a flight to Mexico with my wife for a short getaway. As the flight started to take off I noticed a PagerDuty alert about a backlog in our queue. Our company handles millions of calls daily, and at that time we indexed each call record as a single document into Elasticsearch (ES). When I landed and finally got to our hotel and plugged into the wifi, the problem was significantly worse: as a result of the patches AWS rolled out the night before, we observed roughly a 30% reduction in computing power across all servers, including our application servers, database, and Elasticsearch.

I remember sitting on the beach trying to figure out a good solution. I immediately realized that batching the index operations via the _bulk endpoint was what we had to do; I just wasn't clear on how to get there. After 2 days of hacking on the problem and explaining to customers that it would be okay and no data was lost, I had a solution that involved Redis sets to ensure we only index unique call records, plus a self-queueing bulk indexing job that limits the total number of jobs queued as a function of our overall capacity.

Most frameworks assume that indexing a single record per update/create/delete operation is the way to go, but that doesn't scale. It's been a few years now since then, and Elasticsearch has proven its worth over and over again as a denormalized index, allowing for faster search and aggregated reporting that our normalized database could not provide.
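The dedup-plus-batch idea can be sketched like this. A plain Python set stands in for the Redis set (SADD returning 1 plays the `enqueue` role in a real system), and the bulk call is a stub; all names are hypothetical:

```python
def enqueue(call_id, pending, queued):
    """Only enqueue a call record once per batch window."""
    if call_id in pending:
        return False          # already waiting to be indexed; skip duplicate
    pending.add(call_id)
    queued.append(call_id)
    return True

def flush_bulk(pending, queued, bulk_index_fn, batch_size=500):
    """Drain the queue in _bulk-sized batches, clearing the dedup set as
    each batch is successfully indexed."""
    while queued:
        batch, queued[:] = queued[:batch_size], queued[batch_size:]
        bulk_index_fn(batch)              # stand-in for a _bulk request
        pending.difference_update(batch)  # those ids may be enqueued again

pending, queued, sent = set(), [], []
for cid in [1, 2, 2, 3]:                  # note the duplicate 2
    enqueue(cid, pending, queued)
flush_bulk(pending, queued, sent.append, batch_size=2)
```

Capping the number of outstanding batches (the "as a function of our overall capacity" part) would be a check on `len(queued)` before enqueueing more work.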
The easiest log retention setup I've ever used was just a server with logs dumped into a text file, searched by zgrep. I know it's not ideal for all use cases, but I miss the simplicity of it.
You might trade for a different problem, which is that search could end up being a very different query load on the Postgres cluster compared to existing queries. That might make it hard to scale it out for both use cases.
I think what they mean is that you only index your searchable fields in ES, and then, when you do a search query, the only thing ES returns is PKs, with which you then do a second roundtrip to the database (or cache) to fetch the actual records. That way you don't have to worry too much about keeping all the record information perfectly in sync.
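That two-step pattern is simple to sketch; here stub functions stand in for the ES query and the PG lookup (all names hypothetical):

```python
def search(query, es_search_fn, fetch_by_ids_fn):
    """Two-step search: ES stores only the searchable fields and returns
    primary keys; the full records come from the source of truth (PG),
    so non-search fields can never drift out of sync."""
    ids = es_search_fn(query)       # e.g. [hit["_id"] for hit in hits]
    records = fetch_by_ids_fn(ids)  # e.g. SELECT ... WHERE id = ANY(%s)
    # the DB returns rows in arbitrary order; restore ES relevance order
    by_id = {r["id"]: r for r in records}
    return [by_id[i] for i in ids if i in by_id]

# Simulated usage with in-memory stand-ins:
db = {1: {"id": 1, "title": "postgres"}, 2: {"id": 2, "title": "elastic"}}
results = search(
    "x",
    lambda q: [2, 1],               # ES says doc 2 is more relevant
    lambda ids: [db[i] for i in ids],
)
```

The `if i in by_id` guard also quietly handles the one remaining sync gap: an ID that ES still knows about but that has been deleted from the database.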
The GIN index has some similarities to Elasticsearch's inverted indices (last I knew anyway), which also can be quite expensive to write to. If you're doing heavy writes, something to test and consider carefully.
TL;DR: writes get a lot more expensive with GIN indices.
Anecdotally, I've heard of successes using it at relatively large companies for syncing pg to elastic. I don't have first hand experience.
There's also pgsync[0], but again, this is just from some preliminary research on my part, can't speak to relative or absolute quality for either option.
We did that at my previous job. Because we had a really good idea of what was searched and used we indexed everything really well. We only had around ~ 6b records but it turned out to be faster to search than ES. And we didn’t have nodes turning bad every other month.
Yep, CrateDB is not ACID but has eventual consistency at the row level.
I think for some use cases it's fine that some inserts are not visible in search results immediately. Also, one could tweak refresh_interval or use an explicit refresh for "must be visible" inserts.
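For reference, both knobs are standard Elasticsearch APIs: refresh_interval is a per-index setting, and the per-request variant is a query parameter (index name made up):

```
PUT /my-index/_settings
{ "index": { "refresh_interval": "30s" } }

PUT /my-index/_doc/1?refresh=wait_for
{ "title": "must be visible to the next search" }
```

`refresh=wait_for` blocks the indexing request until the next refresh makes the document searchable, which is usually cheaper than forcing `refresh=true` on every write.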