Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I've done two large search projects in the last year. One with Postgres search and one with elastic search. The elastic search version was trivial to configure, and was embarrassingly superior to the point where I wish I had never considered Postgres.


It's not that simple.

If you're using a secondary data store like Elasticsearch presumably you're combining it with a traditional relational database like Postgres or MySQL. And in most use cases, you're going to require additional constraints on top of full text, like restricting to a specific customer or topic, etc. This means data duplication. Any experienced developer should be breaking out in hives at this point because this means (1) keeping two data stores in sync; (2) adding another service to the deployment process; (3) adding another service to development environments; (4) adding another service to debug when problems crop up; (5) large scale schema changes and migrations are complicated because of two data stores; (6) operational overhead from maintaining Elasticsearch like separate backup scripts and deployment scripts and sharding config (across production, qa, and dev); (7) queries are potentially complicated because you may have to aggregate information from two data stores (pagination and delayed updates makes this even more hellish); (8) if your company is anal about devs accessing production data then ES just introduces another layer of requesting permissions and every time the schema changes you have to ask again.

And maybe the advantages of Elasticsearch far outweigh these negatives, but this is a conversation you're going to have with your devs because the cost of using ES is not zero. There are tradeoffs, and maybe just using Postgres full text search is adequate for your use case given sufficient additional constraints.


Data duplication virtually always happens as the organization gets larger. Even if you stick to RDBMS-only, you're going to get it once you setup master/slave replication, batch jobs, data science, tape backups, etc.

It's not a serious problem as long as you designate one data source as primary and every other data source as derived. That way, if you ever get data corruption issues, you can re-create the derived data and blow away the old copy. You also don't face complex data-sync issues, because all data-transfer is unidirectional. You write into Postgres or your other source-of-truth, and then you write from Postgres to ElasticSearch. If there's a difference between them, the Postgres version is authoritative, so copy its data over ElasticSearch.

Most of your other points are common to adding any new major feature, and you should always weigh the user benefits against the maintenance costs of a large new dependency. Presumably, if you're looking at ElasticSearch, it's because full-text search in your RDBMS is not good enough (which won't always be the case, but has been my experience and apparently the experience of several other people here).


For those thinking that using Elasticsearch is "trivial", please read the parent comment. And read it again. This is pure wisdom.


Not only do some of us no break out in hives at data duplication we actually design with it in mind!

http://martinfowler.com/bliki/CQRS.html


The devops issues aren't a valid counter-point. Tuning an RDBMS is an art.

It's much easier to scale a ephemeral search service than a transactional database. Likely even one search node will support a higher query throughput.


If it's so trivial, can you put the instructions here so the rest of us can learn from your experience?


For me it was

sudo apt-get install elastic-search

pip install django-haystack

add haystack to installed_apps, add the elastic search backend and set the end point.

use a haystack index class that is prebuilt to hook into all model change signals.

done. No weird configurations, everything was vanilla.

Maybe I am missing a step but it took me less than an hour to get everything going and it hasn't had to have any maintenance.

When you are adding ngram support and all the indexes and views in postgres to replicate the behavior, ES looks to be less complicated. At the very least I don't see a reduction in complexity doing the postgres way just that you have 1 less dependency to worry about.


With this solution, it looks like Django is keeping the Elasticsearch index in sync with the SQL database. What happens if the Django process crashes after having committed data to the SQL database but just before having updated the Elasticsearch index? How do you reconcile the index with the database after the fact?


Three options, depending on how demanding your users are:

1.) Don't. So what? That entry will never show up in search results, which is probably exactly what would've happened if you use a search engine with poor ranking, and exactly what will happen if you don't provide search at all.

2.) Blow away your index periodically and re-create it at off-peak times, or upon crash. Works as long as your data set is small enough to read it all off-peak.

3.) When you read your search results back, check them against the source-of-truth and re-index anything that's inconsistent. Relatively easy if there's a 1:1 correspondence between ElasticSearch documents and RDBMS tables; gets more difficult with complex joins.


From my memory, haystack gives you a couple of manage.py commands for synchronisation. So you'd run one of those when spinning things back up.

All the usual caveats about the inherent complexity of distributed systems apply, but it's still pretty convenient.


ElasticSearch has a very good introductory guide included in the documentation. Go through the first half of the Getting Started section and you should be well on your way, with most of the following sections being just for fine-tuning:

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/gett...


Which, unless I missed something, says nothing about you keep your Elasticsearch index in sync with your source-of-truth database.


Depending on your problem domain, you either do periodic dumps of your data into ElasticSearch (oftentimes done nightly, when server load is low), or you do a dual-write layer in your application.

Search is one area where user expectations around freshness are somewhat relaxed. If your search results don't include up-to-the-minute results, in most domains users will forgive you. It's better than having poor relevance.

If up-to-the-minute results are critical, then update your DB abstraction layer to write into ElasticSearch as well as PostGres. It's a pain, but manageable. Consistency, again, is not as critical for search as for a source-of-truth store, because a missing or stale record is the same as a poor ranking algorithm that didn't retrieve it in the first place. And if your application detects a bad search result, they can always go back to the source-of-truth and re-index it.


"oftentimes done nightly, when server load is low" what does this mean? Unless you are running a highly localized website, websites don't have a nighttime. We see pretty much constant usage 24/7 (though a bit higher when it is daytime in Asia). This probably would have made more sense thousands of years ago when the world was flat and the sky barge would pull the sun over the horizon into the underworld to travel back to the other side during the night, but nowadays? Not so much.


irrational is right: it sounds trivial only when you don't consider the evil details.


I have used ElasticSearch also on a number of projects and it is an extremely simple to install, configure, use and manage. Most of the functions are trivially available over simple REST HTTP calls.

Unless you need basic search and you desperately want to keep everything in the same database I would split it out. Architecturally it makes more sense to split it out just to allow for different scaling needs.


From my experience Elasticsearch is far from trivial to configure right. At least compared to PostgreSQL.


Superior in what ways?


With Elasticsearch and Solr, you can easily customize analysis and scoring. There are several scoring algorithms built into them [1] such as BM25 (the default now) which is considered the state of the art for keyword relevance. For analysis, you can remove stopwords, stem, apply synonyms, etc [2]. Elasticsearch is specifically designed to scale across multiple machines which is necessary for TB datasets. There are also things like "more like this" queries and context-aware spell checking. Some of that you can do with PG, but not all of it. If PG can do it, it is usually harder to set up.

1. https://www.elastic.co/guide/en/elasticsearch/reference/curr...

2. https://www.elastic.co/guide/en/elasticsearch/guide/current/...


The PostgreSQL team is working on some of the weak spots. Please have a look a the new 'RUM' index, that should improve ranking:

https://www.pgcon.org/2016/schedule/attachments/436_pgcon-20...

https://github.com/postgrespro/rum


Rich library of tokenizers and analyzers. A test proof analyzer model and pipeline. For full text search, different score modes are supported that is beyond the trivial case mentitioned tf-idf model, how are going to do field centric ranking in postgres?

As far as I am concerned, it is far superior to the goodies that mentioned in this article.


For many the source of truth is still Postgres, but you may want some of the flexibility and search power that comes along with ElasticSearch in many cases. An interesting project to look at may be ZomboDB which lets you maintain ElasticSearch indexes from directly within Postgres and query from within PG directly. (https://github.com/zombodb/zombodb)


Postgres was so much easier for me. In fact I cannot see the reason Elasticsearch is so popular. For our website search I first researched Elasticsearch. Each of the short pages of the so-called Getting Started guide (https://www.elastic.co/guide/en/elasticsearch/guide/current/...) felt like walking into either (a) a brick wall or (b) an empty room:

p. 1: sales pitch

p. 2: sales pitch

p. 3: "The only requirement for installing Elasticsearch is a recent version of Java." Ugh.

pp. 4-10: A new JSON API to learn. This is just the part about getting your stuff into Elasticsearch and, at last, doing a query that is no better than "select * from table where last_name like '%Smith%';

p. 11: "More complicated searches," an apt title. Here is the Elasticsearch query that's equivalent to "select * from table where last_name like '%smith%' and age > 30":

  GET /megacorp/employee/_search
  {
      "query" : {
          "filtered" : {
              "filter" : {
                  "range" : {
                      "age" : { "gt" : 30 } 
                  }
              },
              "query" : {
                  "match" : {
                      "last_name" : "smith" 
                  }
              }
          }
      }
  }
It only gets more complicated from there. There is talk about ranking, but it was still a black box even by the end of the Getting Started guide, and I was going to want to tweak it. It just seemed more obvious how to do so with Postgres (https://www.postgresql.org/docs/9.6/static/textsearch-contro...).

How is Postgres merely "good enough"? How are other products better? What is their draw?

Is it ease of set-up? For me Elasticsearch was harder to install than Postgres and made me learn a bunch of new syntax, beyond the SQL I have to know anyway to write web applications in the first place. Postgres handles stemming, stopwords, ranking, snippets, and highlighting.

Is it better ranking? Postgres can weigh the search words' frequency, their proximity to each other, the length of the document, the number of unique words in the document, and the "mean harmonic distance between extents." I was able to further tweak the results by using other columns and joining other tables.

Is it faster search results? Postgres searches tens of thousands of pages in a split second.


It seems like setting up and syncing an entire separate database for search wouldn't exactly be "trivial"...


I've migrated last week 500M documents (100GB) of data from mongodb to ES. It's quite simple and straightforward. Used the bulk API. Just make sure to have a good machine (16vCPU) and use parallelization and just wait. after 24hours, it just works.


How do keep the ES index consistent with MongoDB?


That's only 200 bytes per document.

Can they be really considered 'documents' or are they actually database fields?


That's an average.


Elasticsearch can be installed in seconds with most package managers and syncing is trivial depending on what you want to index. You can use a JDBC river for bulk indexing or rely on trigger's.


If it's so trivial, why don't you just put the instructions here so the rest of us can learn from your experience?


"Syncing is trivial" Have you got any links to help me set up PostgreSQL to Elasticsearch sync?


I don't know about elasticsearch specifically, but here are postgres -> x instructions using LISTEN/NOTIFY.

https://www.chrisstucchio.com/blog/2013/postgres_external_tr...


Rivers were deprecated and removed from 2.0. The syncing story isn't terrible but it's not as simple as it used to be.


Looks like the new syncing story involves logstash as an ETL platform

https://www.elastic.co/guide/en/logstash/current/deploying-a...


Eh, that's more for logs specifically. The "syncing story" is something that depends on your technology platform and specific situation. I adopted ES when it had rivers, but never even considered using them for syncing.


For what it's worth Logstash is cited as a replacement for rivers in the deprecation announcement.

https://www.elastic.co/blog/deprecating-rivers


that's disappointing :-(


> syncing is trivial depending on what you want to index

Can you explain how you keep your Elasticsearch index consistent with your main database, in face of system failure (for example a network partition or a machine hard reboot)? This is usually the non-trivial part.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: