-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question: Relational DB VS. Index (like Elasticsearch or Solr) #317
Comments
Hi @kyleobrien91 and thanks for the thoughts. I'd like to understand this more. What benefits would using a non-relational datastore like ElasticSearch or Solr provide for the end user, or for us as admins? Do you see this being just a change in backend, or would that interface be exposed publicly? |
@longhotsummer - so, at their core, things like ElasticSearch and Solr aren't really datastores in the traditional sense of thinking about NoSQL stores. One very typical use case for Indexes (although, there are many more) is a search. So, for example, a pattern in the real world for an index is
This is sort of the typical implementation I've come across a hundred times. However, I've also seen many statistics interfaces coded off the back of an index. So very pretty graph type data representation is very comfortable pulling that large dataset from an index more than its comfortable pulling that same data from a relational data store (even a heavily cached one). Primarily, this works well because Indexes are the best place to store large data into the concept of 'documents' - related pieces of information. Indexes are extremely fast. They work well for data that doesn't change often (tracking banking transactions is typically not a good fit for indexes). This provides, primarily, a good user experience. Some considerationsOperationsBeing able to index involves actually getting a separate service running on a server that will index that data. This is not really that different to running MySQL as a service but is perhaps new territory. User ExperienceThis kind of implementation primarily benefits the end user of both the frontend and the API. Future Implementation of a 'Site Search'Getting an index up and running that serves the data to the frontend in its current form makes the implementation of a site search down the line a trivial task. Resources |
It's an interesting idea to use index systems like you mentioned for this. I guess the 'facts' view in our API (and I guess typical OLAP systems) I'm curious what the performance would be for the 'aggregate' view A use case we rely on a lot is to show the values for a given line item, So far we've gotten performance to an acceptable level with appropriate I doubt we'd like to lose flexibility with the data - we're using On 1 November 2016 at 18:05, kyleobrien91 notifications@github.com wrote:
|
@jbothma - I think to the overarching point about aggregation. Both Solr and ElasticSearch as indexes provide query aggregation. So I imagine, this would work pretty well in a scenario in which pre-aggregating the data isn't desirable - indeed, it also adds a lot of flexibility with the kind of interface one can provide via the API. I don't think pre-aggregated data works well there. So, as resources to that point:
|
@jbothma - then, looking at cubes (never worked with the concept - although perhaps I have under a different name), based on very limited exposure to OLAP, I think ES and Solr more replace cubes than really seek to interface with them. There are a couple of resources on replacing typical OLAP cubes and their supposed lack of future life - although, I have no feelings on the latter. That all being said, it seems that ES works better with interfacing with OLAP frameworks if that's truly required to make development easier. |
I think we went with the OLAP model mainly since that's how people tend to look at data like this, and we want the interface to be familiar. Also to avoid reimplementing the API concepts. So I don't think we'd want to stray too far from these ideas. Another consideration was that Postgres has really been proven to be a low-buzz high performance database over the last few years. So we went with what we know and had in place. That said, I really like considering not-so-obvious implementation options like these that show promise. Someone else asked why we didn't use any of the existing open data platforms - freedom to customise the platform as needed was one of the considerations but that's also something we should reconsider if we need to change. I don't think we'll move forward on alternative backends like solr or ES just yet, but thanks so much for raising this and running through some of the plusses and minuses. If we re-evaluate the backend we're on for performance or maintenance we'll definitely keep this in mind. If you're keen on benchmarking some of these options we can talk through more of the details of the queries that are taxing the database heavily and the sort of characteristics and use cases we prioritise. Since this isn't a pressing issue for us right now we don't have time to really get our hands dirty with alternatives just yet. |
I was just wondering about this.
Thinking
In its current form, we're storing very non-fluid data. It doesn't seem to me like we're really storing any new data between quarters. At most, I imagine we might have to update a couple of numbers here and there (I could be grossly misunderstanding this :)
Potential Suggestion
So, in thinking that, I wonder if an index like Solr and/or Elasticsearch isn't better suited to our needs. They're both persistent, extremely fast, scalable in that we can run multiple nodes and they both have a long history in the concept of 'search'. If we're honest, this is essentially fancy big(ish) data search.
Options:
This also falls nicely into the so-called 'big data' pattern. We could introduce a DAG (like AirBNB's Airflow) to run the steps in the Standard Operation Procedures in the README of the repo and then have it automatically index the data into one of the tools above.
The frontend is developed agnostically of any of that and we can even do cool stuff like index various quarters and do comparisons etc...
Thoughts?
The text was updated successfully, but these errors were encountered: