BA Insight: Why use a database when you can use a search index?
Do you ever feel like change sneaks up on you? Or feel surprised at how much progress you’ve made over the course of time, when you don’t notice it day by day?
I just got that feeling when showing someone how we use search on structured data, instead of using a database. There’s a set of structured data problems that can be handled very nicely with today’s search technology. Over the last decade, search has steadily become a better solution for these problems. I’m deeply engaged with search technology and work with it every day, yet I’m bowled over when I take a step back and see how far it’s come.
Most people still automatically turn to the familiar database and are surprised by the thought of using search. If that includes you, then read this blog post to understand where search fits today and the kind of problems you can solve – in a different way than you are used to. For many applications traditionally handled with databases, it’s remarkable what you can do with search engines.
The data integration problem
We all know how quickly the volume and variety of data is growing, while at the same time more and more business processes and decisions are data-driven. It’s very common to be in a situation where data is spread out across different systems, and where fitting together the puzzle pieces across this data is important. The problem of combining heterogeneous data sources, often referred to as information silos, is called data integration.
The data integration problem comes up over and over. “Know your customer” initiatives depend on combining data from different sources; law firms must aggregate data from multiple systems to get a central view of a client or matter; combining research results from different bio-informatics repositories is essential in life sciences research; logistics and operations centers combine data from suppliers and consumers together; corporate mergers require integrating data from multiple different systems; the list goes on and on.
Evolution of data integration solutions
Approaches to this problem have evolved over decades, and there has been a huge amount of innovation over time:
- Data Warehouses evolved in the 1980s, and today most large organizations have Enterprise Data Warehouses. The approach is to create a single place to store data and bring in copies of data from other systems via extensive extract-transform-load (ETL) processes. Data Marts are a simpler form, usually focused on a single area, offering a subset of data and optimized for a small set of applications. There is a big industry and community around Data Warehouses. (I’ve been a member or speaker with TDWI, The Data Warehousing Institute, for 15 years, and recommend it if you want to get involved). Almost universally, these are built with relational database technology such as SQL Server or MySQL.
- Data Virtualization first appeared at the turn of the century, associated with the advent of Service Oriented Architecture (SOA) and Enterprise Information Integration (EII). This approach integrates data from disparate sources, locations and formats, without replicating the data.
- Data Lakes (a term coined in 2011) store data in its natural format, which means they skip much of the ETL work and usually hold both structured and unstructured data. Data Lakes are usually associated with big data and data science initiatives, and are often cloud-based, using the Hadoop Distributed File System (HFS), Amazon S3, Azure Data Lake, and similar technology. Data Hubs are similar but apply more processing such as deduplication, data quality, and security.
- Search-based data integration and search-based data discovery tools have been in the works for over a decade. I worked on a product called the Adaptive Information Warehouse – AIW- at FAST Search starting in 2006; the Endeca MDEX product started up at roughly the same time. Today companies such as Palantir and Thoughtspot, and open source software such as Elasticsearch, are highly successful and very effective at handling structured data as well as unstructured. This approach has some of the characteristics of each of the others and is particularly flexible and high performance.
Each of these approaches is valid and can be effective; the best fit depends on the application. Traditional SQL-based data warehouses and data marts are by far the most prevalent. However, I suspect many folks are using SQL databases by reflex, rather than considering newer approaches, which could be a better fit.
For the rest of this blog post, I’ll compare traditional SQL databases and search-based data integration. I won’t dive into data virtualization (usually applied in distributed transaction systems), or data lakes and data hubs (usually applied in data science initiatives) here. For the kind of applications where people use data marts, search-based data integration is the main alternative. To understand where search indexes can be better than relational databases in this domain, let’s consider a few applications.
IT management example
Keeping systems running smoothly is important to all organizations, and vital to many. And there is a big market just for software tools to aid this: network management, application management, security management, and the like. Back when I was CTO of Empirix (about 16 years ago) we were one of the vendors in that market, and every single system included a SQL-based data mart with current and historical data. That’s what we knew, and that’s what worked best at the time.
Splunk, a company founded in 2003, changed that. They built a search-based repository as the heart of their product, from which they generate graphs, reports, alerts, dashboards, and visualizations. This disrupted the whole industry and rocketed Splunk to an IPO. More recently, Elasticsearch has made enormous inroads into this market. They have dozens of OEM partners building IT management tools on Elasticsearch, and a major fraction of their 100M+ downloads are for this kind of application. These days a search-based approach is definitely the way to go for log analytics.
Why? Three of the advantages search engines have over relational databases apply here:
- Faster, more consistent query performance: Because search technology is optimized for retrieval, it can provide a faster, more uniform response time than a data mart. It’s not uncommon to see a speedup of 50x (for example, from an average of 10 seconds to an average of 200 ms).
- More Flexible Schema: A search engine is not bound to a rigid schema, so it is more flexible than a data mart. The time and effort it takes to add a new data source is dramatically smaller, and the system is more tolerant of dirty data, missing elements, and new conditions.
- Natural, well-understood UI: People are just more familiar and comfortable with search interfaces than with database interfaces. Most applications won’t have a raw database form, but it’s surprising how much the flavor of the database seeps into the taste of the typical UI.
One outgrowth of these characteristics is the advent of some new visualization tools based on search, including Kibana and Graphana. These take advantage of the fast query performance and the ability to do ad hoc queries to provide a BI-style dashboard experience that is incredibly flexible and easy for users to tweak.
At BA Insight, when decided to build our analytics product, Smart Analytics, we had a choice to make: use a traditional SQL-based approach (including pre-processed cubes for reporting and a BI tool such as PowerBI), or use a search-based approach. We chose to build on Elasticsearch and Kibana, and never looked back.
BI example: personal dashboards
Business Intelligence has a huge range, from simple spreadsheets through enterprise-scale production reporting. Operational dashboards, showing important metrics that may be tailored to specific individuals, are a common pattern. These are predominantly done through SQL-based databases and data visualization libraries or dashboarding tools. But increasingly this kind of dashboard is search-driven. Ten years ago this involved proprietary dashboarding tools adapted to search (at FAST Search we had a thing called FAST Radar); five years ago some major BI vendors like Tibco Spotfire provided search integrations; today there are great open source tools and libraries.
At BA Insight, we’ve worked with a number of very large legal clients that were looking at two distinct projects: Enterprise Search and Personalized Data Dashboards. We were able to solve both projects with a single search engine, cutting implementation time way down, and providing extreme flexibility for the future. Now, when they want to add new data, it's easy, and the data is available in both the dashboards and the search engine at the same time. This process allowed our clients to do data integration without heavy lifting in SQL queries, and the flexibility of the search engine schema means better tolerance of incomplete and dirty data. (I’ve learned that in the real world, there is no such thing as perfectly clean data - all data is dirty). The same factors that made the search-based approach disrupt the IT management space apply to these dashboards: faster performance, more flexible schema, and natural, well-understood UI. What is remarkable to me is how quickly lawyers learn to use ad-hoc queries to dig into their dashboards, and how happy they are with this ability.
Logistics example: dispatch
Logistics management is a big field, ranging from supply chain management through production and distribution. Nearly everywhere you look there are data integration problems – between different companies as well as different systems at the same company. Up-to-date, accurate information must be available quickly.
In e-Commerce (a part of logistics management, or a kissing cousin, depending on your viewpoint), search engines took over from databases for online shopping sites in the late 1990s. The ability for search to handle very large query loads and the natural UI made this a no-brainer. Now the same is starting to happen in other areas of logistics.
The dramatic experience of XPO Logistics, a supplier of pivotal supply chain solutions, is a great example. They replaced the SQL datastore behind a key truck dispatching application with Elasticsearch. Using a MySQL-based repository they were suffering with "normal" response times in the 30 second range, sometimes much longer. With Elasticsearch, they immediately witnessed consistent 150ms response time. This is 200x speedup! In addition, they can handle more traffic and provide a more natural UI for dispatchers and shipping agents. They’ve gone on to do much more with the search technology.
Network analysis example: expertise locator
Network Analysis isn’t a market per se; it’s a set of applications of network science that deal with complex networks such as telecommunications networks, computer networks, and social networks. Social network analysis is used heavily in customer service, marketing, intelligence, and law enforcement. These all involve mapping and measuring relationships and flows between people, groups, organizations, and different kinds of connect information/knowledge entities. Typically, the software that runs these includes some very complex, highly mathematical algorithms, used in conjunction with data from many sources integrated using – you guessed it – a conventional SQL-based data mart.
Search technology is making serious inroads in this area as well. Palantir, for example, has done very well in critical intelligence applications.
All of the advantages of search technology mentioned above apply to this domain too, plus another one:
- Richer, more subtle matching: With traditional data marts, there is a binary notion of “either in or out” that forms the basis of the relational model. Boolean queries are the norm. Search engines include fuzzy matching, that is based on the notion of relevance – things aren’t purely in or out, they are closer or further away in ranking. Queries can easily be based on the form of the language, or take into account where the match is found without needing to be explicit.
For social networks, this kind of fuzzy matching is quite important. People’s names, for example, vary in different settings (I am Jeff Fried in some places, Jeffrey A. Fried in others, and my name is often misspelled as Freed). The connection of people to each other or people to subject areas is not binary – there are different strengths to those connections as well as other nuances.
BA Insight took advantage of this when we built our Expertise Locator product. To identify experts, it’s important to use their whole digital footprint, which can include documents, user profiles, tags, work and billing records, project histories, bios, blogs, posted discussions, emails, and more. There are different levels of expertise and it is important to provide users with a natural user experience to explore available expertise and trade off different criteria. It was natural to build this on top of search.
What's the downside of using a search index?
I would never claim that a single technology covers everything - there’s no silver bullets. There are things that databases do very well that search engines are bad at, for example transaction processing. Databases are built to handle online transaction processing (OLTP), which is the underpinning of everything from banking systems to inventory management. You would never use search technology for these kinds of transactions.
There are also some areas which historically have been advantages for SQL-based data marts over search engines, but where the gap is recently closed or closing. This includes:
- BI Tool Support - There is a big industry around Business Intelligence (BI) including dozens of tools ranging from Excel through Crystal Reports to PowerBI and Tableau. Although some have built great search integrations, most of these only work with relational databases, and not with search engines. However, recently some search engines have been adding SQL support (as Elasticsearch is doing in version 5.3) and even joins. I expect to connect PowerBI to Elasticsearch any day now. (And in the meantime, I’m enjoying Kibana!)
- Hardware Footprint – Relational databases store data across multiple tables and try to reduce data size through normalization, then use joins across tables for BI purposes. Search engines, on the other hand, flatten out (denormalize) data. They gain their speed in part by using more space. This meant that a decade ago, search engines would use a lot more hardware than a database for a given amount of data.
This too has shifted in the past few years. The density of both relational databases and search engine has reached the point that many applications use the minimum hardware (usually dictated by having at least two servers for fault tolerance). As you scale up, there is still a difference but often only on the order of 30%. The cost of hardware has dropped so that this has become less material. And the labor cost associated with that hardware is small – in fact, the availability of cloud-based solutions makes these costs almost nil, as well as reducing the direct cost of infrastructure.
Search engines are still going to be larger than databases (depending on the data mix), but the small difference in hardware nets you huge payback in query performance and flexibility.
Get comfortable with the search-based approach
There is no doubt that search technology is now fully capable of running applications traditionally handled by SQL-based data marts. This approach is now mainstream. There are some significant advantages to be gained in the process, including dramatically improved performance and remarkable flexibility.
If SQL databases are what you are used to, then they will feel more comfortable than using a search engine (or any other type of noSQL DB), at least at first. This is human nature. But you will discover that there is a fast-growing community around search, and a good body of training material available. With Elasticsearch downloads now numbering over 100 million, you are in good company. You may also have the option of simply licensing a product which is built using search instead of a SQL database (for example, BA Insight’s Smart Analytics, Expertise Locator, or Matter Comparison applications).
As I was writing this blog post, I went back and looked at an article I wrote nearly 10 years ago about search on structured data. We were doing exciting things back then, but it’s nothing compared to where search-based data integration is today. This is not a quick fad; it’s been a decade in the making and has huge momentum. I definitely feel like the magnitude of this change has snuck up on me. It’s kind of the reverse of the well-known quote about estimating change in the future. The last line, though, applies whether you are looking forward into the future or back into the past: don’t let yourself be lulled into inaction. The next time you are considering a new Data Mart, take a look at using a search engine instead.