Search Appliances Are Not Datastores

We’re nearing the end of this series on datastores, but we’re going to pause for a minute and talk about search appliances. Search appliances are NOT databases or datastores. They are exactly what their name implies: a mechanism that allows you to search a platform, website, database, etc. However, people often try to use search appliances as a storage solution, so let’s delve into what these are and what they should be used for.

Search appliances, as alluded to above, are meant for searching. Several vendors offer software and hardware search appliances, including Elasticsearch, Solr, and Google. The idea behind these products is to take large amounts of data and index the descriptions and categorizations that people would want to search on. The appliance then returns search results, which lets you locate the information you want faster than digging through the data yourself.

Bear in mind, this is not your own personal Google for your business. Google essentially takes everything from the internet and web pages and stores it in a cache. It then builds a search index from that cache, which is hyper complex and involves many different storage methods, and search results pull from the index. The problem is that people forget that what they are building is a search index. That index should be built from, and point back to, the data you have in your database.

For instance, maybe you have millions of articles you need to search. The appliance can go across the titles and descriptions of the articles relatively quickly via an index, but when you put the full text of millions of articles into the search index, the appliance becomes inefficient. For those of you who remember what the library was like before digitization, this is akin to walking the racks and bringing back entire books versus going through index cards to find what you need. It will always take more time to walk the racks and bring the whole book back. It’s more efficient and easier to look at index cards and locate what you need than to walk aisle upon aisle of books.
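To make the index card comparison concrete, here is a minimal sketch in Python that counts how many (term, document) entries an index would hold when it covers only titles versus titles plus full text. The sample documents, the field names, and the count_postings helper are all hypothetical, and a real search appliance tokenizes and stores entries in far more sophisticated ways.

```python
def count_postings(docs, fields):
    """Count distinct (term, doc_id) pairs an index over `fields` would hold."""
    postings = set()
    for doc_id, doc in docs.items():
        for field in fields:
            # Naive tokenization: lowercase and split on whitespace.
            for term in doc[field].lower().split():
                postings.add((term, doc_id))
    return len(postings)


# Hypothetical sample data; real articles would have far longer bodies.
DOCS = {
    1: {
        "title": "Intro to caching",
        "body": "Caching keeps hot data close to the application "
                "so repeated reads are cheap",
    },
    2: {
        "title": "Queue basics",
        "body": "A queue decouples producers from consumers "
                "and smooths out bursts of traffic",
    },
}

slim = count_postings(DOCS, ["title"])          # index cards only
full = count_postings(DOCS, ["title", "body"])  # the whole book

print(f"title-only postings: {slim}, title+body postings: {full}")
```

Even with two tiny documents the full-text index holds several times as many entries, and that gap widens as bodies get longer.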

Indexes have become more feature rich, allowing you to store more dynamic information. Now you can store a title, and you can also store attributes to filter on and other fields that enhance the search. With this, indexes took on a JSON object-type format. When this happened, people began to think search appliances were essentially the same as NoSQL datastores, which is nowhere near how they work. Just because you can store it there doesn’t mean you should.

You can feed the search index all the data you want, but you’re only supposed to build the index and search datastore around the data that’s actually going to be searched. Then you pull the original data by reference. Pulling your original data by reference can be a little slower, which is when people say, “Oh, well we’ll just put all of our data in here and then pull it back.” That will work if you have a really small amount of data. Actually, you can stick a really small amount of data just about anywhere and it will work. But, generally, real data sets aren’t small.
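As a rough illustration of indexing only the searchable fields and pulling the original data by reference, consider the sketch below. It uses plain Python dictionaries as stand-ins: ARTICLES plays the role of your real database, and build_index, search, and fetch are hypothetical helper names, not the API of any particular search product.

```python
# Stand-in for the primary datastore: full records keyed by ID.
ARTICLES = {
    1: {"title": "Scaling Postgres",
        "description": "Sharding and read replicas",
        "body": "Long article text about vacuum tuning and more"},
    2: {"title": "Scaling Redis",
        "description": "Cluster mode and eviction",
        "body": "Long article text about memory fragmentation and more"},
}


def build_index(records, fields=("title", "description")):
    """Map each term in the searchable fields to the IDs of matching records."""
    index = {}
    for doc_id, record in records.items():
        for field in fields:
            for term in record[field].lower().split():
                index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Return sorted IDs of records containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


def fetch(ids):
    """Pull the original records by reference from the primary datastore."""
    return [ARTICLES[doc_id] for doc_id in ids]


INDEX = build_index(ARTICLES)
hits = search(INDEX, "scaling postgres")
results = fetch(hits)
```

The index holds only terms and IDs. The article bodies never enter it, so the extra lookup in fetch is the price paid for keeping the index small.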

So, when your index grows and becomes large, it has to start splitting itself up across multiple nodes and keeping track of where each part of the index lives. It has to keep rebuilding itself, and there has to be enough memory to build the search results in place. Since all the data is connected, it has to pull all that data into the appliance. So now you suddenly need machines running tens to hundreds of gigabytes of RAM to hold all of that data, and it’s going to cost a ton of money. It will run more slowly and you still won’t get the results you want. Search appliances are not designed for this. They are designed to search snippets of information quickly.

Another thing to bear in mind when using search appliances to hold data is that the data in them is semi-ephemeral. To keep it from being fully ephemeral, you have to write the indexes and the root data to another storage location. This becomes expensive because replicating and reloading all of that data takes a lot of time and money. This is not how a search appliance is designed to function, and you will end up costing your business money it doesn’t need to spend.

The point of a search appliance is to search the things that people are likely to look for from your data sets, and get that data into indexes that are organized in a highly searchable way.

What the search appliance does is organize data in ways that are inverted compared to other means of data storage. For instance, fuzzy matching for autocompletion creates an index that takes a word root and stores references to all the articles containing that word root. This is not an efficient way of referencing data except when searching, which is what a search appliance is meant to do. It is not meant to BE your data.
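The inverted layout can be sketched with a simplified prefix index for autocompletion. Note that this is plain prefix matching rather than true fuzzy matching, and the titles, the min_len parameter, and the function names are assumptions made for illustration only.

```python
from collections import defaultdict


def build_prefix_index(titles, min_len=2):
    """Map every word prefix of at least min_len characters to article IDs."""
    index = defaultdict(set)
    for doc_id, title in titles.items():
        for word in title.lower().split():
            for end in range(min_len, len(word) + 1):
                index[word[:end]].add(doc_id)
    return index


# Hypothetical article titles keyed by ID.
TITLES = {
    1: "Search appliances",
    2: "Searching logs",
    3: "Datastore basics",
}

PREFIX_INDEX = build_prefix_index(TITLES)


def autocomplete(prefix):
    """Return sorted IDs of articles with a word starting with the prefix."""
    return sorted(PREFIX_INDEX.get(prefix.lower(), set()))


print(autocomplete("sear"))
print(len(PREFIX_INDEX), "index keys for", len(TITLES), "titles")
```

Three short titles already produce dozens of index keys. That trade is worthwhile for fast lookups while typing, but it shows why this structure is a poor general-purpose way to hold data.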

The most efficient way to use a search appliance is to use it to index, not to store. You let it crunch through the data you are feeding it and build the indexes and locations. That’s the data you store in a search appliance. It hangs on to the keywords and where data is actually located, and the rest gets discarded. You’re not storing the underlying data.

Search appliances are included in the datastores series because they are often misappropriated and misused by programmers. A search appliance is not a datastore and cannot be used to store data effectively and efficiently. As with anything technical in business, if you’re unsure of how to do something, hire an expert to help you implement what you need and teach you (or your IT team) how to use it properly.

About the Author

PWV Consultants is a boutique group of industry leaders and influencers from the digital tech, security and design industries that acts as a trusted technical partner for many Fortune 500 companies, high-visibility startups, universities, defense agencies, and NGOs. PWV was founded by 20-year software engineering veterans who have founded or co-founded several companies. PWV experts act as trusted advisors and mentors to numerous early stage startups, and have held the titles of software and software security executive, consultant and professor. PWV's expert consulting and advisory work spans several high impact industries in finance, media, medical tech, and defense contracting. PWV's founding experts also authored the highly influential precursor HAZL (jADE) programming language.