Indexing 500 million log entries and searching the Google way

2012-02-29 - Marius Slyzius

Massive logs are quite common at large enterprises. They are heavily used for support and are essential for running the business. Some industries, e.g. financial services, must retain activity logs forever. This creates two problems: for supporters, how to make the logs easy to access and search; for administrators, where to store them and how to make them available on demand. Imagine a single point/hub where a supporter or developer can type a search query against any log type and get a response instantly. Avoiding logging on to machines, copying files, and searching manually is a big time saver to have. And it seems the market is currently booming on the topic.

We have been looking into the problem for a while, and finally it seems we got it right. Our solution is to combine a row-based engine (the traditional RDBMS) with columnar/freetext storage (CLucene) in a distributed environment. The row-based storage engine is ScimoreDB's native format, utilizing a B+Tree and scaling linearly, capable of storing billions of rows at a rate of 500K rows/sec. The Lucene indexing rate is slower (down to 15K rows/sec per CPU core), which is expected, since it has a very powerful query engine and does more work when indexing; still, the speed is quite all right. To deal with massive logs, we summarize six steps to follow to implement a scalable log indexing and searching solution:

  1. Normalize the log database: in the log table, store attribute IDs rather than values. Store the attribute values in separate tables. Have a single index (the primary clustered index) per table. Remember, you are not going to search the log storage; the freetext index will be used for the search. The goal of normalization is to make the log storage as compact as possible, to save disk and memory. A schema sketch follows this list.

  2. Use HDD and SSD as stable storage: HDD to store the database tables and SSD for the freetext indexes. Freetext indexes are much hungrier for random/high disk IO, while the row-based engine does far fewer IOs, and they are sequential/batched reads and writes, making HDD a good choice there.

  3. Freetext: index denormalized data views. Instead of indexing attribute IDs, index the values. A denormalized view could be a UNION ALL of multiple JOINed SQL statements (see the view sketch after this list); the SQL example at the end of the article explains this in detail.

  4. Freetext: index the most recent data, and drop freetext indexes that are no longer used. For example, you can create two freetext indexes: one for January 2012 and another for February 2012. When you no longer need to search the January log, drop it. If you need it again, you can create a new freetext index and index January again.

  5. Distribute the database among CPUs or machines. For example, take a single box with a 16-core Xeon. Then create 15 ScimoreDB nodes, so that each CPU core services a single node. In such an environment, the log data will be partitioned among 15 distinct nodes, and when executing a search, 15 CPUs will search their local indexes in parallel. It will be roughly 15 times faster than a single server on the box. If you still need more speed, you can go further and add more machines to the database cluster.

  6. Security. Users who are not allowed to log in to the production server and read the log files, or to access the SQL database, must not be able to do so through the log search solution either. ScimoreDB supports Windows integrated security and can grant access to the log tables to a particular Windows user or AD group.
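
To make step 1 concrete, here is a minimal schema sketch in plain SQL. All table and column names (app_attribute, host_attribute, app_log, audit_log, etc.) are hypothetical, chosen only for illustration, and the exact DDL options in ScimoreDB's dialect may differ:

    -- Hypothetical normalized schema (names are illustrative, not ScimoreDB-specific).
    -- Attribute values live in small lookup tables; the log tables store only compact
    -- IDs and carry a single primary clustered index each.
    CREATE TABLE app_attribute (
        app_id    INT PRIMARY KEY,
        app_name  VARCHAR(200)
    );

    CREATE TABLE host_attribute (
        host_id   INT PRIMARY KEY,
        host_name VARCHAR(200)
    );

    CREATE TABLE app_log (
        log_id    BIGINT PRIMARY KEY,   -- the only index on the table
        logged_at DATETIME,
        app_id    INT,                  -- references app_attribute.app_id
        host_id   INT,                  -- references host_attribute.host_id
        severity  TINYINT,
        message   VARCHAR(4000)
    );

    CREATE TABLE audit_log (
        audit_id  BIGINT PRIMARY KEY,
        logged_at DATETIME,
        app_id    INT,
        host_id   INT,
        action    VARCHAR(1000)
    );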

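The denormalized view from step 3 can then be expressed as a UNION ALL of JOINed statements that resolve the attribute IDs back to readable values; that text, not the IDs, is what gets fed to the freetext indexer. This is a hedged sketch against the hypothetical schema above; the real view, and how it is bound to a CLucene index in ScimoreDB, is shown in the walk-through at the end of the article:

    -- Hypothetical denormalized view: one UNION ALL branch per log type, with
    -- attribute IDs joined back to their values so the freetext index stores
    -- searchable text instead of IDs. The view can also be scoped per month
    -- (e.g. WHERE logged_at >= '2012-02-01') to support the index rotation
    -- described in step 4.
    CREATE VIEW log_freetext_source AS
        SELECT l.log_id   AS doc_id,
               l.logged_at,
               a.app_name,
               h.host_name,
               l.message  AS body
          FROM app_log l
          JOIN app_attribute  a ON a.app_id  = l.app_id
          JOIN host_attribute h ON h.host_id = l.host_id
        UNION ALL
        SELECT u.audit_id AS doc_id,
               u.logged_at,
               a.app_name,
               h.host_name,
               u.action   AS body
          FROM audit_log u
          JOIN app_attribute  a ON a.app_id  = u.app_id
          JOIN host_attribute h ON h.host_id = u.host_id;
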
What more could you do with the logs? How about using them to answer BI queries, like the number of errors per day/hour, application log entries per day per version, and much more. Compared with adding a separate OLAP solution, the multiple dimension tables can be replaced with a single freetext index that JOINs with the log (fact) table.
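
For instance, an errors-per-hour report is a plain aggregate over the same tables. This sketch reuses the hypothetical schema from above and assumes severity 4 and above marks an error; EXTRACT() is ANSI SQL, and the date functions in ScimoreDB's dialect may be named differently:

    -- Hypothetical BI-style query: number of error entries per application,
    -- per day and hour, taken straight from the normalized log table.
    SELECT a.app_name,
           CAST(l.logged_at AS DATE)      AS log_day,
           EXTRACT(HOUR FROM l.logged_at) AS log_hour,
           COUNT(*)                       AS error_count
      FROM app_log l
      JOIN app_attribute a ON a.app_id = l.app_id
     WHERE l.severity >= 4                -- assumed error threshold
     GROUP BY a.app_name,
              CAST(l.logged_at AS DATE),
              EXTRACT(HOUR FROM l.logged_at)
     ORDER BY log_day, log_hour;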

A simple walk-through example covering installing, indexing and searching...