Monday, July 15, 2013

Big Data, what it can and can't do

With all the hype around big data, I happened to attend the Fifth Elephant 2013 conference to understand the playing field better. The speaker list was impressive and had some industry bigwigs like Dr. Edouard Servan-Schreiber, Director of Solution Architecture at 10gen (the MongoDB company); Dr. Shailesh Kumar, Member of Technical Staff at Google, Hyderabad; and Andreas Kollegger, Experience Architect at Neo4j, to mention a few.

The experience was thoroughly fulfilling, and it was nice to rub shoulders with the local tech community and connect on such a scale. It's just fascinating to see the amount of data that some companies generate, capture and operate upon on an everyday basis.

This post contains my take on the technology's applications and limitations; again, thoughts may vary, and that's why we have a comments section.

First, let's discuss where we cannot apply or use big data/NoSQL paradigms:

  • It cannot be used for applications and systems which have a high volume of long or complex transactions, or which require multi-join queries. That's something no NoSQL implementation guarantees so far. It may be on the cards, but it seems unlikely, as it would dilute the non-RDBMS flavor of these implementations.
  • It cannot be applied to legacy systems which are tightly coupled with their database systems. For example, in one of my previous projects, an application was very DB-heavy: it had a lot of functions and stored procedures which formed the core of the application logic. So, even though the app had a huge amount of data, this coupling makes it difficult to move to a NoSQL implementation.
  • It is not the right choice for applications which deal with a small amount of unstructured data. Honestly, because we cannot use an elephant to scare a mouse; a cat would do just fine.
  • It essentially cannot be used for anything that operates in hard real time, e.g. capturing data from an F1 car to do real-time diagnostics and see where a problem might come from (or maybe we can, if a little bit of latency is not a problem).
NoSQL/big data has given us the power to operate in near real time on very huge data sets, but of course the speed of the operation depends on the implementation of the crunching logic. So, in order to have a fast op (read: low latency and high throughput) we need a NoSQL DB and near-real-time processing/crunching capabilities.
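To make the near-real-time crunching idea concrete, here is a minimal sketch of a sliding-window event counter, of the kind the F1 telemetry example above might use to flag recurring anomalies. It is a toy, pure-Python stand-in with hypothetical names; a real pipeline would sit on a stream processor backed by a NoSQL store.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts events per key over the last `window_seconds` of traffic."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()    # (timestamp, key) pairs, oldest first
        self.counts = Counter()  # live per-key counts inside the window

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the time window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def count(self, key):
        return self.counts[key]
```

With a 60-second window, an `engine_temp_high` alert seen at t=0 and t=30 has already expired by the time a third alert arrives at t=90, so only the latest event is still counted. This eviction-on-write pattern is what keeps the lookup cheap: reads never scan the whole history.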

Now let's touch upon some areas where big data/NoSQL can have a big impact:
  • e-Learning is one of the classic examples. I was working with an application which had a lot of custom courses, exams and associated media for students registering to take the courses. It was designed with the rigidity of an RDBMS, but in retrospect I feel it is a good candidate for a NoSQL implementation.
  • Banks and commercial institutions are already implementing big data in a lot of ways, and fraud-monitoring agencies and companies rely on the processing capabilities of the big data stack to do transaction analysis in near real time. The transaction data still goes to an RDBMS, but a lot of other data is now being recorded in NoSQL databases for trend analysis and, simply put, faster access/look-ups.
  • Content Delivery Networks are also using the big data stack for optimizing web app performance. Citibank has such an implementation, where the application renders out of a content cache which uses MongoDB as storage. A custom cache controller can be written over the DB to achieve something like this.
  • Bioinformatics and cheminformatics systems can also leverage NoSQL databases for faster responses. I happened to work with the industry leader Accelrys Inc. in chem- and bioinformatics, and there were a few applications that I saw could definitely benefit from the big data stack. Some of their products can also use graph databases, especially with the development of the Accelrys Enterprise Platform (AEP).
  • Large-scale analytic processes and applications are the classic use case for a big data/NoSQL stack. Meteorological systems, trade analysis systems and logistics systems are places where we can use the big data stack, and I am sure it already is being used in some places. These systems need near-real-time analytics and also require data trends and reports over large data sets and over long periods of time, and that is where the big data stack can help.
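The cache-controller idea from the CDN bullet above can be sketched in a few lines. This is a hypothetical read-through cache with TTL expiry; a plain dict stands in for the document store (e.g. a MongoDB collection), and `loader` stands in for whatever renders the page on a miss.

```python
import time


class ContentCache:
    """A read-through content cache with TTL (time-to-live) expiry."""

    def __init__(self, loader, ttl_seconds=300):
        self.loader = loader     # called to render content on a miss
        self.ttl = ttl_seconds
        self.store = {}          # url -> (rendered_content, expiry_time)

    def get(self, url, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(url)
        if entry and entry[1] > now:
            return entry[0]            # fresh cache hit: serve from store
        content = self.loader(url)     # miss or stale: re-render the page
        self.store[url] = (content, now + self.ttl)
        return content
```

On a hit the rendered page comes straight out of the store; on a miss or an expired entry the loader runs once and the result is written back with a fresh expiry. In MongoDB itself, the eviction half of this can be delegated to a TTL index on the collection.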
Lastly, I would like to close the post with a discussion that I had with a peer about the value of having huge amounts of data. Remember Garry Kasparov, who defeated Deep Blue and was defeated a year later by Deep Blue's successor. We concluded that the reason the later Deep Blue won was not that it was faster and better, but that it had a bigger data set and more crunching ability than its predecessor.

So, over a period of time, it's the high volume of data that will win, rather than a well-written, crafty algorithm.
