Rethinking analytics for the petabyte world



Fuzzy Logix analytics are being used to help explain supermarket customers’ buying habits.


Image: iStockphoto

Fuzzy Logix’s in-database analytics software and predictive algorithms aim to provide results faster than traditional analytics. ZDNet spoke to the company’s CMO Aashu Virmani to find out more.

How did Fuzzy Logix first come about?
Partha Sen [founder and CEO of Fuzzy Logix] and Mike [Upchurch, co-founder and COO] had both worked at Bank of America on things that required a lot of computational power. They soon realised that there are certain mathematical problems that current computing environments are not able to solve.

The growth is not just in the volume of data; it is the growth in the combinations of data, and that is exponential. So how do you fit these exponentially sized problems into the huge but finite compute available today?

The founders recognised the problem early, back in 2007, because they were working in the banking sector which feels the constraints most.

Now, given enough time, days or weeks, you can solve these very large problems. At that time, the idea was that you could suck the data out of a database and put it into enormous quantities of RAM.

Once you had it in RAM on the servers, you could then do many combinations and permutations, run some patterns, and save the results. Once you had done that, you could use a visualisation tool to interact with that.


Image: Fuzzy Logix

Now we think that this approach was OK for the gigabyte economy but is not relevant for the petabyte economy that we live in today.

The other problem is that the data and the applications are all in different places.

Now this is where we come in. We have been doing in-database analytics since before the term was in vogue. What we said, basically, was that data should not be moved to the processing; the code should be moved to the data. So we take algorithms that find patterns and do analytics and we plunk them inside a database.

It is only now that analysts like Gartner are writing about why in-database analytics might be relevant. It is only now that the world is coming around to this view: that the first generation of analytics is broken.

When you look at the whole space of analytics and computing you could say that the data size is very high but the computations are not that high, because you could have less complicated maths. Or, the data sizes are very small but the computations are very high.

Let’s take an example, the drug company Gilead. It makes drugs and it does drug research. The data size is only 200,000 rows, but they use a legacy tool and the compute takes five hours.

Fuzzy Logix tried the same problem on the same data, but in-database, and we finished it in three minutes.

An example closer to home is Tesco. Now, Tesco wanted to get into fresh foods. They have about 40,000 SKUs [stock keeping units] in total and about 5,000 in perishables.

They wanted to know, based on the weather, what quantity of a given product they should ship to a given store on a given day so that it is not wasted. That is 3,000 stores multiplied by 5,000 SKUs, which is 15 million store/product combinations. So you have to do some predictive analytics for each store/product/day combination.

It was taking the existing product five days to do the 15 million models. We worked with them on an in-service Teradata database and we managed to get the five days down to 46 minutes.

Do you see it as software that is applicable anywhere?
Yes, and over seven years we have coded a library of about 650 algorithms, which can do anything from finding outliers in a database to finding correlations. You know the sort of thing: Is there a correlation between smoking and something else? Or in a market basket, which things are in the same baskets? Why do so many people who buy product A also buy product B but never buy product C? That sort of thing.

Each of these algorithms we coded in a massively parallel way. And that’s why we can show 10x to 100x improvements over the state of the art.

How do we do all this so quickly? Number one, we don’t move the data, and that saves a lot of time. Large datasets can take hours to move.

The second reason is that we have already coded and parallelised the math, and that is the intellectual property of the company.

In simple terms, if your data is distributed on 100 nodes with about a million records each, then if I ask you to find a particular column in a database, it’s a rather easy problem to distribute.

But if we ask the database to, say, give you the minimum age of all the patients in your records, you get 100 answers back, one from each node. Then you tell it to pick the minimum of the minimums, and that is the true minimum.

That’s all easy, but if you try to do the median of the medians, the problem doesn’t distribute so well. You could have the large ages on one machine and the small and middle ages on other machines. The median of medians isn’t always the true median of the whole dataset, so you will often get the wrong answer.
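The contrast between the two cases can be seen in a few lines of Python. This is only an illustrative sketch: the three "nodes" and the ages on them are made-up values chosen so that the data is skewed across machines, and nothing here reflects how Fuzzy Logix computes a distributed median.

```python
from statistics import median

# Hypothetical patient ages held on three nodes (made-up, deliberately skewed).
node_ages = [
    [18, 19, 20, 21, 22, 23, 24],  # one node holds all the young patients
    [65, 70, 75],
    [80, 85, 90],
]

# The full dataset, as it would look if all rows were moved to one place.
all_ages = [age for node in node_ages for age in node]

# Minimum: each node reports its own minimum, then we take the minimum of those.
min_of_mins = min(min(node) for node in node_ages)
true_min = min(all_ages)

# Median: each node reports its own median, then we take the median of those.
median_of_medians = median([median(node) for node in node_ages])
true_median = median(all_ages)

print(min_of_mins, true_min)           # the two values agree
print(median_of_medians, true_median)  # the two values differ
```

The minimum survives the split because every node's answer is a valid candidate for the global answer. A node's median says nothing about how many rows sit on either side of it elsewhere, so the per-node answers cannot simply be combined.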

Certain problems are harder to distribute and our intellectual property is that our guys who did the software have, over time, taken very hard problems and done them in such a way that they can break them into 100 pieces. They can then intelligently combine the answers to get the whole answer. That’s where we get the speed from.

When you were doing this, how did you approach the problem?

We started by seeing how it was done with the current state-of-the-art and that is done in a rather serial way. Often it requires doing multiple passes over the data.

We started with the assumption that the databases of the future will be spread over many machines in ‘shared nothing’ environments: the part of the system that holds one slice of the data does not know about the slice on the next machine, or on machine number 15, or whatever.

You almost have to rewrite the maths so that it takes a partial slice of the data and a partial slice of the compute next to that data. If you can divide the problem you could defeat it, and that was the motivation.
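As one illustration of rewriting the maths for partial slices, a Pearson correlation can be computed from six running totals that each node builds from its own rows and that simply add together across nodes. The sketch below is an assumption-laden example, not Fuzzy Logix's implementation: the function names, the node layout and the data are all hypothetical.

```python
import math
from functools import reduce

def partial_sums(rows):
    """Per-node step: summarise one node's (x, y) rows as six running totals."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n, sx, sy, sxx, syy, sxy)

def combine(a, b):
    """Combine step: totals from two nodes add element-wise."""
    return tuple(p + q for p, q in zip(a, b))

def correlation(totals):
    """Final step: Pearson correlation from the combined totals."""
    n, sx, sy, sxx, syy, sxy = totals
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    return cov / math.sqrt(var_x * var_y)

# Hypothetical rows spread over three nodes that never see each other's data.
nodes = [
    [(1.0, 2.1), (2.0, 3.9)],
    [(3.0, 6.2), (4.0, 8.1)],
    [(5.0, 9.8)],
]

distributed_r = correlation(reduce(combine, map(partial_sums, nodes)))
single_pass_r = correlation(partial_sums([row for node in nodes for row in node]))
print(distributed_r, single_pass_r)
```

Only six numbers per node cross the network, regardless of how many rows each node holds. The sum-of-squares form used here is the simplest to show but can lose precision on very large or badly scaled data, so a production version would likely use a more numerically stable update.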

It’s using shared nothing as an advantage?

Yes, and the world is coming around to this way of thinking. Hadoop is a newer way of doing cheap storage by distributing data across massively parallel machines. It popularised the term MapReduce, which is effectively the art of breaking a problem into pieces and collapsing the results. We were, in effect, using MapReduce before the term evolved.

Do you work with Hadoop?

Yes. Every database engineer uses SQL as a way to interact with databases. But SQL, unfortunately, doesn’t come with a lot of mathematical primitives. It may have a vocabulary of five or six, but it doesn’t have correlations, it doesn’t have regression and so on.

What we did was write our code in low-level languages like C and expose it to SQL so that SQL would have a richer vocabulary. It’s like a plugin for SQL.
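The general idea of giving SQL a new mathematical primitive can be sketched with SQLite, which lets a host program register user-defined aggregates. This is an analogy only: it uses Python's standard `sqlite3` module rather than C, it is not the DB Lytix API, and the `corr` function name, the `patients` table and its values are all hypothetical.

```python
import math
import sqlite3

class Corr:
    """User-defined aggregate: Pearson correlation of two columns."""

    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def step(self, x, y):
        # Called once per row; skip rows with missing values.
        if x is None or y is None:
            return
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def finalize(self):
        # Called once at the end; returns NULL if the result is undefined.
        var_x = self.n * self.sxx - self.sx * self.sx
        var_y = self.n * self.syy - self.sy * self.sy
        if var_x <= 0 or var_y <= 0:
            return None
        cov = self.n * self.sxy - self.sx * self.sy
        return cov / math.sqrt(var_x * var_y)

conn = sqlite3.connect(":memory:")
conn.create_aggregate("corr", 2, Corr)  # expose the new function to SQL

# Hypothetical table and values, for illustration only.
conn.execute("CREATE TABLE patients (age REAL, systolic REAL)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [(30, 118), (40, 124), (50, 131), (60, 139), (70, 146)],
)

(r,) = conn.execute("SELECT corr(age, systolic) FROM patients").fetchone()
print(r)
```

Once registered, the function is called from an ordinary `SELECT`, so the person writing the query needs to know SQL and nothing about how the maths is implemented.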

Did you do the development in the traditional way of putting some code in, checking it, putting the improved code in again, and checking that and going through iterations?

We developed the software over time. It has taken us seven years to get here. A single algorithm may take a team of two or three people almost six months to write.

Then we will get it into a customer situation where we are testing it on 100 million rows and see that it is still not good enough because it is taking half a day, so we will work on it some more.

When did the company develop its first product?

The core product is DB Lytix and then we have various versions of it for financial applications, drug companies and so on, and it will be anything up to around 100 different algorithms for each application.

The first product we developed around 2008 with a small group of people, and then in 2015 we got our first big financing, and since then we have just grown. We started with two salespeople and now we have ten.

The original library of products worked with Oracle, IBM and Teradata. Now we have made it work with Hadoop. We have also ported it to work with Nvidia GPUs, which gives us a 500 to 1,000 times improvement in performance.
