Monday, March 25, 2013

Big Data is not a solution. It is raw material.

Lots of companies these days talk about “Big Data”, and some seem to be misusing the term. As happens with many new trends and concepts, companies started talking about Big Data meaning different things, abusing the term in marketing talk.

My view is that technically Big Data is not a tool to solve a problem, or a solution, but it is the problem to solve. It is the raw material potentially containing hidden gems. These gems are what we are after, and distilling the data into something not so big is what we are trying to do.

Big Data simply means: “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” (Wikipedia). In other words, it is a large amount of complicated and interconnected data that you have the need or desire to refine into usable and manageable information.

Saying that you want to “leverage big data” is like saying that you want to leverage the raw material coming out of a gold mine. What you want to do is extract the gold from the raw material and use the gold for some purpose. The rawness is not the feature; the feature is the gold in the material. The rawness is the problem.

This might seem like a semantic argument, but I think it is not just semantics. The term and its popularity imply that we still don’t have generally and easily available tools to manage the amount of data that businesses deal with.

We tend to collect more data than we can deeply analyze at low cost, which makes it difficult to study the data and find the value in it. Not having tools to easily analyze large data sets means that it is difficult to find patterns. To study patterns you need to be able to observe the data both from a high-level bird’s-eye view and at the detailed level. You need to be able to zoom in and out at will.

With big data this analysis gets difficult. You are required to sample the data and study manageable subsets of it. Then you have to identify patterns in the subsets and form your theories. Then you have to build tools to confirm or debunk the theories on increasingly larger data sets. Once you have a good solid theory, you need to build tools to apply it and distill the data into usable gems.

This approach still requires a lot of hand-building of massively parallel systems. Tools to do this kind of work have been around for a while. For example MapReduce – Google’s implementation of a programming model for processing large data sets – is one of them.
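The model itself is easy to sketch in plain Ruby. Here is a toy word count (my own example, unrelated to Google’s actual implementation): map each document into (word, 1) pairs, group the pairs by word, then reduce each group with a sum.

documents = ["big data is raw material", "data is not a solution"]

# "map": emit a (word, 1) pair for every word of every document
pairs = documents.flat_map { |doc| doc.split.map { |word| [word, 1] } }

# "shuffle": group the pairs by word
groups = pairs.group_by { |word, _count| word }

# "reduce": sum the counts in each group
counts = groups.map { |word, ps| [word, ps.map { |_w, n| n }.reduce(:+)] }

p Hash[counts]
# => {"big"=>1, "data"=>2, "is"=>2, "raw"=>1, "material"=>1, "not"=>1, "a"=>1, "solution"=>1}

The real value of the model is that the map and reduce steps can be distributed across many machines; the toy above only shows the shape of the computation.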

Once you have the software to run, Amazon AWS today gives you the storage space and computing power necessary to spin up hundreds or even thousands of servers for the duration of the data processing tasks, at a fraction of the traditional costs. When the processing is done, you release the hardware, which instantly stops costing you any money.

So there are some tools. But all of these tools are like nuts and bolts. They are in fact fantastically powerful, and they are at the base of the current technological evolution. However, we live in a world where we have nuts and bolts but already need cars and airplanes. Demand drives the resources, and thousands of people are attacking the problem from various angles.

So, all of this to deliver one fundamental message: when you speak about Big Data, keep in mind that the term is always defined in terms of the problem, not the solution. It is great to have lots of data. However, how you use the data determines its value, not the raw data itself.

Saturday, March 16, 2013

Generalists are the key to innovation.

If you fill your organization with dogmatic specialists, it is going to be extremely difficult to innovate.

For example, imagine working in a software development shop where engineers call themselves "Rubists". They participate in Ruby community events, they read every Ruby book and blog on the market, they study every aspect of the language and constantly find ways to phrase their code in Ruby more creatively. They debate endlessly about the best test framework, the best available gem, and how to write one line of code in the most readable way. They choose cute nicknames and attempt to convert as many developers as they can to their Ruby cult, adding disciples to a following that soon starts looking more like a religion than a community.

If your organization is static and not interested in serious innovation, and if you are willing to put up with that kind of environment, it might even work for you. For example, if you build small standard websites for simple e-commerce applications, you might be happy with that. More power to you.

However, now imagine that in order to innovate your organization needs to write a high-performance system to crunch large amounts of data in real time. Or perhaps you want to expand your reach and release an iPad application.

Your group of dogmatic Rubists will either try to avoid these kinds of projects, making all sorts of excuses, or try to do these things in Ruby. They will attempt to write code in Ruby that simply doesn't belong in the Ruby world.

Crunching numbers in Ruby is slow, and writing an iPad app in Ruby is a compromise that you don't need to make. It is like forcing a square peg into a round hole: you can do it, but the result is not very good.

On the other hand, if you surround yourself with strong generalists who can make the best decision for the problem domain, and maybe a few temporary specialists (contractors), you will be positioned for more effective innovation.

Taking a technological stance is an impediment to innovation that you do not need to bake into your DNA. A language or a framework is a tactical choice, not a strategic one.

Thursday, February 28, 2013

Ruby's Opinion on JSON Handling

This is just beautiful:
$ irb
>> require 'json'
=> true
>> a={123=>"number","123"=>"string",:"123"=>"symbol"}
=> {"123"=>"string", :"123"=>"symbol", 123=>"number"}
>> JSON.unparse(a)
=> "{\"123\":\"string\",\"123\":\"symbol\",\"123\":\"number\"}"
>> JSON.parse(JSON.unparse(a))
=> {"123"=>"number"}
So apparently the standard JSON gem in Ruby thinks that it is ok to convert - without errors - a hash with a bunch of non-string keys into a hash with string keys where the converted-to-string keys collide.

I can see no advantage in this behavior. It should throw an exception when attempting the unparse operation.
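If I were to sketch the strict behavior I would want (the method name is mine, not part of the gem), it would look something like this:

require 'json'

# Refuse to serialize a hash whose keys collide once converted to strings.
def strict_unparse(hash)
  keys = hash.keys.map { |k| k.to_s }
  dups = keys.select { |k| keys.count(k) > 1 }.uniq
  raise ArgumentError, "key collision after string conversion: #{dups.inspect}" unless dups.empty?
  JSON.unparse(hash)
end

strict_unparse(123 => "number", "123" => "string")
# => ArgumentError: key collision after string conversion: ["123"]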

On the other hand, I am actually OK with the fact that parsing {"123":"string","123":"symbol","123":"number"} works without generating errors. The parser should try to do whatever it can to interpret the data. That said, it would be nice if it were also possible to turn on a "strict" mode where any issue of this sort would be detected and reported.

I understand this was an implementation choice. It would not have been my choice.

Thursday, May 10, 2012

Mysql Makes Me Go "WAT?" Sometimes...


I am very well aware that there is a TIMEDIFF function in MySql to calculate the difference between two dates, and I am sure there is a very good explanation for what follows... and if you know what it is please feel free to share all the details... BUT...

...you would think that if, in whatever language, you have a subtraction operation between dates that returns a number… and if you take a date D, add a day to it, and subtract D from the result, you'd get something that means "1 day" in some unit (seconds, minutes, something…)

You'd also think that the unit would be meaningful and usable somehow.

Well… this is not MySql's opinion apparently, or at least not in any common-sense, predictable, or understandable way.

Check this out:
mysql> select (NOW()+INTERVAL 1 DAY)-NOW();
=> 1000000

ok… so apparently 1 day is 1,000,000 units… right?
Not so much...
mysql> select (NOW()+INTERVAL 12 HOUR)-NOW();
=> 120000

12 hours are 120,000 somethings. Huh? Let's try some more cases:
mysql> select (NOW()+INTERVAL 8 HOUR)-NOW();
=> 80000 
mysql> select (NOW()+INTERVAL 1 HOUR)-NOW();
=> 10000 

What about one minute?
mysql> select (NOW()+INTERVAL 1 MINUTE)-NOW();
=> 100 

That is 100 units… wat? What about 59 seconds…
mysql> select (NOW()+INTERVAL 59 SECOND)-NOW();
=> 99 

Ok… that's odd, but maybe there's some kind of rounding thing going on? Let's try some other cases…
mysql> select (NOW()+INTERVAL 58 SECOND)-NOW();
=> 58

Huh?
mysql> select (NOW()+INTERVAL 57 SECOND)-NOW();
=> 97 

WAT? How did we go from 99 to 58 to 97?
mysql> select (NOW()+INTERVAL 56 SECOND)-NOW();
=> 96 
mysql> select (NOW()+INTERVAL 51 SECOND)-NOW();
=> 91 
mysql> select (NOW()+INTERVAL 49 SECOND)-NOW();
=> 49

Remember this 49...
mysql> select (NOW()+INTERVAL 49 SECOND)-NOW();
=> 89

Look at that, now it is 89… :) 
mysql> select (NOW()+INTERVAL 1 SECOND)-NOW();
=> 1

AND of course, since we are in seemingly random-land:

mysql> set @x = now(); select ( @x + INTERVAL 49 SECOND) - @x;
=> 20120510130600

… yeah, that makes total sense. I am sure there is a very good technical explanation, but….

LOL!
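For what it's worth, here is my best guess: in a numeric context MySql seems to cast a DATETIME to the integer YYYYMMDDHHMMSS and then perform plain integer subtraction on those digits. That would explain the NOW() results above, including why the same query returns 49 or 89 depending on whether the interval happens to cross a minute boundary. A quick Ruby sketch reproduces the numbers under that assumption:

require 'time'

# Assumption (not verified against MySql internals): DATETIME becomes
# the integer YYYYMMDDHHMMSS in numeric context, and the subtraction is
# then ordinary integer arithmetic on those digits.
def as_mysql_number(t)
  t.strftime("%Y%m%d%H%M%S").to_i
end

now = Time.parse("2012-05-10 13:05:51")
puts as_mysql_number(now + 86400) - as_mysql_number(now)  # => 1000000 ("1 day")
puts as_mysql_number(now + 59)    - as_mysql_number(now)  # => 99 (crosses the minute)
puts as_mysql_number(now + 1)     - as_mysql_number(now)  # => 1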

Monday, February 20, 2012

Ruby Blocks and Procs

There are two main ways to create a method that takes and executes a block in Ruby. The first is using yield:
$ irb
>> def compute_with_yield
>> yield
>> end
=> nil
>> compute_with_yield{1+1}
=> 2
Note how the method compute_with_yield takes no parameters, and the block is executed simply by invoking yield, which grabs the block attached to the method it is called from and executes it. There is no indication in the signature of compute_with_yield that a block is expected. I find that to be an issue, but I guess that's something that documentation can solve. What bothers me is that I'd like to know how to use compute_with_yield just by looking at its signature. Documentation should only be necessary for the details of what a method does and how it uses its parameters; it should not be necessary just to know how to call the method. This to me is a defect in the Ruby spec.
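One partial mitigation is block_given?. It still doesn't advertise the block in the signature, but at least the method fails loudly instead of silently doing nothing useful. A guarded variant (the name is mine):

def compute_requiring_block
  raise ArgumentError, "a block is required" unless block_given?
  yield
end

compute_requiring_block { 1 + 1 }  # => 2
compute_requiring_block            # => ArgumentError: a block is required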
That said, the second way is to receive the block as an explicit parameter. This is useful when you want to write a method that passes the attached block along to a second method. It also makes it clear from the signature that the method expects a block:
>> def compute_with_block_call(&block)
>> block.call
>> end
=> nil
>>
?> def compute_passing_a_block_to_another_method(&block)
>> compute_with_block_call(&block)
>> end
=> nil
>> compute_passing_a_block_to_another_method{1+1}
=> 2
You can also write an equivalent of compute_passing_a_block_to_another_method in this way:
?> def compute_using_proc_new
>> compute_with_block_call(&Proc.new)
>> end
=> nil
>> compute_using_proc_new{1+1}
=> 2
Huh? What's going on here?

This is using a little known (but well documented) property of Proc.new: when invoked with no block, it acquires the block attached to the method it is called from. Or, as stated in the documentation: "Creates a new Proc object, bound to the current context. Proc::new may be called without a block only within a method with an attached block, in which case that block is converted to the Proc object."

So should you be using yield or block.call? Passing the block as a parameter, with the ability to either forward it to another method or invoke block.call, seems to give more options. It also seems to make the API clearer, because it declares the expectation of a block in the signature. So why bother with yield at all?
The reason you should always use yield, without declaring &block as a parameter, is that when you pass a block as a parameter you implicitly create a Proc object, which is an amazingly slow operation.
This can be demonstrated with a simple benchmark:

def compute_with_yield
  yield
end

def compute_with_block_call(&block)
  block.call
end

def compute_passing_a_block_to_another_method(&block)
  compute_with_block_call(&block)
end

def compute_passing_a_block_and_ignoring_it(&block)
end

def compute_using_proc_new
  compute_with_block_call(&Proc.new)
end

require 'benchmark'

n = 1000000
Benchmark.bmbm do |x|
  x.report("compute_with_yield") do
    n.times { compute_with_yield { 1+1 } }
  end
  x.report("compute_with_block_call") do
    n.times { compute_with_block_call { 1+1 } }
  end
  x.report("compute_passing_a_block_to_another_method") do
    n.times { compute_passing_a_block_to_another_method { 1+1 } }
  end
  x.report("compute_passing_a_block_and_ignoring_it") do
    n.times { compute_passing_a_block_and_ignoring_it { 1+1 } }
  end
  x.report("compute_using_proc_new") do
    n.times { compute_using_proc_new { 1+1 } }
  end
end

This script produces the following output:
                                                 user     system      total        real
compute_with_yield                           0.520000   0.000000   0.520000 (  0.519002)
compute_with_block_call                      2.940000   0.080000   3.020000 (  3.030361)
compute_passing_a_block_to_another_method    3.040000   0.000000   3.040000 (  3.036327)
compute_passing_a_block_and_ignoring_it     1.970000   0.140000   2.110000 (  2.112733)
compute_using_proc_new                       3.180000   0.140000   3.320000 (  3.316456)
As you can see compute_with_yield is way faster than all the other variations.

Passing a block as a parameter and using block.call to execute it takes about 5 times as long as compute_with_yield. That is a huge overhead!

Passing the block on to another method that then executes it adds little extra overhead. That is because the initial creation of the Proc dwarfs the cost of the intermediate method call.

Even just passing the block and ignoring it is way slower than the version with yield, where the block is actually executed!

Using the Proc.new "trick" doesn't seem to add any substantial overhead.

Saturday, February 11, 2012

Ruby object_id exploration

Everything has an object_id
Everything in Ruby is an object, including numeric constants. Every object instance has an object_id. That means that pretty much anything in Ruby can be asked for its object_id.

Let's see some examples.
Note how nil, true, false, and classes have fixed object_ids, while every time you use a string literal a new string instance is created:
$ irb
>> Kernel.object_id
=> 2223180980
>> nil.object_id
=> 4
>> false.object_id
=> 0
>> true.object_id
=> 2
>> Class.object_id
=> 2223181060
>> "a".object_id
=> 2224952320
>> "a".object_id
=> 2224948520
However, for Fixnum integers, the same number always has the same object_id:
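>> 42.object_id
=> 85
>> 42.object_id
=> 85
>> 1.object_id
=> 3
(The values above are from MRI, where a Fixnum n is stored as the immediate value 2n+1, and that number doubles as its object_id; that is why small integers get small odd ids.)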

Monday, February 6, 2012

Bits and the power of two

Powers of two have only one bit set to 1 in their binary representation. Using this fact, it is possible to write a very simple function that answers the question "is the integer X>0 a power of two?". Here is an implementation in good old C:
/*
 * Returns true if x>0 is a power of 2.
 */
int is_power_of_two (unsigned x) {
    return x>0 && ((x - 1) & x) == 0;
}
Or in Ruby:
#
# Returns true if an integer x>0 is a power of 2.
#
def is_power_of_two?(x)
   x>0 && ((x - 1) & x) == 0
end
But... why does this work?
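The key observation: a power of two looks like 100…0 in binary, and subtracting 1 from it gives 011…1, so x and x-1 share no bits and (x - 1) & x is zero. For any other positive integer, the subtraction only borrows through the bits below the lowest 1, so at least one higher 1 bit survives the AND. A quick irb check makes this visible:
>> [8, 7, 8 & 7].map { |n| n.to_s(2) }
=> ["1000", "111", "0"]
>> [6, 5, 6 & 5].map { |n| n.to_s(2) }
=> ["110", "101", "100"]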

Friday, February 3, 2012

Why is undefined a=a nil in Ruby?

In a recent Destroy All Software talk titled "WAT" a behavior of the Ruby interpreter was given as an example of surprising facts about the language. Here it is:
$ irb
>> a
NameError: undefined local variable or method `a' for main:Object
from (irb):1
>> a=a
=> nil
So, why does the expression a=a evaluate to nil when "a" is undefined?
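The short answer: Ruby's parser creates the local variable a, initialized to nil, as soon as it sees a on the left-hand side of an assignment, before the right-hand side is ever evaluated. By the time the right-hand a is read, the variable already exists and holds nil. You can watch this happen with defined?:
>> defined?(b)
=> nil
>> b = b
=> nil
>> defined?(b)
=> "local-variable"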

Tuesday, January 31, 2012

rubyperf gem: the basics

Here is an introduction to Rubyperf, a gem that makes it easier to measure execution time of blocks of Ruby code. Assuming that you have rubygems installed, rubyperf is available on rubygems.org and can be installed simply with:
$ gem install rubyperf
Depending on what version of ruby and environment you are using, you may have to specify:
require 'rubygems'
before you can use:
require 'rubyperf'
Let's take a peek at an example of usage. Let's say that you want to find out how long it takes to compute the factorial of 10,000 in Ruby.
(1..10000).inject(:*)
(the result is a number with 35,660 digits...) Using rubyperf you can measure the performance of this code with something like:

Monday, January 30, 2012

Wrapping methods in Ruby 1.8.x


For an open source project I am working on (more on this later) I needed to wrap functionality of existing Ruby classes programmatically, without actually having to open up the class and patch it declaratively.

It took a little while to find the magic, but here is a snippet of code that exemplifies my findings. The sample does the following:

- Grabs an existing class (Test, in this case)
- Changes an instance method to call the old instance method and then do something new.
- Changes a class method to call the old class method and then do something new.
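A minimal sketch of the approach looks something like this (the Test methods and the wrapper helpers are just illustrative):

class Test
  def greet; "hello"; end
  def self.version; "1.0"; end
end

# Wrap an instance method: capture the UnboundMethod, redefine the method,
# and call the original through bind.
def wrap_instance_method(klass, name)
  original = klass.instance_method(name)
  klass.send(:define_method, name) do |*args|
    result = original.bind(self).call(*args)
    puts "#{name} was called"
    result
  end
end

# Wrap a class method: in 1.8.x you have to reach into the singleton class
# by hand -- this is the nasty-looking part.
def wrap_class_method(klass, name)
  original = klass.method(name)
  singleton = (class << klass; self; end)
  singleton.send(:define_method, name) do |*args|
    result = original.call(*args)
    puts "#{name} was called"
    result
  end
end

wrap_instance_method(Test, :greet)
wrap_class_method(Test, :version)
Test.new.greet  # prints "greet was called", returns "hello"
Test.version    # prints "version was called", returns "1.0"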

The code is nasty looking for class methods in Ruby 1.8.x, but it works.
If you use Ruby 1.9 there is a cleaner way.

The dangers of ||= for booleans in Ruby

The ||= operator in Ruby is a fantastic shortcut for initializing a variable only once. It is used all the time to turn something like:
    some_hash = {} if some_hash.nil?
into a much shorter and more compact:
    some_hash ||= {}
However, it is extremely important to NEVER use this operator to initialize a boolean flag. In fact, it is tempting to write something like:
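# A sketch of the trap (expensive_lookup is a hypothetical stand-in
# for any check that can legitimately return false):
def expensive_lookup
  puts "looking it up again..."
  false
end

def flag_enabled?
  @flag_enabled ||= expensive_lookup  # memoizes only truthy results!
end

flag_enabled?  # prints "looking it up again..."
flag_enabled?  # prints it again: a false result defeats the memoization

If the lookup returns false, @flag_enabled stays falsy, and ||= happily re-runs the lookup on every call: the operator cannot distinguish "never set" from "set to false". For booleans, test for nil explicitly (@flag_enabled = expensive_lookup if @flag_enabled.nil?) instead.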

Installing RVM on Lion OSX


If you tried to install RVM on Lion OSX, you probably ran into some problems.

The "official" installation command works just fine:
$ bash -s stable < <(curl -s https://raw.github.com/wayneeseguin/rvm/master/binscripts/rvm-installer )
What doesn't work is when you try to install, say, Ruby 1.9.3 from source.