
Sunday, May 18, 2014

Buzz about Hive - ACID Support and Query Optimization

Apache Hive™ is a distributed data warehousing solution for Hadoop that includes the HiveQL language for a SQL-like experience.  It's not SQL, nor is it an Oracle, SQL Server or Teradata warehouse.

The Hadoop Platforms Team at Yahoo! has announced they are backing Hive (coming from Facebook) and the features of Full ACID Support and Cost-based Query Optimizations. Features like these would bring Hive closer to the world of relational databases, with all the benefits of being a large-scale distributed data store capable of holding structured, semi-structured and unstructured data.

It's fascinating to see the things that we take for granted in relational databases being built from the ground up in other tools such as Hive, with the community discussing problems that were resolved in many other database platforms 20+ years ago. 

If you want to compare SQL Server or Oracle to Hive, it's probably best not to. Hive doesn't include many of the features found in the more mature database platforms.  Queries are designed for batch processing and can take some time, even (and especially) with tiny datasets. What it has going for it is cost of storage and distribution of workload. 

DB-Engines Ranking has Hive currently at #18 on its popularity ranking chart for databases.  This ranking should be taken with a grain of salt, as the list compares apples to oranges.  Would you rank Microsoft Access with Oracle on the same page?  They are different systems with different purposes, audiences, and scalability features. 

Ignoring all that, Hive would be ranked #12 if it were compared only within the Relational Database category. That puts it ahead of SAP HANA and dBase.  Did I just say dBase?  Yes I did.  You can buy it for DOS with the original 1994 documentation for $99.  And a DOS emulator to run it on.  I may have a copy in my basement I can sell you too... along with my Intellivision.

The DB-Engines site classifies Hive as a Relational Database, which it is not.  A relational database defines a primary key for a relationship within a table, and foreign keys in related tables for associating back to said primary key.  Hive currently has no concept of primary keys or relationships, which gets me a bit stressed about manageability of data. 
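To make the primary key / foreign key idea concrete, here is a minimal sketch using Python's built-in sqlite3 module (SQLite, not Hive, and not anything from the JIRAs below).  The customer and orders tables and the demo_keys function are made up for illustration.  It shows the kind of safety net a relational engine gives you: an order pointing at a customer that doesn't exist gets rejected.

```python
import sqlite3


def demo_keys():
    """Show a relational engine enforcing a primary/foreign key relationship."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")

    # Hypothetical parent table with a primary key.
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    # Hypothetical child table whose foreign key points back at customer.id.
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customer(id),"
        " total REAL)"
    )

    conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Hannibal')")
    # Valid: customer 1 exists.
    conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 99.0)")

    try:
        # Invalid: there is no customer 42, so the engine refuses the row.
        conn.execute("INSERT INTO orders (customer_id, total) VALUES (42, 10.0)")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True

    order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    conn.close()
    return rejected, order_count


if __name__ == "__main__":
    print(demo_keys())
```

Without that kind of constraint, the orphaned order would simply land in the table, and it would be up to whoever loads or queries the data to notice.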

Things that we DBAs take for granted, such as Oracle sequences or SQL Server identity columns, don't exist in Hive.

It's only a matter of time though...

Here are some of the JIRAs related to ACID Support and Cost-based Query Optimizations.

Cost-based Query Optimizations
https://issues.apache.org/jira/browse/HIVE-1938

Relationships & Sequences
https://issues.apache.org/jira/browse/HIVE-6905

ACID Updates
https://issues.apache.org/jira/browse/HIVE-5317

In other news, Qubole, founded last year by some of the members of the Facebook Hive team, has announced Presto as a service.  Presto is another SQL query engine for Hadoop that operates 10x faster than Hive, or at least that's the quoted marketing metric.

Thursday, May 15, 2014

115 Projects and Mountains of Data

Have you heard of Hadoop?  Sure you have.  You're reading this, aren't you?

My colleague, who is going for every Hadoop certification available, has kindly provided some links to add to my 150MB OneNote notebook on the Hadoop and Apache ecosystems.  That's about 1.2 blocks of HDFS data (replicated 3 times) if you haven't adjusted the default 128MB block size and are using MR2.  I'll try to share some of them on this site.
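As a rough check of that block math, here is a minimal sketch assuming stock Hadoop 2 settings: a 128 MB dfs.blocksize and a replication factor of 3.  The hdfs_footprint function is a made-up name for illustration, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: Hadoop 2 default dfs.blocksize
REPLICATION = 3      # assumption: default dfs.replication


def hdfs_footprint(file_mb, block_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (fractional blocks, whole blocks, block replicas, raw MB stored)."""
    fractional_blocks = file_mb / block_mb
    # HDFS splits the file into whole blocks; the last one is only partly filled.
    blocks = math.ceil(fractional_blocks)
    # Every block is copied `replication` times across the cluster.
    replicas = blocks * replication
    # A partial block only consumes the bytes it actually holds.
    raw_mb = file_mb * replication
    return fractional_blocks, blocks, replicas, raw_mb


if __name__ == "__main__":
    print(hdfs_footprint(150))
```

Under those assumptions the 150MB notebook works out to roughly 1.17 blocks' worth of data, stored as 2 blocks (one full, one partial), 6 block replicas in total, and about 450MB of raw disk.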


The list of projects out there doesn't quite qualify as big data but is still getting pretty unmanageable for me.  Apache alone has 115 projects listed, though some are shelved and haven't been updated in a while, and only about 11 are categorized as "Big Data."


I'm pursuing just one certification for now, and focusing a bit more on some of the amazing tools out there that work with the core infrastructure.  I will try to share some of my findings on this blog for anyone who might find them helpful.


If you're going to get certified in the core of Hadoop, you'll want to understand Java programming and MapReduce theory. This could change in the future, as MapReduce slowly gets relegated to the mines of Mordor, with YARN treating it as a tenant in a larger domain of heterogeneous applications.  The possibility of running different MR versions, or even doing away with MR and going with one of the other 7 Dwarves (or perhaps 13) as a core piece of the architecture is a serious concern.


Speaking of Mordor, an Oliphaunt is a large war elephant from Lord of the Rings.  



The New York Times has an article from 1984 called "The Mystery of Hannibal's Elephants."   Hannibal had a 38-node cluster of War Elephants, and crossed the Alps with those elephants and 100,000 men (give or take 60,000 or so; Wikipedia has a different number).

There are currently 129 Apache committers who contribute to more than 10 projects each.  That's just under 4% of the 3500 or so committers listed on the Apache site.  The top two committers, Jim Jagielski and Dr. Chris Mattmann, have contributed to at least 35 different projects each.  The Apache ecosystem is an amazing community with some very dedicated and passionate individuals.  However, there is an even larger "dark pool" of talent branching and forking open-source code for their own needs within the silos of companies like Twitter, Intel, eBay, LinkedIn, IBM, Facebook, Google, Yahoo, and yes, even Microsoft.


The cute elephant in the room of 2006 is turning into a herd of war elephants that will crush relational database systems as we know them.

Or so they say...



I will either find a way, or make one.
-- Hannibal