Monday, June 23, 2014

Presentations from the Apache Accumulo Summit 2014

"Up to 10 quadrillion entries in a single table"

That's 10,000,000,000,000,000 key-value entries.

Sounds like a limitation to me...

Presentations from the Accumulo Summit are up.  Apache Accumulo is an open-source implementation of Google's BigTable design, built on top of Hadoop.  http://www.slideshare.net/AccumuloSummit

Information on HAWQ & the Accumulo connector, Ambari, Slider, YARN, TinkerPop, etc.

The TinkerPop stack with Blueprints is my favourite project suite to read about, if only because of the cartoon mascots in their architecture diagrams.  Every project team needs a graphic designer like Ketrina Yim to improve morale and adoption in the community.  Many projects would benefit from a designer's perspective, rather than a programmer's, when it comes to building user-friendly applications and branding.  Tech projects often take themselves much too seriously.

Would you rather learn more about Graph Server XI or Rexster?  I thought so...

- Ketrina Yim, TinkerPop stack

Come for the information, stay for the nice graphs on Accumulo adoption in the community and the 172-slide deck from Aaron Cordova on scaling Accumulo clusters, with lots of examples of truly "Big Data".

  • 1 year of the Large Hadron Collider = 15 PB
  • 1 year of Twitter = 182 billion tweets & 483 TB
  • Netflix master = 3.14 PB (Pi!)
  • World of Warcraft = 1.3 PB
  • Internet Archive = 15 PB

That's not big data.  THIS is big data...



Friday, June 20, 2014

Hadoop'able Materialized Views

The smart teams working on Apache Optiq are promoting discardable, in-memory materialized queries (DMMQ) as a potential source of performance improvements when dealing with large distributed datasets in Hadoop.  Why not use all that memory sitting idle in your Hadoop cluster?

A presentation on DMMQ here.
http://www.slideshare.net/julianhyde/discardable-inmemory-materialized-queries-with-hadoop

The DMMQ blog at Hortonworks
http://hortonworks.com/blog/dmmq/

The DDM blog at Hortonworks
http://hortonworks.com/blog/ddm/
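
The concept, roughly: precompute an expensive aggregate once, keep the result in otherwise-idle cluster memory, and let the planner rewrite matching queries against the materialized copy, falling back to the base data if the copy has been discarded.  A hypothetical SQL sketch to make the idea concrete (table and column names are invented for illustration; the real mechanics live in the Optiq planner, not in hand-written DDL like this):

-- One-time, discardable materialization of an expensive aggregate
CREATE TABLE daily_totals AS
SELECT order_date, SUM(amount) AS total
FROM orders
GROUP BY order_date;

-- A later query like this could be rewritten by the planner to scan
-- the small daily_totals table instead of the full orders table:
SELECT order_date, SUM(amount)
FROM orders
WHERE order_date = '2014-06-20'
GROUP BY order_date;

Because the materialized copy is derived data, it can be thrown away under memory pressure and rebuilt on demand, which is what makes it "discardable".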

Monday, June 9, 2014

Querying Hive, the "Microsoft Way"

Apache Hive is an abstraction layer that generates MapReduce jobs in Hadoop, and a lightweight data warehousing tool providing schema-on-read capabilities, with table metadata kept in its metastore.  By default the metastore lives in an embedded Apache Derby database, though MySQL is a common choice in production; in Microsoft Azure HDInsight, it is stored in Azure SQL Database.

Using a "SQL-like" HiveQL language you can write queries that can access data stored in a Hadoop cluster, either within the Hive warehouse (predefined metadata) or in external files (text or binary).

Microsoft has LINQ to Hive support through its Hadoop SDK, for those developers who enjoy using LINQ as an abstraction over their data.

Go get LINQPad and try it out!

If you're lucky enough to already have the LINQPad Premium edition, you can pull in the required assemblies from NuGet directly from the Query Properties pane.

You'll need the following packages for this demo query.  For testing purposes, I just installed them using NuGet in Visual Studio, then browsed to the folder containing the assemblies in LINQPad.

Install-Package Microsoft.Hadoop.Hive
Install-Package Microsoft.AspNet.WebApi.Client -Version 4.0.20710
Install-Package Newtonsoft.Json

Once you've added the assemblies, you can run this C# statement, after replacing the URL, user ID & password.  The port (50111 by default) is the WebHCat port where Hive / HCatalog is exposed.

using System;
using Microsoft.Hadoop.Hive;

// Connect to Hive through the WebHCat REST endpoint
var db = new HiveConnection(
    webHCatUri: new Uri("http://<myhadoopclusterurl>:50111"),
    userName: "<myuserid>",
    password: "<mypassword>");

// The query runs asynchronously; Wait() blocks until the job completes
var result = db.ExecuteHiveQuery("select * from access_logs");
result.Wait();

LINQPad is awesomeness...

Hadoop Summit 2014 presentations

Slides and presentations from the Hadoop Summit 2014 in San Jose are here.

To me, the most fascinating was "Hadoop 2 @Twitter, Elephant Scale" and the sheer size of the data being worked on during the migration.