Following the Elephant: 2015

Friday, December 11, 2015

Finding Meetup Resources and Presentations

The Bay Area Hadoop User Group meetup has over 5.2k members. Meetups like these are hugely popular and provide great resources, slides, etc.

Searching Google brings back 820+ results for PDFs related to Hadoop from Meetup files. Lots of great information here.

Search Google for inurl:files.meetup.com Hadoop to find Hadoop resources, or swap in any other meetup topic you find interesting. Searching for just inurl:files.meetup.com returns over 76,000 resources.
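The search above can be scripted if you want to try several topics. This is only a minimal sketch: the helper name meetup_files_query is made up for illustration, and the optional filetype: operator is my own assumption for narrowing results to PDFs, not something the search above requires.

```python
from urllib.parse import urlencode


def meetup_files_query(topic: str = "", filetype: str = "") -> str:
    """Build a Google search URL restricted to files.meetup.com.

    Hypothetical helper for illustration; `filetype` is an optional
    refinement (assumption), e.g. "pdf".
    """
    terms = ["inurl:files.meetup.com"]
    if topic:
        terms.append(topic)
    if filetype:
        terms.append(f"filetype:{filetype}")
    # urlencode percent-encodes the colons and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": " ".join(terms)})


url = meetup_files_query("Hadoop", "pdf")
print(url)
```

Calling meetup_files_query() with no arguments gives the bare inurl:files.meetup.com search.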

Filling up the Kindle...

Friday, November 6, 2015

Closer look at U-SQL, Microsoft's HiveQL

Microsoft U-SQL is the query language used by the Azure Data Lake Analytics service. Based on SCOPE and Cosmos, which have been around since at least 2008, it combines C# types and expressions, schema-on-read, and custom processors and reducers into a SQL-like ETL and output language.

Keywords need to be upper case. The WHERE clause uses C#-style == comparison syntax. A row can contain up to 4MB of data.

U-SQL supports the complex types SQL.MAP<K,V> and SQL.ARRAY<T>.

U-SQL supports inline C# expressions, user-defined functions (UDFs), user-defined aggregators (UDAs) for custom aggregation, and user-defined operators (UDOs) to generate, process, and consume rowsets.

UDOs are user-defined operators built with Visual Studio.
https://azure.microsoft.com/pt-pt/documentation/articles/data-lake-analytics-u-sql-develop-user-defined-operators/

It will be interesting to see if this language makes it into SQL Server itself. Extractors and Outputters would be highly useful to replace some of the functionality of SSIS.

I built a similar tool a few years ago for schema-on-read. It brought CSV files into BLOB columns in SQL Server (read my article on BLOBs on SQL Server Central) and allowed you to query them by converting to nvarchar(max), applying a schema, and then outputting to a table.

Kind of felt like a data lake at the time.... though it wasn't massively parallel and didn't have any kind of map-reduce job spinning up. Then MS introduced the filestream object...
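For anyone who hasn't seen schema-on-read in action, here is a rough Python analogue of the idea. It is not the SQL Server tool described above: the column names, the SCHEMA mapping, and the read_with_schema helper are all hypothetical, and the "BLOB" is just a bytes value.

```python
import csv
import io

# Hypothetical schema: column name -> converter, applied only at read time.
SCHEMA = {"id": int, "name": str, "amount": float}


def read_with_schema(raw_blob: bytes, schema: dict) -> list:
    """Decode a raw CSV blob and apply the schema when it is queried.

    The blob is stored untyped; nothing is validated until this call.
    """
    text = raw_blob.decode("utf-8")
    rows = []
    for record in csv.reader(io.StringIO(text)):
        # Pair each schema column with the value in the same position.
        rows.append(
            {col: cast(value) for (col, cast), value in zip(schema.items(), record)}
        )
    return rows


# The stored "BLOB": raw CSV bytes with no header and no declared types.
blob = b"1,widget,9.50\n2,gadget,12.00\n"
rows = read_with_schema(blob, SCHEMA)
print(rows)
```

The same blob could be read again later with a different schema, which is the whole appeal of deferring the typing step.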

Tuesday, October 27, 2015

Virtualbox error - Kernel driver after Centos Update

On my CentOS 7 box, I lost the kernel sources after an update. VirtualBox would no longer start a VM because the update required its kernel module to be recompiled.

Running /usr/sbin/rcvboxdrv setup
showed some errors in the install log (cat /var/log/vbox-install.log).
After removing and reinstalling the kernel sources and running the above command again, VirtualBox recompiled its kernel module.

yum remove kernel-devel gcc
yum install kernel-devel gcc

Unfortunately, this may also remove some dependencies, so back up your environment!

I then had to reboot to avoid the "Creating a process..." message for VirtualBox.

Friday, October 23, 2015

ZSH and Oh-My-Zsh Shell Plugins

I remember a long, long time ago, in a galaxy far, far away, I played around with setting up custom DOS prompts. Memories of Ansi.sys and custom ANSI art come streaming into my brain...

Forget all that. On CentOS, these two commands will install the Z Shell and Oh-My-Zsh:

yum install zsh
sh -c "$(wget https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh -O -)"

Tab allows you to visualize potential paths, running processes, ls without hitting enter, and other awesomeness.

There are one or two Themes and Plugins available.

Some laundry lists, tricks and cheat sheets.

If you're running Windows, 720MB of Babun will get you Zsh among other things...

Sunday, October 4, 2015

Hue on HDInsight and HDInsight on Linux

Microsoft might have just made Data Lakes a commodity offering.

Convergence with the Linux realm is happening again at Microsoft with the introduction of Hue on HDInsight (a graphical interface for Hadoop/HDP) and HDInsight on Linux. Hue has been around for quite a while in the Apache realm and in most Hadoop distros, so I'm glad to see HDInsight is finally getting a user-friendly GUI.

Another announcement introduces U-SQL (see Michael Rys (@MikeDoesBigData), Introducing U-SQL): a SQL-like, Hive/Pig/Grep/Awk combo language for ELT+QE (Extract/Load/Transform + Query/Extract) on top of the HDInsight Big Data Lake.

The biggest announcement is the Azure Data Lake itself...

Shared folders with Virtualbox and Centos 7

Got a build error with the VirtualBox Guest Additions and HDP Sandbox 2.2.

Building the main Guest Additions module [FAILED]

Fixed by checking the log file for errors. The issue was a missing kernel directory.

$ export KERN_DIR=/lib/modules/2.6.32-504.1.3.el6.x86_64/

$ cd /usr/src/kernels

$ ln -s /usr/src/kernels/2.6.32-573.7.1.el6.x86_64/ 2.6.32-504.1.3.el6.x86_64

Thursday, September 24, 2015

Multi tab Putty

Just as it sounds: multi-tabbed PuTTY.
http://ttyplus.com/

And for Windows, there's Clover
http://ejie.me/

Sunday, September 20, 2015

Scala kernel for Jupyter notebook and randomness

Some links for Jupyter Notebook setup and randomness...

The Jupyter Scala kernel from Alexandre Archambault
https://github.com/alexarchambault/jupyter-scala

The Jupyter Spark kernel from Brian Schlining
https://github.com/hohonuuli/sparknotebook

Something random from Brian - convert genome information to MIDI music.
https://github.com/hohonuuli/dna-music/blob/master/README

Something else, video to ASCII with Akka Streams?
https://github.com/hohonuuli/streamerz

Jupyter Server
https://github.com/jupyter/jupyterhub

Setting up public server
http://jupyter-notebook.readthedocs.org/en/latest/public_server.html

Java 9 Kernel for some REPL prototyping
https://github.com/Bachmann1234/java9_kernel
http://blog.takipi.com/5-features-in-java-9-that-will-change-how-you-develop-software-and-2-that-wont/

Adding R to Jupyter
http://ihrke.github.io/jupyter.html

Saturday, September 12, 2015

Jupyter, iPython and AVG Antivirus

Linux tools like cygwin and python don't play well with security and Windows.

Not sure why I bother trying to work with linux apps on a wintel box anyway, but if anyone else is...

Exclude c:\python34 directory in virus scanner
http://useragent.xyz/lost-and-important-file-for-python-34/
to
https://try.jupyter.org/
by running
pip install jupyter
for
https://github.com/zabirauf/icsharp
by running
choco install icsharp
which installs Python 3.4.3... sigh
and throws some 404 error
and installs 2 / 4 packages,
not including icsharp.

Looks like http://python-distribute.org/distribute_setup.py is for sale. :)

git clone https://github.com/zabirauf/icsharp

If you're not familiar with IPython, now called Jupyter to be language agnostic, it is pretty awesome, and distributed peer programming will never be the same.
http://www.nature.com/news/ipython-interactive-demo-7.21492

https://www.authorea.com/
https://cloud.sagemath.com/
https://wakari.io/

Now if I could only get the shift-enter compile execution shortcut working with OneNote and C#, I would be so happy.
http://tryroslyn.azurewebsites.net/

Monday, August 17, 2015

Engine Noise and the Internet of Things

According to a recent blog post by Stephen Few, Data Visualization Guru, "The exponential growth in raw data that we’re experiencing is mostly producing noise."

I used to be a car audiophile of sorts. It was mainly about the highest tweets and lowest subs. Surprisingly my hearing didn't get permanently damaged, though I did crack a windshield and shake off my rear view mirror a few times. I still have one of these in my garage...

On a recent road trip, I introduced my kids to Pink Floyd's Dark Side of The Moon. I realized during the intro to Money that the right half of my speakers weren't putting anything out. I hadn't noticed prior to this, since most of the music I listen to now is on the radio and is really just noisy filler for my commute.

Unlike my faulty right door speaker, a pilot might notice more if the right half of their aircraft wasn't putting out any power. I don't think they would even need any instruments to tell them something is wrong. A few years ago, there was a frequently quoted IoT statistic about Boeing 787s creating a half-terabyte of data per flight. That's about 12,500x the amount of data a plane from 1977 might generate.

After all this data is generated, the results need to be interpreted by the plane's computer, the pilot, and the ground crew, and acted on. Sometimes in real time, sometimes even predictively. Perhaps 85% of the data could be considered noise. That's still about 75GB of data to scan for each flight. If it's text data it could be shrunk down to under 10GB. Not quite big data anymore if we get rid of the noise.
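The back-of-the-envelope numbers above can be checked in a few lines. This is just a sketch of the arithmetic: it assumes decimal units (1 TB = 1,000 GB), takes the 85% noise figure as the guess it is, and the variable names are mine.

```python
TOTAL_GB = 500.0        # "half-terabyte" per flight, decimal units assumed
NOISE_FRACTION = 0.85   # the "perhaps 85%" guess, not a measured value
GROWTH_FACTOR = 12_500  # quoted ratio versus a plane from 1977
COMPRESSED_GB = 10.0    # the "under 10GB" target for compressed text


def signal_gb(total_gb: float, noise_fraction: float) -> float:
    """Data left to scan once the assumed noise share is discarded."""
    return total_gb * (1.0 - noise_fraction)


remaining_gb = signal_gb(TOTAL_GB, NOISE_FRACTION)

# What the 12,500x ratio implies a 1977 flight generated, in megabytes.
baseline_1977_mb = TOTAL_GB * 1000 / GROWTH_FACTOR

# Compression ratio needed to get the remaining text under the target.
implied_compression = remaining_gb / COMPRESSED_GB

print(f"Remaining after noise removal: {remaining_gb:.0f} GB")
print(f"Implied 1977 data per flight: {baseline_1977_mb:.0f} MB")
print(f"Compression ratio needed: at least {implied_compression:.1f}:1")
```

So the quoted ratio implies roughly 40MB per flight in 1977, and getting 75GB under 10GB needs a compression ratio of at least 7.5:1.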

I couldn't find a sample of this data, though I did find this report on Noise data for the first 17 months of Boeing 787 operations at Heathrow airport.

According to the study, the Dreamliner is 3-8 dB quieter than similar aircraft. That's about the equivalent of someone breathing, though I guess at sustained time intervals and multiplied by the number of aircraft in flight it could make a difference. The study might have been helped (or hindered) by including some visual and audio samples for reference.
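Decibels are logarithmic, so a 3-8 dB difference is a ratio rather than an absolute loudness. Here is a quick sketch of what that range means in terms of sound power, using the standard 10 * log10 relationship. The function name is mine, and this says nothing about perceived loudness, which is a separate question.

```python
def power_ratio(db_reduction: float) -> float:
    """Sound power ratio corresponding to a reduction in decibels.

    Uses the standard definition: dB = 10 * log10(P1 / P2).
    """
    return 10 ** (db_reduction / 10)


low = power_ratio(3)   # lower end of the reported range
high = power_ratio(8)  # upper end of the reported range

print(f"3 dB quieter: about 1/{low:.1f} of the sound power")
print(f"8 dB quieter: about 1/{high:.1f} of the sound power")
```

In other words, the reported range works out to somewhere between roughly half and roughly one sixth of the sound power of the comparison aircraft.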

Rather than capturing hundreds of statistics and spending months and countless dollars studying flight patterns, a great gig in the sky, the better metric might have been "is it quieter than the Concorde?"

How can we determine signals from so much noise?

Wednesday, June 24, 2015

Garbage In, Garbage In, Garbage In

Many projects in the Apache ecosystem run Java. One of the places developers spend time in when dealing with performance issues is the Java Virtual Machine's (JVM) Garbage Collection options. When the heap becomes full, garbage is collected.

In the past, I have seen .NET apps that explicitly call the garbage collector improve performance, especially when dealing with black-box code that doesn't dispose of its objects nicely or bloats memory due to poor design. I have also seen it destroy performance for every .NET application on the machine.

In .NET 4.6 RC,

Enhancements to garbage collection (GC)

The GC class now includes TryStartNoGCRegion and EndNoGCRegion methods that allow you to disallow garbage collection during the execution of a critical path.

A new overload of the GC.Collect(Int32, GCCollectionMode, Boolean, Boolean) method allows you to control whether both the small object heap and the large object heap are swept and compacted or swept only.
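The "no GC region" idea is not unique to .NET. As a loose analogue only, here is a Python sketch using the standard gc module: a hypothetical no_gc_region context manager pauses the cyclic collector around a critical path, followed by an explicit collection. This is not the .NET API quoted above, and in CPython reference counting keeps freeing objects even while the cyclic collector is disabled.

```python
import gc
from contextlib import contextmanager


@contextmanager
def no_gc_region():
    """Pause CPython's cyclic garbage collector for a critical path.

    Hypothetical helper; restores the previous collector state on exit.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


before = gc.isenabled()

with no_gc_region():
    inside = gc.isenabled()                  # False while in the region
    data = [i * i for i in range(1000)]      # stand-in for critical work

after = gc.isenabled()                       # same state as before the region

# Explicitly ask for a full collection, the "call the truck" approach.
unreachable = gc.collect()
print(f"collector enabled inside region: {inside}, after: {after}")
print(f"unreachable objects found: {unreachable}")
```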

So it seems people are still trying to trick the garbage truck into showing up on the wrong day to pick up that rusty mattress or old toilet, or make sure that the garbage truck doesn't pass by when they're in the middle of running out the door with a million Glad bags.

http://stackoverflow.com/questions/118633/whats-so-wrong-about-using-gc-collect

At this point, suppose that performance plays a fundamental role and the slightest alteration in the program's flow could bring catastrophic consequences. Object creation is then reduced to the minimum possible by using object pools and the such but then, the GC chimes in unexpectedly and throws it all away, and someone dies.

Well that got dark really fast, stackoverflow.

Oracle has a good document around the concepts of the Heap and the Nursery. When the nursery fills up, the older ones leave to public school. When public school fills up, the oldest are forced out into the real world.

https://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/garbage_collect.html
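The nursery and old-space idea can be sketched as a toy model. This is a deliberately simplified illustration, not how JRockit or any real JVM works: real collectors trace reachability instead of reading a flag, and they add things like survivor spaces and tenuring thresholds. The class and method names are hypothetical.

```python
class ToyGenerationalHeap:
    """Toy two-generation heap: a small nursery and a larger old space."""

    def __init__(self, nursery_capacity: int, old_capacity: int):
        self.nursery_capacity = nursery_capacity
        self.old_capacity = old_capacity
        self.nursery = []
        self.old = []

    def allocate(self, name: str, live: bool = True) -> None:
        """New objects always start in the nursery."""
        if len(self.nursery) >= self.nursery_capacity:
            self.young_collection()
        self.nursery.append({"name": name, "live": live})

    def release(self, name: str) -> None:
        """Mark an object as garbage (stands in for 'no longer reachable')."""
        for obj in self.nursery + self.old:
            if obj["name"] == name:
                obj["live"] = False

    def young_collection(self) -> None:
        """Drop dead nursery objects and promote survivors to old space."""
        survivors = [obj for obj in self.nursery if obj["live"]]
        self.nursery = []
        for obj in survivors:
            if len(self.old) >= self.old_capacity:
                self.old_collection()
            # A real JVM would fail with an out-of-memory error if old
            # space were still full here; the toy just keeps appending.
            self.old.append(obj)

    def old_collection(self) -> None:
        """Drop dead objects from old space."""
        self.old = [obj for obj in self.old if obj["live"]]


heap = ToyGenerationalHeap(nursery_capacity=2, old_capacity=4)
heap.allocate("a")                  # long-lived object
heap.allocate("tmp1", live=False)   # short-lived garbage
heap.allocate("b")                  # nursery is full, so a young collection runs

print([obj["name"] for obj in heap.nursery])
print([obj["name"] for obj in heap.old])
```

The third allocation finds the nursery full, so "a" is promoted to old space, "tmp1" is dropped, and "b" lands in the freshly emptied nursery.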

Databricks, the Spark folks, and Intel, recently posted a great article about how GC works with Spark and how to tune Spark instances for optimized JVM garbage collection which inspired (and augmented some content for) this post.

https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html

$500 in Google Cloud Credit with Free MapR Hadoop Training

What does MapR get from Google? $110 million in capital financing.

What do you get for a $500 free Google Cloud credit with MapR training? Apparently quite a bit...

Compute Engine

434.524 total hours per month
VM class: Regular
Instance type: n1-highmem-16
Region: United States
Total Estimated Cost: $438.00

SSD storage: 0 GB
Storage: 100 GB
Snapshot storage: 0 GB
Storage subtotal: $4.00

Egress - Americas/EMEA: 200 GB
Egress - Asia/Pacific: 0 GB
Egress - Australia: 0 GB
Egress - China: 0 GB
Google Cloud Interconnect United States: 0 GB
Google Cloud Interconnect Europe: 0 GB
Google Cloud Interconnect Asia/Pacific: 0 GB
Egress to a different Zone in the same Region: 0 GB
Egress to a different Region within the US: 0 GB
Egress subtotal: $24.00

Monthly total: $466.00
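The estimate adds up, and it leaves a little of the credit to spare. Here is a quick check of the arithmetic using only the figures listed above. The variable names are mine, and the per-hour figure is simply the compute cost divided by the listed hours, not an official price.

```python
CREDIT = 500.00          # free credit offered with the training

compute_cost = 438.00    # n1-highmem-16 estimate
storage_cost = 4.00      # 100 GB of storage
egress_cost = 24.00      # 200 GB egress, Americas/EMEA
total_hours = 434.524    # "total hours per month" from the estimate

monthly_total = compute_cost + storage_cost + egress_cost
credit_left = CREDIT - monthly_total

# Implied blended rate per billed hour (derived, not a published price).
implied_hourly_rate = compute_cost / total_hours

print(f"Monthly total: ${monthly_total:.2f}")
print(f"Credit left over: ${credit_left:.2f}")
print(f"Implied compute rate: ${implied_hourly_rate:.3f} per hour")
```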

If you don't want 128GB of RAM and 5 servers in your cluster, you could be a peon and buy some preemptible instances to go the cheaper route.

https://cloud.google.com/compute/docs/instances/preemptible

Hadoop / HBase / Drill Training Link here...
https://www.mapr.com/company/press-releases/mapr-collaborates-google-cloud-platform-offer-500-credit-resources-mapr-fre-0

Sandbox VM download here
https://www.mapr.com/products/mapr-sandbox-hadoop/download-sandbox-drill

Wednesday, May 6, 2015

Elements of Scale

Amazing, comprehensive article around relational, NoSQL, and many other approaches to reading and writing information.

http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/

If a relational database can't solve a specific problem efficiently and in a timely manner, perhaps throwing the kitchen sink, or a data platform, at it could...
