
Saturday, March 26, 2016

Performance and LLAP in Hive

Hive 2.0 introduces LLAP (Live Long and Process). LLAP is part of the Stinger.next initiative to deliver sub-second response times for interactive analytic queries.

The proposal for this feature is here.
https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf

Interactive query response times matter when business intelligence tools issue queries directly against Hive.

When you execute a query in a database engine like SQL Server or Oracle, the first run can be expensive. Once the cache is warmed up, speed can increase dramatically. The problem rears its head when poor or non-reusable execution plans force the engine to go to disk and scan tables for every query rather than reusing plans and data caches. System configuration, indexing strategy, and statistics all contribute to the performance puzzle.

When you run a Hive query on the Tez engine, it may spin up containers in YARN to process data across the cluster. Container startup is relatively expensive, and even with Tez container reuse enabled, result fragments and query access patterns are not cached across sessions the way SQL Server and other relational database engines cache them.

Many of these background steps simply don't need to be repeated for every interactive query. JIT optimization, for instance, isn't effective unless the Java process sticks around for a while.

LLAP introduces optional daemons (long-running processes) on worker nodes to facilitate improvements to I/O, caching, and query fragment execution.  To reduce the complexity of installing the daemons on nodes, Slider can be used to distribute LLAP in the cluster as a long-running YARN application.
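
On HDP, the LLAP daemons can be generated and launched through Slider with the hive --service llap driver. A minimal sketch is below; the application name and sizing flags are illustrative and depend on your cluster and Hive build.

hive --service llap --name llap0 --instances 2 --size 4096m --xmx 3072m --cache 1024m --executors 2

In the builds I've seen, this emits a Slider application package along with a run script that starts the daemons as a long-running YARN application.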

LLAP offers parallel execution of query fragments from different queries and sessions.

Metadata is cached in memory on the JVM heap, data is cached in column chunks off-heap, and YARN remains responsible for resource management and allocation.
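
Once the daemons are running, query execution can be pointed at them with a couple of Hive settings. This is a minimal sketch; the property names are from Hive 2.0, and their defaults and accepted values vary by release.

set hive.execution.mode=llap;       -- run query fragments in LLAP daemons rather than plain Tez containers
set hive.llap.execution.mode=all;   -- send all fragments to LLAP instead of only map-side work
set hive.llap.io.enabled=true;      -- route reads through the LLAP I/O layer and its off-heap cache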

More information

Stinger Next

Hadoop Summit 2015
http://www.slideshare.net/Hadoop_Summit/llap-longlived-execution-in-hive
Bay Area Hive Contributor Meetup Presentation.

Build LLAP and launch in a Slider container on HDP 2.3
https://gist.github.com/abajwa-hw/64bd19e3c93de97b73c6
https://www.snip2code.com/Snippet/832252/Build-LLAP-on-HDP-2-3




Sunday, March 6, 2016

Connection refused when starting MySQL

This appears to be a common issue with MySQL not accepting remote connections. It cropped up for me a couple of times while installing Hortonworks HDP 2.4 and pointing the Ambari database, Hive Metastore, Oozie, and other Hadoop services at an existing MySQL instance.

Here are some of the steps I took to address the issue.

Confirm root access to MySQL
https://www.digitalocean.com/community/questions/restoring-root-access-privileges-to-mysql

Check for running mysql processes and kill any strays before restarting.
ps -A | grep mysql
kill <pid>   # stop any leftover mysqld process found above

Grant Remote Access
Edit /etc/my.cnf, adding a bind-address and port under the [mysqld] section.
#/etc/my.cnf
bind-address=0.0.0.0 # listens on all interfaces; use the server's static IP to restrict access
port=3306

Restart the service, in my case MariaDB on CentOS 7.
systemctl restart mariadb

Check the log for errors.
cat /var/log/mariadb/mariadb.log

160306 12:04:52 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.44-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server
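
If remote clients still see "connection refused," confirm the daemon is listening on the expected interface and that a remote login works. The host below is a placeholder.

netstat -tlnp | grep 3306    # should show mysqld bound to 0.0.0.0:3306
mysql -h <db_host> -u root -p    # run from another node to test remote access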


Create the Oozie and Hive Users & Databases.
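
For reference, a minimal sketch of those statements from the mysql prompt; the passwords are placeholders, and the names should match whatever Ambari is configured to use.

CREATE DATABASE hive;
CREATE USER 'hive'@'%' IDENTIFIED BY '<hive_password>';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';
CREATE DATABASE oozie;
CREATE USER 'oozie'@'%' IDENTIFIED BY '<oozie_password>';
GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'%';
FLUSH PRIVILEGES;

Granting to '%' allows connections from any host; tighten this to the specific Hadoop node addresses if your environment requires it.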

Spin up the Hive Metastore. Ambari will do this with a service restart, or you can initialize the schema manually with schematool.
export HIVE_CONF_DIR=/usr/hdp/current/hive-metastore/conf/conf.server ; /usr/hdp/current/hive-metastore/bin/schematool -initSchema -dbType mysql -userName hive -passWord <enter_hive_password_here> -verbose
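
To confirm the schema was created, schematool can also report the connection URL and schema version.

export HIVE_CONF_DIR=/usr/hdp/current/hive-metastore/conf/conf.server ; /usr/hdp/current/hive-metastore/bin/schematool -info -dbType mysql -userName hive -passWord <enter_hive_password_here>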

Helpful links