
Friday, April 2, 2021

Great Expectations with Databricks - Data Quality Matters

Managing data quality and testing and profiling data with Databricks comes up often when dealing with Data Assets.  Testing code and applying code coverage metrics is common practice, but what about coverage on your data?

There are a few tools out there for testing, profiling, and managing the quality of data pipelines.  In this post I'll talk about one Python tool, Great Expectations, and an awesome blog from a data scientist working with Spark and tools like it.

Great Expectations

Great Expectations is a pipeline data quality and data profiling library and scaffolding tool.  

If you're comfortable working with Spark or Pandas dataframes, you'll be comfortable working with this framework.  On first impression, the setup isn't quite notebook-friendly with its wizard-based prompts, so it's best to try running it locally first.  There is also a lot going on in this framework; take the time to dig into its features.

Once I got the framework installed, I was quickly able to set up Expectations against both Spark and Pandas dataframes.  Expectations are assertions about your data, and they can be packaged into Suites.  Great Expectations provides a long list of expectations for use in automated data testing and profiling; here's the list, the Glossary of Expectations.

Before you start writing code to validate some json in a column, check out expect_column_values_to_be_json_parseable or expect_column_values_to_match_json_schema.
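As a quick, hedged sketch (this uses the dataset-style API from the Great Expectations releases current around the time of writing, roughly 0.13; newer versions push toward Validators, so method names may shift):

import great_expectations as ge
import pandas as pd

# Wrap an ordinary Pandas dataframe so Expectations can be called on it directly.
pdf = pd.DataFrame({
    "id": [1, 2, 3],
    "payload": ['{"a": 1}', '{"a": 2}', 'not json'],
})
gdf = ge.from_pandas(pdf)

# Each call is an assertion about the data and returns a result with a success flag.
print(gdf.expect_column_values_to_not_be_null("id"))
print(gdf.expect_column_values_to_be_json_parseable("payload"))

# The expectations evaluated so far can be bundled into a Suite for reuse.
print(gdf.get_expectation_suite(discard_failed_expectations=False))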

Are you working on a machine learning project and need to verify some results statistically?  There are expect_column_stdev_to_be_between and expect_column_proportion_of_unique_values_to_be_between (with their at_least / at_most bounds), or perhaps something more:

Kullback-Leibler divergence?
expect_column_kl_divergence_to_be_less_than

Bootstrapped Kolmogorov-Smirnov test?
expect_column_bootstrapped_ks_test_p_value_to_be_greater_than
expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than

Chi-squared test?
expect_column_chisquare_test_p_value_to_be_greater_than

Note that some of these expectations may have big data issues until they mature a bit more.
See https://github.com/great-expectations/great_expectations/issues/2277
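Here's a minimal, hedged example of a couple of the statistical expectations (same dataset-style API assumption as above; the thresholds are made up for illustration):

import great_expectations as ge
import pandas as pd

pdf = pd.DataFrame({"score": [0.1, 0.4, 0.35, 0.8, 0.2, 0.55]})
gdf = ge.from_pandas(pdf)

# Assert the standard deviation falls inside an expected band...
print(gdf.expect_column_stdev_to_be_between("score", min_value=0.05, max_value=0.5))

# ...and that the proportion of unique values stays high (no surprise duplication).
print(gdf.expect_column_proportion_of_unique_values_to_be_between("score", min_value=0.9, max_value=1.0))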

Datasources are used to interact with Batches of data and to apply Validators that evaluate Expectations or Expectation Suites.

Checkpoints are used to run validations, tests, and other follow-up actions.  Stores and the Data Context configuration provide locations for configuration, metrics, validation results, and documentation.

Great Expectations could also be considered the Sphinx docs tool for data.  It includes Site and Page builders, Renderers, and other tools to auto-generate documentation for data Batches.

A Profiler is available for scaffolding expectations and building collections of metrics.

Great Expectations is compatible with Databricks and PySpark.  I was also able to get portions of the framework set up in Google Colaboratory, with Spark and Airflow(!), for experimentation.

https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_contexts/how_to_instantiate_a_data_context_on_a_databricks_spark_cluster.html
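On Databricks, the quickest way I found to experiment without the full Data Context setup is to wrap an existing Spark dataframe directly.  A rough sketch, assuming the SparkDFDataset class that ships with the 0.13-era releases:

from great_expectations.dataset import SparkDFDataset
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` on Databricks
sdf = spark.createDataFrame([(1, "a"), (2, "b"), (3, None)], ["id", "label"])

# Wrap the Spark dataframe; the same Expectation methods become available on it.
gdf = SparkDFDataset(sdf)
print(gdf.expect_column_values_to_not_be_null("id"))
print(gdf.expect_column_values_to_not_be_null("label"))  # reports a failure for the null row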

One addition to the framework is a Data Dictionary plugin.  If you're using comments or metadata in tables and columns, or would like to manage these separately for Data Assets, this could be one tool to look at.  Another would be services such as Azure Purview.

There's also a markdown renderer, so you can publish your data documentation to a code wiki or browse with tools like https://typora.io/.

Here's the latest documentation on Read the Docs.

Justin Matters

Justin Matters, a data scientist and developer from Edinburgh, UK, has some excellent articles on Databricks and PySpark that may help with standardizing data pipelines, testing, and data quality.  I highly recommend reading his blog posts.  Here are a few I've put in my sandbox for later testing.

Refactoring code with curried functions
https://justinmatters.co.uk/wp/building-a-custom-data-pipeline-using-curried-functions/


Spark gotchas and nullability


SQL to Pyspark Cheat Sheet

Are Dataframes Equal
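Justin's post on dataframe equality covers this properly; as a rough sketch of the general idea (not his implementation), an order-insensitive comparison can be built on exceptAll, which requires Spark 2.4+:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df2 = spark.createDataFrame([(2, "b"), (1, "a")], ["id", "label"])

def dataframes_equal(a, b):
    # Same schema, and neither side has rows (including duplicates) the other lacks.
    return (a.schema == b.schema
            and a.exceptAll(b).rdd.isEmpty()
            and b.exceptAll(a).rdd.isEmpty())

print(dataframes_equal(df1, df2))  # True: same rows, different order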


Try These Out with Databricks

Databricks provides a Community Edition to get you started, or you can spin up an instance in Azure, AWS, or GCP.






Wednesday, January 31, 2018

Service Auto-Start for Ambari and HDP/HDF clusters

The first time you reboot a Hortonworks HDP/HDF cluster node, you will notice that some services do not auto-start by default.  This may include the Ambari Server and Agent, depending on how they were initially configured.

These can be managed in a few different ways.
https://community.hortonworks.com/content/supportkb/151076/how-to-enable-ambari-server-auto-start-on-rhelcent.html

Ambari UI makes it easy with the auto-start feature.

https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.0.0/bk_ambari-operations/content/enable_service_auto_start.html

https://cwiki.apache.org/confluence/display/AMBARI/Recovery%3A+auto+start+components
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=41812517

There are some related JIRAs and documentation around this.
https://issues.apache.org/jira/browse/AMBARI-10029
https://community.hortonworks.com/questions/825/how-to-write-cluster-startup-shutdown-order-and-sc.html

Note that the startup order of services on larger clusters, and delays in starting some services, may cause issues with this feature.  I have seen situations where HDFS needs a little more time to start successfully before other services can properly register themselves.

Saturday, December 16, 2017

IBM and Hortonworks Consolidate Offerings at DataWorks Summit

At DataWorks Summit this year, a few announcements were made.  One in particular further consolidates the Hadoop distributions and makes Hortonworks Data Platform (HDP) an even more compelling offering.

https://hortonworks.com/press-releases/ibm-hortonworks-expand-partnership/

IBM and Hortonworks are both members of the ODPi, and now they are offering IBM Data Science Experience and IBM Big SQL as packaged offerings with HDP.

In addition, IBM is migrating BigInsights customers to HDP, consolidating IBM BigIntegrate, IBM BigQuality, and IBM Information Governance Catalog into Apache Atlas, and continuing to contribute to open source platforms including Apache Spark and SystemML.

IBM has at least 4 official Apache Spark committers, and Hortonworks has 2.  When I looked at this list in April 2014, neither company had committers; the list of committers has almost doubled since then.  Mridul Muralidharan joined Hortonworks from Yahoo!, Nick Pentreath joined IBM from Mxit, and Prashant Sharma joined IBM from Databricks.

IBM, Databricks, and Hortonworks are by far the top contributing companies to PySpark 2.0.  Two years ago IBM went all-in on Spark, calling it "Potentially the Most Significant Open Source Project of the Next Decade".

Another announcement was the inclusion of the Hortonworks Schema Registry for Kafka, Storm, and NiFi.  Similar to https://github.com/confluentinc/schema-registry, it distinguishes itself from the competition by providing pluggable storage of schemas in MySQL or Postgres, a web-based UI, and search capabilities.

The question that popped into my head right away is why didn't they just extend the Hive metastore to become the Schema Registry for all things streaming, and provide tumbling windows on Kafka and Storm from Hive?  This would have been an awesome addition to the Hive StorageHandlers.

There's always HiveKa if anyone wants to pick it up...

The latest HDF 3.0 was also announced.  One component that brought some excitement was the generically named Streaming Analytics Manager.  Its GUI-based design is a bit similar to NiFi, with the addition of Dashboards, the aforementioned Schema Registry, and monitoring views.  This tool tries to democratize the creation and management of streaming data sources.

Data in motion is the story of 2017 and beyond.


Spark Classes and Resources

There's a lot of material available for Spark MLlib (the RDD-based API), though this API may be deprecated with the next release, i.e. 2.3.
https://cognitiveclass.ai/courses/spark-mllib/

Spark ML is the DataFrame-based API.  There are fewer training resources for it than for core Spark, mostly MOOCs on edX, DataCamp, and Udemy.
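For orientation, a minimal DataFrame-based Pipeline in PySpark looks roughly like this (plain open-source Spark, toy data):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [(1.0, 0.0, 1.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)],
    ["f1", "f2", "label"])

# Assemble feature columns into a vector, then fit a classifier; stages are chained in a Pipeline.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("label", "prediction").show()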

There is Spark ML training at Strata (full videos are available on safaribooksonline.com) and a few more courses on Safari from various authors and publications.

Great resource for anything Spark
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-mllib/spark-mllib-pipelines.html

https://mapr.com/training/certification/mcsd/

A topic-centric list of high-quality open datasets:

https://github.com/caesar0301/awesome-public-datasets

Subscribe to the Spark user email list or review the archives.

http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark+ml&days=0&sort=date
https://spark.apache.org/community.html

Databricks was founded by the creators of Spark and is its largest contributor.
https://databricks.com/training/courses/apache-spark-for-machine-learning-and-data-science

UC Berkeley, Hortonworks, IBM, and Cloudera are home to other top Spark committers.

Berkeley, the granddaddy of MLlib, has some courses.
http://mlbase.org/

Hortonworks
https://hortonworks.com/apache/spark/

IBM
https://www.ibm.com/ca-en/marketplace/spark-as-a-service

Cloudera
https://university.cloudera.com/instructor-led-training/introduction-to-machine-learning-with-spark-ml-and-mllib (paid)

Deep Learning
https://github.com/databricks/spark-deep-learning

Databricks repos
https://github.com/databricks

Spark Roadmap
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-ml-roadmap-2-3-0-and-beyond-td22892.html#a22972


Certifications search on GitHub
https://github.com/search?l=Markdown&q=spark+ml+certification&type=Code&utf8=%E2%9C%93

Apache Spark Meetups
https://spark.apache.org/community.html

Friday, February 24, 2017

Azure Data Lake Analytics

Microsoft Azure Data Lake Analytics and Data Lake Store offerings provide an alternative and complementary solution to Azure HDInsight & Hortonworks HDP.

Azure Data Lake Analytics (ADLA) provides the U-SQL language (think Pig + SQL + C# + more), based on Microsoft's internal language Scope, which is used for tools like Bing Search.  It has the same concepts as Hadoop: schema on read, custom reducers, extractors/SerDes, etc.  A component of ADLA is based on Microsoft's internal job scheduler and compute engine, Cosmos, and ADLA uses Apache YARN to schedule jobs and manage its in-memory components.

Azure Data Lake Store (ADLS) is the storage layer for ADLA; it behaves more like HDFS and uses WebHDFS / Apache Hadoop conventions behind the scenes.  ADLA includes the concepts of Tables, Views, Stored Procedures, Table-Valued Functions, and Partitions, and stores these objects in its internal metastore catalog, similar to Hive.

Currently ADLA supports TSV/CSV formats out of the box, with extensions for JSON and the ability to write custom extractors for pretty much any format you could read with .NET or the .NET SDK for Hadoop.

A USQL Script looks something like this:

DECLARE EXTERNAL @inputfile string = "myinputdir/myinputfile";
DECLARE EXTERNAL @outputlocation string = "myoutputdir/myoutputfile.tsv";

// Schema-on-read: project the file into a rowset with typed columns.
@indataset =
    EXTRACT col1 string,
            col2 int?
    FROM @inputfile
    USING Extractors.Tsv(skipFirstNRows:1, silent:false);

// Replace missing values in the nullable column with 0.
@outdataset =
    SELECT col1,
           (col2 == null) ? 0 : col2.Value AS col2filled
    FROM @indataset;

OUTPUT @outdataset
TO @outputlocation
USING Outputters.Tsv(outputHeader:true, quoting:false);

One problem I have with USQL is the name.  Every search on Google comes back with "We searched for SQL. Did you mean USQL?"

USQL uses C# syntax and .NET data typing, and it includes code-behind and custom assemblies.
A USQL script job can be submitted either locally for testing or to Azure Data Lake Analytics.  It is a batch process, and interactive functionality is limited.

For those familiar with using hdfs / hadoop commands, there is Python shell development in progress against ADLS with some familiar commands.

cat    chmod  close  du      get   help  ls     mv   quit  rmdir  touch
chgrp  chown  df     exists  head  info  mkdir  put  rm    tail
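Those commands map onto the azure-datalake-store Python package.  A rough sketch, with placeholder credentials and a store name of my own invention (the SDK is still evolving, so treat the calls as approximate):

# Placeholder tenant/app credentials and store name -- substitute your own.
from azure.datalake.store import core, lib

token = lib.auth(tenant_id="my-tenant-id",
                 client_id="my-app-id",
                 client_secret="my-app-secret")
adl = core.AzureDLFileSystem(token, store_name="mydatalakestore")

# Familiar filesystem-style operations against the store.
print(adl.ls("/"))
adl.mkdir("/sandbox")
print(adl.exists("/sandbox"))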

As with any Azure service, you can also use the Azure xplat CLI, PowerShell, and the Web APIs.

Wednesday, November 2, 2016

What's trending in the world of GitHub and Open Source?

GitHub has a trove of information about its organizations, committers, repos, code and issues.

GitHub Archive maintains per-hour stats on 28 event types hooking into repo activities across the platform.  It includes things like a committer's login, URL, organization, followers, gists, starred repos, and a history of all their public coding activity.

The October 2016 archive table is 29M rows and 70GB of data.
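GitHub Archive is also exposed as a public BigQuery dataset, which is the easiest way to poke at these numbers yourself.  A hedged sketch (the githubarchive monthly table naming is as I recall it, and you need your own GCP project and credentials):

from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials/project

# Count October 2016 events by type against the public githubarchive dataset.
sql = """
    SELECT type, COUNT(*) AS events
    FROM `githubarchive.month.201610`
    GROUP BY type
    ORDER BY events DESC
"""
for row in client.query(sql).result():
    print(row.type, row.events)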

In September 2016, Microsoft became the largest "open-source" contributor organization on GitHub, largely due to its custom API integration using Azure services and a rather elegant management system for its employees and repos.  If you can onboard all developers in a company the size of Microsoft, and automate repository setup and discovery, you will quickly become the largest contributor.

Microsoft beat out Docker, Angular, Google, Atom, FortAwesome, Elastic, and even Apache.

