The Apache Spark community has opened 2016 by announcing the availability of Apache Spark 1.6, the seventh release on the 1.x line.
- Contributors – The number of contributors to Spark has crossed 1000, double the previous count.
- Patches – Apache Spark 1.6 includes 1000 patches.
- Run SQL queries on files – This feature lets users and applications run SQL queries on files directly, without first creating a table. It is similar to a feature available in Apache Drill. For example: select id from json.`path/to/json/files` as j.
- Star (*) expansion for StructTypes – This feature makes it easier to nest and unnest arbitrary numbers of columns. It is common for customers to regularly extract updated data from an external data source (e.g. MySQL or Postgres). This was already possible, and the new release makes it easier through some small improvements to the analyzer. The goal is to let users run two kinds of queries, along with their DataFrame equivalents: one that finds the most recent record for each key, and one that unnests the struct produced by that GROUP BY query.
- Parquet Performance – Parquet has been one of the most commonly used data formats with Apache Spark, and Parquet scan performance has a large impact on many large applications.
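
The two star-expansion queries mentioned in the list above are not reproduced in this summary. As a rough illustration of what they compute, the plain-Python sketch below emulates the logic without using Spark at all. Everything in it is an assumption made for the example: the `Row` fields (`key`, `ts`, `value`), the sample data, the helper names, and the query shapes quoted in the comments. It relies on the fact that Python tuples, like SQL structs, compare field by field, so taking the maximum of a (timestamp, record) pair per key selects the most recent record.

```python
from collections import namedtuple

# Hypothetical rows standing in for update data pulled from an external
# source; the field names are assumptions made for this sketch.
Row = namedtuple("Row", ["key", "ts", "value"])

rows = [
    Row("a", 1, "first"),
    Row("a", 3, "latest-a"),
    Row("b", 2, "latest-b"),
    Row("a", 2, "middle"),
]


def most_recent_per_key(rows):
    """Roughly emulates a query of the (assumed) shape:
    SELECT key, max(struct(ts, *)) AS newest FROM updates GROUP BY key
    """
    newest = {}
    for row in rows:
        # Timestamp first, then the whole record, so comparison is by ts.
        candidate = (row.ts, row)
        if row.key not in newest or candidate > newest[row.key]:
            newest[row.key] = candidate
    return newest


def unnest(newest):
    """Roughly emulates `SELECT newest.* ...`: drop the leading sort field
    and return the original columns as flat rows, ordered by key."""
    return sorted(record for _ts, record in newest.values())


latest = unnest(most_recent_per_key(rows))
for row in latest:
    print(row)
```

In Spark itself the nesting and unnesting would be written with `struct(...)` and `alias.*` in SQL or the DataFrame API; the sketch only shows the intended result, one flat, most recent row per key.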