Spark – Parquet Files in Scala

Parquet files are a great format for storing large tables in Spark SQL.

  • Consider converting text files with a schema into Parquet files for more efficient storage.
  • Parquet provides a lot of optimizations under the hood to speed up your queries.
  • Just call .write.parquet on a DataFrame to encode it as Parquet.
// Define a schema with a case class, then build a DataFrame from an RDD of its instances
case class MyCaseClass(key: String, group: String, value: Int, someints: Seq[Int], somemap: Map[String, Int])

import sqlContext.implicits._  // needed for the .toDF() conversion

val dataframe = sc.parallelize(Array(MyCaseClass("a", "vowels", 1, Array(1), Map("a" -> 1)),
  MyCaseClass("b", "consonants", 2, Array(2, 2), Map("b" -> 2)),
  MyCaseClass("c", "consonants", 3, Array(3, 3, 3), Map("c" -> 3)),
  MyCaseClass("d", "consonants", 4, Array(4, 4, 4, 4), Map("d" -> 4)),
  MyCaseClass("e", "vowels", 5, Array(5, 5, 5, 5, 5), Map("e" -> 5)))
).toDF()
// now write it to disk
dataframe.write.parquet("/tmp/testScalaParquetFiles")


// Read the Parquet files back; the result is a DataFrame, not an RDD
val parquetDF = sqlContext.read.parquet("/tmp/testScalaParquetFiles")
parquetDF.registerTempTable("parquetTable1")
%sql CREATE TABLE parquetTable2
USING org.apache.spark.sql.parquet
OPTIONS (
  path "/tmp/testScalaParquetFiles"
)
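Either table now points at the same Parquet data and can be queried through Spark SQL. As a quick sketch, a sanity check against the temporary table registered above (column names taken from MyCaseClass) might look like:

// Query the temporary table registered above and print the matching rows
val result = sqlContext.sql("SELECT key, value FROM parquetTable1 WHERE value > 2")
result.show()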

One compatibility setting worth knowing about is spark.sql.parquet.binaryAsString. From the Spark SQL documentation:

“Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.”

Within Scala, you can call the following to set the configuration:

sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
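If you want to confirm the setting took effect, the current value can be read back from the same sqlContext (a small sketch):

// Verify the current value of the flag
println(sqlContext.getConf("spark.sql.parquet.binaryAsString"))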

If using SQL, you can use the set key=value notation:

set spark.sql.parquet.binaryAsString=true