Tools for Spark which we use on a daily basis. The library contains:
- Loader of HDFS files that combines small files (uses Hadoop CombineTextInputFormat/CombineFileInputFormat)
- Future: cosine calculation
- Future: quantile calculation
# Requirements

This library was successfully tested with Scala 2.11.8 and Spark 2.3.1. You also need SBT installed.
# Build

The build is based on Scala 2.11.8 and Spark 2.3.1; edit build.sbt if your environment differs.
To build, install sbt, open a terminal, change to the sparkmultitool directory and run:
```sh
sbt package
sbt test
```
Then copy spark-multitool*.jar from ./target/scala-2.11/... to the lib folder of your sbt project.
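Since lib is sbt's default directory for unmanaged jars, the copied jar is picked up automatically; your own project's build.sbt then only needs Scala and Spark. A minimal sketch, assuming the same Scala 2.11.8 / Spark 2.3.1 environment and the project name used in the spark-submit example below:

```scala
// build.sbt of a consuming project (hypothetical minimal example)
name := "tst"
version := "0.1"
scalaVersion := "2.11.8"

// spark-multitool_2.11-0.9.jar lives in lib/ as an unmanaged dependency,
// so only Spark itself has to be declared here
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1" % "provided"
```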
# Usage

Include spark-multitool*.jar in the --jars path of spark-submit like this:

```sh
spark-submit --master local --executor-memory 2G --class "Tst" --num-executors 1 --executor-cores 1 --jars lib/spark-multitool_2.11-0.9.jar target/scala-2.11/tst_2.11-0.1.jar
```
See the examples folder.
## Loaders

ru.retailrocket.spark.multitool.Loaders combines input files before the mappers by means of Hadoop CombineTextInputFormat/CombineFileInputFormat. In our case it reduced the number of mappers from 100000 to approximately 3000 and made the job significantly faster. Parameters:
- path - path to the files (as in sc.textFile)
- size - size of the target partition in megabytes; the optimal value equals the HDFS block size
- delim - line delimiter
This example loads files from "/test/*" and combines them in the mappers:
```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

import ru.retailrocket.spark.multitool.Loaders._

object Tst {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = new SparkContext(conf)
    val path = "file:///test/*"

    {
      // combine small files into larger splits and get an RDD[String] of lines
      val sessions = sc
        .forPath(path)
        .setSplitSize(256)     // optional, target split size in megabytes
        .setRecordDelim("\n")  // optional, line delimiter
        .combine()
      println(sessions.count())
    }

    {
      // you can also get an RDD[(String, String)] with (file, line) pairs
      val sessions = sc
        .forPath(path)
        .combineWithPath()
      println(sessions.count())
    }

    {
      // or add a path filter, e.g. for partitioning
      class FileNameEqualityFilter extends Filter {
        def check(rules: Traversable[Filter.Rule], path: Array[String]) = {
          rules.forall {
            case (k, Array(eq)) =>
              k match {
                case "file" => eq == path.last
                case _ => false
              }
          }
        }
      }

      val sessions = sc
        .forPath(path)
        .addFilter(classOf[FileNameEqualityFilter], Seq("file" -> Array("file.name")))
        .combine()
      println(sessions.count())
    }
  }
}
```
## Algorithms
ru.retailrocket.spark.multitool.algs.cosine - cosine similarity function.
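The exact signature of algs.cosine is not documented here, so for reference only, a plain Scala sketch of cosine similarity between two sparse vectors represented as Maps (the representation and function name are assumptions, not the library's API):

```scala
// Hypothetical reference implementation of cosine similarity;
// the library's algs.cosine may use a different representation.
def cosine(a: Map[Long, Double], b: Map[Long, Double]): Double = {
  // dot product over the keys the two sparse vectors share
  val dot = a.keySet.intersect(b.keySet).toSeq
    .map { k => a(k) * b(k) }
    .sum
  val normA = math.sqrt(a.values.map(x => x * x).sum)
  val normB = math.sqrt(b.values.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// cosine(Map(1L -> 1.0, 2L -> 2.0), Map(2L -> 2.0, 3L -> 1.0)) == 0.8
```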
## Utility

ru.retailrocket.spark.multitool.HashFNV - a simple but useful hash function. The original idea comes from org.apache.pig.piggybank.evaluation.string.HashFNV.
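For reference, a minimal sketch of a 32-bit FNV-1a hash; which variant, bit width and interface the library's HashFNV actually uses is not documented here, so the names and constants below are illustrative only:

```scala
// Illustrative 32-bit FNV-1a hash; the library's HashFNV class may
// use a different variant, signature or bit width.
object FnvSketch {
  private val OffsetBasis = 0x811c9dc5 // 2166136261 as a signed Int
  private val Prime = 16777619

  def fnv1a32(s: String): Int = {
    var hash = OffsetBasis
    for (b <- s.getBytes("UTF-8")) {
      hash ^= (b & 0xff)  // xor with the next byte
      hash *= Prime       // then multiply by the FNV prime
    }
    hash
  }
}

// FnvSketch.fnv1a32("spark") yields a stable Int, usable e.g. for bucketing keys
```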