hadoop - How to run a Spark Java program


I have written a Java program for Spark. How do I run and compile it from the Unix command line? Do I have to include any jar while compiling and running?

Combining the steps from the official Quick Start Guide and Launching Spark on YARN, we get:

We'll create a very simple Spark application, SimpleApp.java:

/*** SimpleApp.java ***/
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "$YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
      "$YOUR_SPARK_HOME", new String[]{"target/simple-project-1.0.jar"});
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}

This program just counts the number of lines containing 'a' and the number containing 'b' in a text file. Note that you'll need to replace $YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala example, we initialize a SparkContext, though we use the special JavaSparkContext class to get a Java-friendly one. We also create RDDs (represented by JavaRDD) and run transformations on them. Finally, we pass functions to Spark by creating classes that extend spark.api.java.function.Function. The Java programming guide describes these differences in more detail.
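As an aside, if the anonymous inner classes above get unwieldy, the same filter can be written as a small named class that extends Function. This is only an illustrative sketch against the 0.9.0 Java API described above; ContainsFilter is a hypothetical helper name, not something provided by Spark.

/*** ContainsFilter.java -- hypothetical helper, illustrative sketch only ***/
import org.apache.spark.api.java.function.Function;

public class ContainsFilter extends Function<String, Boolean> {
  private final String token;

  public ContainsFilter(String token) {
    this.token = token;
  }

  // Returns true for lines that contain the given token
  public Boolean call(String s) {
    return s.contains(token);
  }
}

With this in place, the counts above could be written as logData.filter(new ContainsFilter("a")).count() and logData.filter(new ContainsFilter("b")).count().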

To build the program, we also write a Maven pom.xml file that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.

<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>0.9.0-incubating</version>
    </dependency>
  </dependencies>
</project>

If you also wish to read data from Hadoop's HDFS, you will need to add a dependency on hadoop-client for your version of HDFS:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>...</version>
</dependency>
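For illustration only: with hadoop-client on the classpath, the same JavaSparkContext from SimpleApp can read an HDFS path simply by passing an hdfs:// URI to textFile instead of a local path. The namenode host, port, and file path below are placeholders for your cluster, not values taken from this setup.

// Hedged sketch, inside SimpleApp.main() after sc has been created:
// the URI below is a placeholder -- substitute your namenode and file path.
JavaRDD<String> hdfsData = sc.textFile("hdfs://namenode:8020/user/you/README.md").cache();
System.out.println("Lines read from HDFS: " + hdfsData.count());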

We lay out these files according to the canonical Maven directory structure:

$ find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java

Now, we can execute the application using Maven:

$ mvn package
$ mvn exec:java -Dexec.mainClass="SimpleApp"
...
Lines with a: 46, lines with b: 23

And then follow the steps from Launching Spark on YARN:

Building a YARN-Enabled Assembly JAR

We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster. This can be built by setting the Hadoop version and the SPARK_YARN environment variable, as follows:

SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

The assembled JAR will be something like this: ./assembly/target/scala-2.10/spark-assembly_0.9.0-incubating-hadoop2.0.5.jar.

The build process now also supports new YARN versions (2.2.x). See below.

Preparations

  • Building a YARN-enabled assembly (see above).
  • The assembled JAR can be installed into HDFS or used locally.
  • Your application code must be packaged into a separate JAR file.

If you want to test out the YARN deployment mode, you can use the current Spark examples. A spark-examples_2.10-0.9.0-incubating file can be generated by running:

sbt/sbt assembly  

NOTE: since the documentation you're reading is for Spark version 0.9.0-incubating, we are assuming here that you have downloaded Spark 0.9.0-incubating or checked it out of source control. If you are using a different version of Spark, the version numbers in the JAR generated by the sbt package command will obviously be different.

Configuration

Most of the configs are the same for Spark on YARN as for other deploys. See the configuration page for more information on those. These are the configs that are specific to Spark on YARN.

Environment variables:

  • SPARK_YARN_USER_ENV, to add environment variables to the Spark processes launched on YARN. This can be a comma-separated list of environment variables, e.g. (a short Java check follows the example below):
SPARK_YARN_USER_ENV="JAVA_HOME=/jdk64,FOO=bar"
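To make the FOO=bar example concrete (this check is my own illustration, not from the docs): a variable passed through SPARK_YARN_USER_ENV should be visible to the Spark processes launched on YARN through the ordinary Java environment lookup.

// Hedged sketch: code running in a YARN-launched Spark process could verify
// that the variable from SPARK_YARN_USER_ENV arrived. Expected output: FOO=bar
System.out.println("FOO=" + System.getenv("FOO"));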

System properties (one way to set them from application code is sketched after this list):

  • spark.yarn.applicationMaster.waitTries, a property that sets the number of times the ApplicationMaster waits for the Spark master, and also the number of tries it waits for the SparkContext to be initialized. Default is 10.
  • spark.yarn.submit.file.replication, the HDFS replication level for the files uploaded into HDFS for the application. These include things like the Spark JAR, the app JAR, and any distributed cache files/archives.
  • spark.yarn.preserve.staging.files, set to true to preserve the staged files (Spark JAR, app JAR, distributed cache files) at the end of the job rather than deleting them.
  • spark.yarn.scheduler.heartbeat.interval-ms, the interval in ms at which the Spark application master heartbeats into the YARN ResourceManager. Default is 5 seconds.
  • spark.yarn.max.worker.failures, the maximum number of worker failures before failing the application. Default is the number of workers requested times 2, with a minimum of 3.
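As referenced above, one way to set these spark.yarn.* properties from application code is sketched below. This is only a sketch, relying on the 0.9.0-era convention that spark.* Java system properties are read if they are set before the SparkContext is created; the values shown are arbitrary examples, not recommendations.

// Hedged sketch: set spark.* system properties before the JavaSparkContext
// is constructed (the values here are arbitrary examples).
System.setProperty("spark.yarn.max.worker.failures", "6");
System.setProperty("spark.yarn.preserve.staging.files", "true");
// ... then create the JavaSparkContext as in SimpleApp above.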

Launching Spark on YARN

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster. This is used to connect to the cluster, write to the dfs, and submit jobs to the resource manager.

There are two scheduler modes that can be used to launch a Spark application on YARN.

Launch a Spark application by the YARN Client with yarn-standalone mode.

The command to launch the YARN Client is as follows:

SPARK_JAR=<SPARK_ASSEMBLY_JAR_FILE> ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar <YOUR_APP_JAR_FILE> \
  --class <APP_MAIN_CLASS> \
  --args <APP_MAIN_ARGUMENTS> \
  --num-workers <NUMBER_OF_WORKER_MACHINES> \
  --master-class <ApplicationMaster_CLASS> \
  --master-memory <MEMORY_FOR_MASTER> \
  --worker-memory <MEMORY_PER_WORKER> \
  --worker-cores <CORES_PER_WORKER> \
  --name <application_name> \
  --queue <queue_name> \
  --addJars <any_local_files_used_in_SparkContext.addJar> \
  --files <files_for_distributed_cache> \
  --archives <archives_for_distributed_cache>

For example:

# Build the Spark assembly JAR and the Spark examples JAR
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

# Configure logging
$ cp conf/log4j.properties.template conf/log4j.properties

# Submit Spark's ApplicationMaster to YARN's ResourceManager, and instruct Spark to run the SparkPi example
$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.5-alpha.jar \
    ./bin/spark-class org.apache.spark.deploy.yarn.Client \
      --jar examples/target/scala-2.10/spark-examples-assembly-0.9.0-incubating.jar \
      --class org.apache.spark.examples.SparkPi \
      --args yarn-standalone \
      --num-workers 3 \
      --master-memory 4g \
      --worker-memory 2g \
      --worker-cores 1

# Examine the output (replace $YARN_APP_ID in the following with the "application identifier" output by the previous command)
# (Note: YARN_APP_LOGS_DIR is usually /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version.)
$ cat $YARN_APP_LOGS_DIR/$YARN_APP_ID/container*_000001/stdout
Pi is roughly 3.13794

The above starts a YARN Client program which starts the default Application Master. SparkPi will then run as a child thread of the Application Master; the YARN Client periodically polls the Application Master for status updates and displays them in the console. The client will exit once the application has finished running.

With this mode, your application is actually run on the remote machine where the Application Master runs. Thus applications that involve local interaction will not work well, e.g. spark-shell.
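Tying this back to the original question: to submit your own Java program, rather than the SparkPi example, through the YARN Client shown above, a common pattern is to let the program take the master URL as its first argument, just as SparkPi does. The class below is only a hedged sketch of such a variant of SimpleApp (SimpleYarnApp is a hypothetical name, and the log file path is a placeholder):

/*** SimpleYarnApp.java -- hypothetical sketch, not from the official docs ***/
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SimpleYarnApp {
  public static void main(String[] args) {
    // args[0] is the master, e.g. "local" for local testing or "yarn-standalone"
    // when launched via org.apache.spark.deploy.yarn.Client --args yarn-standalone
    String master = args[0];
    String logFile = "README.md"; // placeholder: a path reachable from the cluster (e.g. on HDFS)

    JavaSparkContext sc = new JavaSparkContext(master, "Simple Yarn App",
        System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(SimpleYarnApp.class));

    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    System.out.println("Lines with a: " + numAs);
  }
}

The JAR built from this class would then be passed to --jar and the fully qualified class name to --class in the launch command above.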

