Cloudgene Tutorial

Tutorial leaders: Sebastian Schoenherr <sebastian.schoenherrATuibk.ac.at> and Lukas Forer <lukas.forerATi-med.ac.at>.

Prerequisites

This tutorial gives an introduction to Hadoop MapReduce and shows how different applications can be integrated into our workflow system Cloudgene. A Cloudgene virtual machine image is available that provides a pre-configured environment with all dependencies (Hadoop Cluster CDH3u6; Eclipse; Cloudgene). It can be imported into, e.g., VirtualBox, and all exercises will be run in it. If you do not have VirtualBox installed, you can find it here: https://www.virtualbox.org/

The actual virtual machine image is available here. Please download and extract it before the workshop, as it is quite large. To import it into VirtualBox (4.3.20), go to "File" > "Import Appliance" and select the .ova file.

Log in to the VM with username "adminuser" and password "adminuser".

Tutorial

1) HDFS - Hadoop Distributed File System

HDFS is the default Hadoop file system. It is a distributed file system designed to run on commodity hardware. HDFS is the input source for all MapReduce jobs.
Execute the following commands. All commands can be found here.

  • Create a directory: hadoop fs -mkdir <hdfs-path>

  • Put a file into that directory: hadoop fs -put <local-path> <hdfs-path>

  • List the contents of the directory: hadoop fs -ls <hdfs-path>

  • Export the contents of an HDFS folder to the local (POSIX) file system as one merged file: hadoop fs -getmerge <hdfs-path> <local-path>

2) First Hadoop MapReduce Program: WordCount

WordCount is a simple application that counts the number of occurrences of each word in a given input set.

  • Open Eclipse (shortcut on the desktop) to view the code, which is written in Java

  • Import Project

    File -> Import -> Git -> Project from Git -> Local -> Select Project
    • Map: The Mapper implementation processes one line at a time. It splits the line into tokens separated by whitespace and emits a key-value pair <word, 1> for each token.

    • Reduce: The Reducer implementation, via the reduce method, sums up the values for each key (word).

  • Build: running pom.xml (the Maven build file) creates a jar file in the target directory.
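
The Map and Reduce steps described above can be illustrated with a minimal plain-Java sketch. This is not the code from the tutorial project and it uses no Hadoop classes: the class name WordCountSketch and the wordCount driver method are hypothetical, and the driver only imitates the grouping by key that the Hadoop framework performs between the map and reduce phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of the WordCount data flow (assumption: no Hadoop API is used;
// all names in this class are illustrative and not taken from the tutorial project).
public class WordCountSketch {

    // Map step: split one line on whitespace and emit a <word, 1> pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // Reduce step: sum up all values that were emitted for one key.
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: stands in for the framework by grouping the emitted pairs by key
    // and then calling reduce once per key.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello cloudgene");
        // Print one "word<TAB>count" line per distinct word, sorted by word.
        wordCount(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In a real Hadoop job the map and reduce methods would live in Mapper and Reducer subclasses and write their output through the framework's context object; the sketch only mirrors the logic so that it can be run without a cluster.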

3) Execute Hadoop MapReduce use case on command line

  • Execute WordCount / BioWordCount:

    • hadoop jar <jar-file> <main-class> <hdfs-input> <hdfs-output>
      hadoop fs -getmerge <hdfs-results> <local-path>

    • hadoop jar /home/adminuser/git/mapreduce/Examples/target/Samples-0.1.jar wordcount.WordCount workspace/seb/wordcount-input workspace/seb/out
      hadoop fs -getmerge workspace/seb/out wordcount-out-local.txt

    • hadoop jar /home/adminuser/git/mapreduce/Examples/target/Samples-0.1.jar biowordcount.BioWordCount workspace/seb/biowordcount-input workspace/seb/out1
      hadoop fs -getmerge workspace/seb/out1 bio-out-local.txt
  • Open http://localhost:50030 for information on the executed job

4) Executing MapReduce with Cloudgene

  • Connect WordCount with Cloudgene: Write a simple YAML file including steps, input and output as well as parameters.
    The file can be found here: vi /home/adminuser/cloudgene/apps/uppsala/wordcount.yaml

  • Now start Cloudgene and execute WordCount, including the HDFS data import:
    cd /home/adminuser/cloudgene;
    ./cloudgene --mode private

  • Log in to Cloudgene: http://localhost:8082 (seb/seb)

  • Execute WordCount

5) Cloudgene - Other Use Cases

We integrated different applications into Cloudgene.

  • Find them in /home/adminuser/cloudgene/apps/uppsala and look for *.yaml files
  • Try to execute the other applications within Cloudgene ("YAML only" means that the application cannot actually be executed right now)
Topic revision: r5 - 2015-01-20 - EInfraMPS2015Org
 