Amazon Hadoop EMR & custom input file format
I'm having a bit of trouble getting Amazon EMR to accept a custom input file format:
```java
public class Main extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new JobConf(), new Main(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        System.out.println("Input path: " + inputPath + "\n");
        System.out.println("Output path: " + outputPath + "\n");

        Configuration conf = getConf();
        Job job = new Job(conf, "ProcessDocs");
        job.setJarByClass(Main.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setInputFormatClass(XmlInputFormat.class);

        TextInputFormat.setInputPaths(job, inputPath);
        TextOutputFormat.setOutputPath(job, outputPath);

        job.waitForCompletion(true);
        return 0;
    }
}
```

Looking at the log file:
```
2012-06-04 23:35:20,053 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: null
2012-06-04 23:35:20,054 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to: 6
2012-06-04 23:35:20,054 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 1
2012-06-04 23:35:20,767 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process: 1
2012-06-04 23:35:20,813 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2012-06-04 23:35:20,886 WARN com.hadoop.compression.lzo.LzoCodec (main): Could not find build properties file with revision hash
2012-06-04 23:35:20,886 INFO com.hadoop.compression.lzo.LzoCodec (main): Loaded & initialized native-lzo library [hadoop-lzo rev unknown]
2012-06-04 23:35:20,906 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library is available
2012-06-04 23:35:20,906 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library loaded
2012-06-04 23:35:22,240 INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201206042333_0001
```

It seems that Hadoop on EMR assumes the default InputFileFormat reader... What am I doing wrong?
**Note:** I get no errors from Hadoop regarding the availability of the XmlInputFormat class.

**Note 2:** I see `<property><name>mapreduce.inputformat.class</name><value>com.xyz.XmlInputFormat</value></property>` in the jobs/some_job_id.conf.xml file.
Update:
```java
public class XmlInputFormat extends TextInputFormat {

    public static final String START_TAG_KEY = "xmlinput.start";
    public static final String END_TAG_KEY = "xmlinput.end";

    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        System.out.println("Creating a new 'XmlRecordReader'");
        return new XmlRecordReader((FileSplit) split, context.getJobConf());
    }

    /*
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit inputSplit, JobConf jobConf, Reporter reporter) throws IOException {
        return new XmlRecordReader((FileSplit) inputSplit, jobConf);
    }
    */

    /**
     * XmlRecordReader class to read through a given xml document and output
     * xml blocks as records, as specified by the start tag and end tag
     */
    public static class XmlRecordReader implements RecordReader<LongWritable, Text> {
        private final byte[] startTag;
        private final byte[] endTag;
        private final long start;
        private final long end;
        private final FSDataInputStream fsin;
        private final DataOutputBuffer buffer = new DataOutputBuffer();

        public XmlRecordReader(FileSplit split, JobConf jobConf) throws IOException {
            startTag = jobConf.get(START_TAG_KEY).getBytes("utf-8");
            endTag = jobConf.get(END_TAG_KEY).getBytes("utf-8");

            System.out.println("XmlInputFormat: start tag: " + startTag);
            System.out.println("XmlInputFormat: end tag  : " + endTag);

            // open the file and seek to the start of the split
            start = split.getStart();
            end = start + split.getLength();
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(jobConf);
            fsin = fs.open(split.getPath());
            fsin.seek(start);
        }
        ...
```
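One thing worth noting about the code above: it mixes the two MapReduce APIs. XmlInputFormat extends the new-API (org.apache.hadoop.mapreduce.lib.input) TextInputFormat, but XmlRecordReader implements the old org.apache.hadoop.mapred.RecordReader interface, and the new-API TaskAttemptContext has no getJobConf() method. A minimal sketch of what a new-API-only skeleton might look like, with the actual tag-scanning logic elided (this is illustrative, not the asker's working reader):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class XmlInputFormat extends TextInputFormat {

    public static final String START_TAG_KEY = "xmlinput.start";
    public static final String END_TAG_KEY = "xmlinput.end";

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new XmlRecordReader();
    }

    // In the new API, the reader extends the abstract mapreduce.RecordReader
    // class instead of implementing the old mapred.RecordReader interface.
    public static class XmlRecordReader extends RecordReader<LongWritable, Text> {

        private byte[] startTag;
        private byte[] endTag;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            // The new API hands the job configuration to the reader here;
            // there is no JobConf anywhere in this code path.
            Configuration conf = context.getConfiguration();
            startTag = conf.get(START_TAG_KEY).getBytes("utf-8");
            endTag = conf.get(END_TAG_KEY).getBytes("utf-8");
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            return false; // tag-scanning logic elided
        }

        @Override
        public LongWritable getCurrentKey() {
            return null;
        }

        @Override
        public Text getCurrentValue() {
            return null;
        }

        @Override
        public float getProgress() {
            return 0.0f;
        }

        @Override
        public void close() throws IOException {
        }
    }
}
```

The driver's Job and setInputFormatClass calls are new-API, so the reader has to follow that API as well.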
If XmlInputFormat is not part of the same JAR that contains main(), you'll need to either build it into a "subfolder" called lib inside the main JAR, or create a bootstrap action that copies the JAR containing XmlInputFormat from S3 into the magic folder /home/hadoop/lib, which is part of the Hadoop classpath by default on EMR.
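For the first option, the job JAR layout would be something like this (names here are hypothetical); when Hadoop unpacks the job JAR, any JARs found under its lib/ directory are added to the task classpath:

```
processdocs.jar
├── com/xyz/Main.class      (job driver)
└── lib/
    └── xmlinput.jar        (contains com.xyz.XmlInputFormat)
```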
It is not assuming FileInputFormat, which is abstract.
Based on your edits, I think the premise of the question is wrong. I suspect the input format is indeed being found and used. The System.out.println calls from a task attempt do not end up in the syslog of the job, although they might appear in the stdout digest.
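If the point of those System.out.println calls is just to confirm the reader is created, routing them through a logger gets them into the task attempt's syslog instead. A small sketch using commons-logging, which Hadoop itself uses (the class name mirrors the question's code, and the super call stands in for returning the real reader):

```java
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class XmlInputFormat extends TextInputFormat {

    // LOG.info(...) lands in the task attempt's syslog; System.out goes to
    // the attempt's stdout file, which is collected separately.
    private static final Log LOG = LogFactory.getLog(XmlInputFormat.class);

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        LOG.info("Creating a new 'XmlRecordReader'");
        return super.createRecordReader(split, context); // placeholder: real code returns the XML reader
    }
}
```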