Amazon Hadoop EMR & custom input file format
I'm having a bit of trouble getting Amazon EMR to accept a custom input file format:
```java
public class Main extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new JobConf(), new Main(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        System.out.println("Input path: " + inputPath + "\n");
        System.out.println("Output path: " + outputPath + "\n");

        Configuration conf = getConf();
        Job job = new Job(conf, "ProcessDocs");
        job.setJarByClass(Main.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setInputFormatClass(XmlInputFormat.class);

        TextInputFormat.setInputPaths(job, inputPath);
        TextOutputFormat.setOutputPath(job, outputPath);

        job.waitForCompletion(true);
        return 0;
    }
}
```

Looking at the log file:
```
2012-06-04 23:35:20,053 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: null
2012-06-04 23:35:20,054 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to: 6
2012-06-04 23:35:20,054 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 1
2012-06-04 23:35:20,767 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process: 1
2012-06-04 23:35:20,813 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2012-06-04 23:35:20,886 WARN com.hadoop.compression.lzo.LzoCodec (main): Could not find build properties file with revision hash
2012-06-04 23:35:20,886 INFO com.hadoop.compression.lzo.LzoCodec (main): Loaded & initialized native-lzo library [hadoop-lzo rev unknown]
2012-06-04 23:35:20,906 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library is available
2012-06-04 23:35:20,906 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library loaded
2012-06-04 23:35:22,240 INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201206042333_0001
```

It seems that Hadoop on EMR assumes the default InputFileFormat reader... What am I doing wrong?
**Note:** I get no errors from Hadoop regarding the availability of the XmlInputFormat class.

**Note 2:** I see `<property><name>mapreduce.inputformat.class</name><value>com.xyz.XmlInputFormat</value></property>` in the jobs/some_job_id.conf.xml file.
Update:
```java
public class XmlInputFormat extends TextInputFormat {

    public static final String START_TAG_KEY = "xmlinput.start";
    public static final String END_TAG_KEY = "xmlinput.end";

    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        System.out.println("Creating a new 'XmlRecordReader'");
        return new XmlRecordReader((FileSplit) split, context.getJobConf());
    }

    /*
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit inputSplit, JobConf jobConf, Reporter reporter) throws IOException {
        return new XmlRecordReader((FileSplit) inputSplit, jobConf);
    }
    */

    /**
     * XmlRecordReader class to read through a given xml document and output
     * xml blocks as records, as specified by the start tag and end tag
     */
    public static class XmlRecordReader implements RecordReader<LongWritable, Text> {
        private final byte[] startTag;
        private final byte[] endTag;
        private final long start;
        private final long end;
        private final FSDataInputStream fsin;
        private final DataOutputBuffer buffer = new DataOutputBuffer();

        public XmlRecordReader(FileSplit split, JobConf jobConf) throws IOException {
            startTag = jobConf.get(START_TAG_KEY).getBytes("utf-8");
            endTag = jobConf.get(END_TAG_KEY).getBytes("utf-8");

            System.out.println("XmlInputFormat: start tag: " + startTag);
            System.out.println("XmlInputFormat: end tag  : " + endTag);

            // open the file and seek to the start of the split
            start = split.getStart();
            end = start + split.getLength();
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(jobConf);
            fsin = fs.open(split.getPath());
            fsin.seek(start);
        }
        ...
```
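One thing worth noting about the code above: it mixes the two MapReduce APIs. XmlInputFormat extends the new-API (org.apache.hadoop.mapreduce.lib.input) TextInputFormat, but XmlRecordReader implements the old org.apache.hadoop.mapred.RecordReader interface, and the new-API TaskAttemptContext has no getJobConf() method. A minimal sketch of what a new-API-only skeleton might look like, with the actual tag-scanning logic elided (this is illustrative, not the asker's working reader):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class XmlInputFormat extends TextInputFormat {

    public static final String START_TAG_KEY = "xmlinput.start";
    public static final String END_TAG_KEY = "xmlinput.end";

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new XmlRecordReader();
    }

    // In the new API, the reader extends the abstract mapreduce.RecordReader
    // class instead of implementing the old mapred.RecordReader interface.
    public static class XmlRecordReader extends RecordReader<LongWritable, Text> {

        private byte[] startTag;
        private byte[] endTag;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            // The new API hands the job configuration to the reader here;
            // there is no JobConf anywhere in this code path.
            Configuration conf = context.getConfiguration();
            startTag = conf.get(START_TAG_KEY).getBytes("utf-8");
            endTag = conf.get(END_TAG_KEY).getBytes("utf-8");
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            return false; // tag-scanning logic elided
        }

        @Override
        public LongWritable getCurrentKey() {
            return null;
        }

        @Override
        public Text getCurrentValue() {
            return null;
        }

        @Override
        public float getProgress() {
            return 0.0f;
        }

        @Override
        public void close() throws IOException {
        }
    }
}
```

The driver's Job and setInputFormatClass calls are new-API, so the reader has to follow that API as well.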
If XmlInputFormat is not part of the same JAR that contains main(), you'll need to either build it into a "subfolder" called lib inside the main JAR, or create a bootstrap action that copies the JAR containing XmlInputFormat from S3 into the magic folder /home/hadoop/lib, which is part of the Hadoop classpath by default on EMR.
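For the first option, the job JAR layout would be something like this (names here are hypothetical); when Hadoop unpacks the job JAR, any JARs found under its lib/ directory are added to the task classpath:

```
processdocs.jar
├── com/xyz/Main.class      (job driver)
└── lib/
    └── xmlinput.jar        (contains com.xyz.XmlInputFormat)
```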
It is not assuming FileInputFormat, which is abstract.
Based on your edits, I think the premise of the question is wrong. I suspect the input format is indeed being found and used. The System.out.println calls from a task attempt do not end up in the syslog of the job, although they might appear in the stdout digest.
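If the point of those System.out.println calls is just to confirm the reader is created, routing them through a logger gets them into the task attempt's syslog instead. A small sketch using commons-logging, which Hadoop itself uses (the class name mirrors the question's code, and the super call stands in for returning the real reader):

```java
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class XmlInputFormat extends TextInputFormat {

    // LOG.info(...) lands in the task attempt's syslog; System.out goes to
    // the attempt's stdout file, which is collected separately.
    private static final Log LOG = LogFactory.getLog(XmlInputFormat.class);

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        LOG.info("Creating a new 'XmlRecordReader'");
        return super.createRecordReader(split, context); // placeholder: real code returns the XML reader
    }
}
```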