java - CSS styles and <ul>/<li> tags are ignored when parsing with Apache Tika
When parsing a PDF or Word document with AutoDetectParser, the "li" and "ul" tags are converted to "p" tags. I need the exact HTML content that is present in the PDF or Word document.
I have tried several approaches, including the one below:
    ToHTMLContentHandler textHandler = new ToHTMLContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new IdentityHtmlMapper());
    parser.parse(in, textHandler, metadata, context);

    SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
    handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
    handler.setResult(new StreamResult(writer));
    System.out.println(handler.toString());
    return handler;

But the "li" tags are still replaced with "p" tags, and the class and CSS style attributes do not appear in the parsed HTML output.
Any help is appreciated.
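One detail in the snippet above may be worth checking: the TransformerHandler is created after parser.parse(...) has already run and is never handed any SAX events, so the writer behind it would stay empty. A TransformerHandler is itself a ContentHandler, and it only serializes the events it receives. The sketch below uses only JDK classes to show that behaviour; the class name HandlerSketch, the "bullets" class value, and the hand-fired events are illustrative assumptions standing in for whatever a parser would emit. It does not address whether Tika produces "ul"/"li" events for PDF or Word input in the first place.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class: serializes hand-fired SAX events to HTML text.
public class HandlerSketch {

    static String render() throws Exception {
        StringWriter writer = new StringWriter();

        // Configure the serializer BEFORE any events are produced.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(writer));

        // Stand-in for a parser: fire the SAX events for <ul class="bullets"><li>...</li></ul>.
        // With Tika, the handler would instead be passed as the ContentHandler
        // argument of parser.parse(in, handler, metadata, context).
        AttributesImpl ulAttrs = new AttributesImpl();
        ulAttrs.addAttribute("", "class", "class", "CDATA", "bullets");
        char[] text = "first item".toCharArray();

        handler.startDocument();
        handler.startElement("", "ul", "ul", ulAttrs);
        handler.startElement("", "li", "li", new AttributesImpl());
        handler.characters(text, 0, text.length);
        handler.endElement("", "li", "li");
        handler.endElement("", "ul", "ul");
        handler.endDocument();

        // The serialized markup ends up in the writer, not in handler.toString().
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render());
    }
}
```

The output is read from the writer because handler.toString() only returns the default object description. If the events reaching the handler already contain "p" instead of "li", the substitution happens upstream in the parser and changing the serializer will not bring the list tags back.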