java - CSS styles and <ul>/<li> tags being ignored while parsing with Apache Tika

July 15, 2012

While parsing a PDF or Word document using AutoDetectParser, the "li" and "ul" tags get converted to "p" tags. I need the exact HTML content that is in the PDF or Word document.

I tried several approaches, shown below:

    // Attempt 1: ToHTMLContentHandler with an IdentityHtmlMapper in the ParseContext
    ToHTMLContentHandler textHandler = new ToHTMLContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new IdentityHtmlMapper());
    parser.parse(in, textHandler, metadata, context);

    // Attempt 2: serialize the SAX events through a JAXP TransformerHandler
    SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
    handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
    handler.setResult(new StreamResult(writer));
    System.out.println(handler.toString());
    return handler;

But the "li" tags are still replaced by "p" tags, and the class attributes and CSS styles do not appear in the parsed HTML output.

Any help would be appreciated.
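
One way to narrow down where the list tags are lost is to test the serializer on its own. The sketch below does not use Tika at all: it feeds hand-written SAX events for a "ul" element with a class attribute and one "li" child into a TransformerHandler configured like the one in the second attempt, then reads the result from the writer. The class name SerializerCheck and the method serializeList are hypothetical names made up for this illustration. If the list markup survives here, that would suggest the conversion to "p" happens earlier, before the events reach the handler, though this check alone cannot confirm where.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper class for illustration; not part of Tika or the original post.
class SerializerCheck {

    // Sends <ul class=cssClass><li>itemText</li></ul> as SAX events through a
    // TransformerHandler set to the "html" output method and returns the serialized text.
    static String serializeList(String cssClass, String itemText) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");

        // The serialized markup is written here, so this is what must be read back.
        StringWriter writer = new StringWriter();
        handler.setResult(new StreamResult(writer));

        AttributesImpl ulAttrs = new AttributesImpl();
        ulAttrs.addAttribute("", "class", "class", "CDATA", cssClass);

        handler.startDocument();
        handler.startElement("", "ul", "ul", ulAttrs);
        handler.startElement("", "li", "li", new AttributesImpl());
        char[] text = itemText.toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement("", "li", "li");
        handler.endElement("", "ul", "ul");
        handler.endDocument();

        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeList("menu", "first item"));
    }
}
```

Note that the sketch reads the output from the StringWriter passed to StreamResult rather than from the handler object itself.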
