Solr - textual content without metadata from Tika via SolrCell

August 15, 2010

Using Solr 3.6 and the ExtractingRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field, minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata munged in with the text content of the document.

I would like to provide snippet highlighting of the content, and the subject metadata within the content field is skewing the highlight results.

Update: here is a screenshot of the Tika output as indexed by Solr. The highlighted portion is the block of metadata that gets prepended to the block of text from the PDF content.
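For context, the kind of highlighting request described here can be sketched as below. This is only an illustration, not the asker's actual code: the base URL `http://localhost:8983/solr` and the helper name `build_highlight_url` are assumptions, while `hl`, `hl.fl` and `hl.snippets` are standard Solr highlighting parameters.

```python
from urllib.parse import urlencode


def build_highlight_url(solr_base, query, field="content", snippets=3):
    """Build a /select URL that asks Solr for highlighted snippets.

    solr_base is a hypothetical base URL such as http://localhost:8983/solr.
    """
    params = {
        "q": query,
        "wt": "json",
        "hl": "true",            # enable highlighting
        "hl.fl": field,          # field to take snippets from
        "hl.snippets": snippets,  # maximum snippets per document
    }
    return solr_base.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(build_highlight_url("http://localhost:8983/solr", "content:tika"))
```

If the metadata block sits at the start of the "content" field, any query term that also occurs in that metadata can produce snippets from it rather than from the PDF text.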

[Screenshot: Solr view of the Tika output]

The ExtractingRequestHandler in solrconfig.xml:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

The schema.xml fields. Note that "content" receives Tika's content output directly. The "page" and "collection" fields are set to literal values when the doc is posted to the handler.

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="collection" type="text_general" indexed="true" stored="true"/>
<field name="page" type="tint" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
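As a rough sketch of how a document might be posted with those literal values: the handler accepts `literal.<fieldname>` request parameters, so "id", "page" and "collection" can be passed alongside the PDF. The base URL, the sample values and the helper names `build_extract_url` and `post_pdf` are assumptions made for illustration, and sending the PDF as the raw request body is just one way to upload it.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, page, collection, commit=True):
    """Build an /update/extract URL carrying literal field values."""
    params = {
        "literal.id": doc_id,
        "literal.page": page,
        "literal.collection": collection,
    }
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def post_pdf(url, pdf_path):
    """POST the PDF bytes as the request body and return the HTTP status."""
    with open(pdf_path, "rb") as fh:
        body = fh.read()
    req = Request(url, data=body, headers={"Content-Type": "application/pdf"})
    with urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical values; only the URL is built here, nothing is sent.
    print(build_extract_url("http://localhost:8983/solr", "doc1-p3", 3, "manuals"))
```

Calling `post_pdf` needs a running Solr instance with the handler enabled; building the URL does not.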

As the other answers are irrelevant, I'll post mine:

I have experienced the same problem the OP describes (Solr 4.3.0, custom config, custom schema, etc.; I'm not a newbie and understand Solr internals pretty well).

This was my ERH config:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>
    <str name="fmap.content">text</str>
    <str name="captureAttr">false</str>
    <str name="lowernames">true</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>

It is configured to ignore everything except the content (I believe that is reasonable for many people).

After careful investigation I found out that

<str name="captureAttr">false</str>

was the thing that caused the OP's issue. It is turned on by default, but I turned it off because I did not need it anyway. That was my mistake. I have no idea why, but it causes Solr to put the extracted attributes into the fmap.content field together with the extracted text.

So the solution was to turn it back on. The final ERH:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>
    <str name="fmap.content">text</str>
    <str name="captureAttr">true</str>
    <str name="lowernames">true</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>

Now, only the extracted text is put into the fmap.content field.
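One way to check this on your own setup, sketched under some assumptions: parameters sent with a request override the values in the handler's "defaults" list, so the same PDF can be indexed twice, once with `captureAttr=false` and once with `captureAttr=true`, under two different ids. The base URL, the id suffix and the helper name `build_capture_attr_urls` are made up for this example.

```python
from urllib.parse import urlencode


def build_capture_attr_urls(solr_base, doc_id):
    """Return two /update/extract URLs that differ only in captureAttr.

    Each variant gets its own id so both copies stay in the index.
    """
    urls = {}
    for flag in ("false", "true"):
        params = {
            "literal.id": f"{doc_id}-captureattr-{flag}",
            "captureAttr": flag,  # overrides the handler default for this request
            "commit": "true",
        }
        urls[flag] = solr_base.rstrip("/") + "/update/extract?" + urlencode(params)
    return urls


if __name__ == "__main__":
    for flag, url in build_capture_attr_urls("http://localhost:8983/solr", "sample").items():
        print(flag, url)
```

After posting the same file to both URLs, fetching the two documents and comparing the field that fmap.content points to should show whether the metadata block appears in only one of them. This assumes that field is stored so its value can be read back.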

Unfortunately I have not found any piece of documentation that can explain this. It is either a bug or just stupid behavior.
