Solr - textual content without metadata from Tika via SolrCell
Using Solr 3.6 and the ExtractingRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field, minus the metadata? The "content" field produced by Tika unfortunately contains the metadata munged in with the text content of the document.
I would like to provide snippet highlighting of the content, and the subject metadata within the content field is skewing the highlight results.
Update: a screenshot of the Tika output as indexed by Solr showed the problem. The highlighted portion was the block of metadata that gets prepended to the block of text from the PDF content.

The ExtractingRequestHandler in solrconfig.xml:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

The schema.xml fields. Note that "content" receives Tika's content output directly. The "page" and "collection" fields are set as literal values when the doc is posted to the handler.

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="collection" type="text_general" indexed="true" stored="true"/>
<field name="page" type="tint" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
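
For context, the highlighting side of this setup could look roughly like the sketch below. It is not taken from the original post: the handler name "/select-highlight" and the snippet counts and sizes are assumptions chosen for illustration. The parameters themselves (df, hl, hl.fl, hl.snippets, hl.fragsize) are standard Solr query and highlighting parameters. Because snippets are cut from the stored value of "content", any metadata prepended to that field ends up in the highlighted fragments.

```xml
<!-- Sketch only: handler name and numeric values are hypothetical. -->
<requestHandler name="/select-highlight" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Search the Tika-populated field by default -->
    <str name="df">content</str>
    <!-- Enable highlighting and build snippets from the "content" field -->
    <str name="hl">true</str>
    <str name="hl.fl">content</str>
    <!-- Up to 3 snippets of roughly 150 characters each -->
    <int name="hl.snippets">3</int>
    <int name="hl.fragsize">150</int>
  </lst>
</requestHandler>
```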
As the other answers are irrelevant, I'll post mine:
I have experienced the same problem the OP describes (Solr 4.3.0, custom config, custom schema, etc. I'm not a newbie, and I understand Solr internals pretty well).
This was my ERH config:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>
    <str name="fmap.content">text</str>
    <str name="captureAttr">false</str>
    <str name="lowernames">true</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>

It is configured to ignore everything except the content (I believe that is reasonable for many people).
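
A handler configured this way relies on matching declarations in schema.xml: a target field for fmap.content and a catch-all for everything prefixed with ignored_. My actual schema is not shown here, so the following is only a minimal sketch, assuming definitions modeled on the stock Solr example schema (the "ignored" type and the text_general type on "text" are assumptions).

```xml
<!-- Sketch only: names and types are assumptions, not the author's actual schema. -->

<!-- A type that neither indexes nor stores, so mapped values are silently dropped -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>

<!-- Target of fmap.content: receives the extracted body text -->
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

<!-- Catches every field that uprefix, fmap.a or fmap.div renames to ignored_* -->
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```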
After careful investigation I found out that
<str name="captureAttr">false</str> was the thing that caused the OP's issue. By default it was turned on, but I had turned it off because I did not need it anyway. And that was my mistake. I have no idea why, but it causes Solr to put the extracted attributes into the fmap.content field together with the extracted text.
So the solution is to turn it back on. The final ERH:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>
    <str name="fmap.content">text</str>
    <str name="captureAttr">true</str>
    <str name="lowernames">true</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>

Now only the extracted text is put into the fmap.content field.
Unfortunately, I have not found any piece of documentation that explains this. It is either a bug or just stupid behavior.