Alfresco Solr: enable search of special characters

Alfresco's out-of-the-box search engine, Solr, is configured so that special characters are not searchable by default. For instance, if you search for all documents containing the string "c++", you may get a large number of results (depending on how many documents have been uploaded) that contain the character "c" without the "++" signs in their name or content. For example, with the document content "Fusce dapibus, tellus ac cursus commodo" and the search key "c++", the results are "Fusce", "ac", "cursus" and "commodo". Read on to see how to configure the correct search behavior.

The special characters in Solr are: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ and we want to be able to search for them. First of all, we need to tweak Solr's tokenization settings. Tokenization is the process that parses uploaded or existing documents and extracts tokens from document fields (e.g. content, name) to be added to the search index. By default Alfresco uses the solr.ICUTokenizerFactory tokenizer, which keeps special characters out of the search index. To overcome this, we can replace the current tokenizer with one that preserves special characters, such as org.apache.solr.analysis.WhitespaceTokenizerFactory, which splits text on whitespace only. To do so, open %Alfresco%\solr4\workspace-SpacesStore\conf\schema.xml and update the tokenizer setting of the text___ field type:

<fieldType name="text___" class="solr.TextField" positionIncrementGap="100">
 <analyzer>
    <!--tokenizer class="solr.ICUTokenizerFactory"/-->
    <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" />
 </analyzer>
</fieldType>
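To see why the tokenizer swap matters, here is a minimal Python sketch contrasting the two approaches. It does not call Solr: the regex-based `word_tokens` function is only a rough, assumed stand-in for a word-boundary tokenizer such as the ICU one, and both function names are hypothetical.

```python
import re


def word_tokens(text: str) -> list:
    # Rough stand-in for a word-boundary tokenizer (an assumption, not the
    # real ICU rules): keep runs of letters/digits and drop punctuation.
    return re.findall(r"\w+", text)


def whitespace_tokens(text: str) -> list:
    # Same idea as a whitespace tokenizer: split on whitespace only,
    # so special characters stay inside the token.
    return text.split()


sample = "we use c++ daily"
print(word_tokens(sample))        # the "++" is lost, only "c" remains
print(whitespace_tokens(sample))  # "c++" survives as a single token
```

One trade-off to keep in mind: splitting on whitespace alone also keeps adjacent punctuation attached, so "c++," (with a trailing comma) would be a different token from "c++".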

Here, text___ is a field type that is assigned to some document fields. At this point the text___ field type already uses WhitespaceTokenizerFactory. So if we want, for example, the content field of any document to be searchable including special characters, we need to replace the content field's type with text___ (in the same file):

<dynamicField name="content@s____@*"          type="identifier"        indexed="true" omitNorms="true"   stored="false"  multiValued="false" termPositions="false" />
<dynamicField name="content@s__l_@*"          type="alfrescoFieldType"        indexed="true" omitNorms="true"   stored="false"  multiValued="false" termPositions="false" />
<!-- old setting, commented out: -->
<!-- dynamicField name="content@s__lt@*"          type="alfrescoFieldType" indexed="true" omitNorms="false"  stored="false"  multiValued="false" /-->
<!-- new setting that replaces it: -->
<dynamicField name="content@s__lt@*"          type="text___" indexed="true" omitNorms="false"  stored="false"  multiValued="false" />
<dynamicField name="content@s___t@*"          type="text___"           indexed="true" omitNorms="false"  stored="false"  multiValued="false" />
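Before reindexing, it can help to double-check the edited file. The following is a small sketch, using only Python's standard library, that parses a schema and reports which tokenizer text___ uses and which type the content fields point to. `SAMPLE_SCHEMA` is a trimmed, assumed stand-in for the real schema.xml, and the helper names are hypothetical; in practice you would parse the actual file instead.

```python
import xml.etree.ElementTree as ET

# Trimmed, assumed stand-in for schema.xml (the real file is much larger).
SAMPLE_SCHEMA = """
<schema>
  <fieldType name="text___" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!--tokenizer class="solr.ICUTokenizerFactory"/-->
      <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" />
    </analyzer>
  </fieldType>
  <dynamicField name="content@s__lt@*" type="text___" indexed="true" />
  <dynamicField name="content@s___t@*" type="text___" indexed="true" />
</schema>
"""


def tokenizer_of(root, field_type_name):
    # Return the tokenizer class of the named fieldType, or None if absent.
    for field_type in root.iter("fieldType"):
        if field_type.get("name") == field_type_name:
            tokenizer = field_type.find("./analyzer/tokenizer")
            return tokenizer.get("class") if tokenizer is not None else None
    return None


def dynamic_field_types(root, prefix):
    # Map each dynamicField whose name starts with `prefix` to its type.
    # Commented-out entries are skipped because the parser drops comments.
    return {
        field.get("name"): field.get("type")
        for field in root.iter("dynamicField")
        if field.get("name", "").startswith(prefix)
    }


root = ET.fromstring(SAMPLE_SCHEMA)
print(tokenizer_of(root, "text___"))
print(dynamic_field_types(root, "content@"))
```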

That's it. Once the Solr index has been fully cleared and rebuilt, it will include special characters. Because the analyzer above is declared without a type attribute, Solr applies it at both index and query time, so search queries also keep their special characters when they are tokenized.
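On the query side, several of these characters (for example +) are also operators in the Lucene query syntax, so a client that builds query strings may need to escape them with a backslash before sending the search. Whether this is needed depends on how your queries reach Solr, so treat the following as a sketch under that assumption; `escape_query_term` is a hypothetical helper, not part of Alfresco or Solr.

```python
import re

# Single characters from the special-character list in this post.
# "&&" and "||" are covered because each "&" and "|" is escaped on its own.
_SPECIAL_CHARS = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\])')


def escape_query_term(term: str) -> str:
    # Prefix every special character with a backslash.
    return _SPECIAL_CHARS.sub(r"\\\1", term)


print(escape_query_term("c++"))  # prints c\+\+
```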

Tags: alfresco

