How to extend Lucene's StandardAnalyzer for custom special character treatment?


I'm using Lucene's StandardAnalyzer for a specific index property.
As special characters like àéèäöü do not get indexed as expected, I want to replace these characters:
à -> a
é -> e
è -> e
ä -> ae
ö -> oe
ü -> ue
What is the best approach to extend the org.apache.lucene.analysis.standard.StandardAnalyzer class?
I was looking for a way to have the standard parser iterate over all tokens (words), so that I can retrieve them word by word and do the magic there.
Thanks for any hints.
I would propose using MappingCharFilter, which lets you define a map of Strings that will be replaced by other Strings, so it fits your requirements exactly.
Some additional info: https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
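To make the target mapping concrete without depending on a particular Lucene version, here is a minimal plain-JDK sketch of the replacement table from the question. It is not Lucene code: the class name CharReplacementSketch and the method replace are hypothetical, and the sketch only shows the string-level result such a mapping would be expected to produce.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: applies the question's replacement table to a string. */
final class CharReplacementSketch {

    // Same pairs as in the question. The keys are written as escapes so the
    // source compiles regardless of the file encoding.
    private static final Map<Character, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put('\u00e0', "a");  // a with grave
        REPLACEMENTS.put('\u00e9', "e");  // e with acute
        REPLACEMENTS.put('\u00e8', "e");  // e with grave
        REPLACEMENTS.put('\u00e4', "ae"); // a with diaeresis
        REPLACEMENTS.put('\u00f6', "oe"); // o with diaeresis
        REPLACEMENTS.put('\u00fc', "ue"); // u with diaeresis
    }

    /** Replaces every mapped character; all other characters pass through unchanged. */
    static String replace(String input) {
        StringBuilder out = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String mapped = REPLACEMENTS.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "Mueller cafe"
        System.out.println(replace("M\u00fcller caf\u00e9"));
    }
}
```

In Lucene itself, the same pairs would be registered with NormalizeCharMap.Builder.add(...), and the built map passed to MappingCharFilter together with the Reader, typically from an overridden Analyzer.initReader method. The filter then runs before tokenization, so the tokenizer only ever sees the replaced text.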
You can't extend StandardAnalyzer, since analyzer implementations are final. The meat of an analyzer implementation is the createComponents method, which you would have to override in any case, so subclassing would gain you little.
Instead, you can copy the StandardAnalyzer source and modify the createComponents method. For what you are asking for, I recommend adding ASCIIFoldingFilter, which attempts to convert Unicode characters (such as accented letters) into their ASCII equivalents. You could create an analyzer something like this:
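// Package names as of Lucene 6.x; several of these classes moved in later releases.
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        final StandardTokenizer src = new StandardTokenizer();
        src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        TokenStream tok = new StandardFilter(src);
        tok = new LowerCaseFilter(tok);
        // Adding it before the StopFilter would probably be most helpful.
        tok = new ASCIIFoldingFilter(tok);
        tok = new StopFilter(tok, StandardAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) {
                // Re-apply the token length limit each time the components are reused.
                src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
                super.setReader(reader);
            }
        };
    }

    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        // Apply the same folding to query-time terms so they match the indexed form.
        TokenStream result = new StandardFilter(in);
        result = new LowerCaseFilter(result);
        result = new ASCIIFoldingFilter(result);
        return result;
    }
};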
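One caveat when choosing between the two suggestions: ASCIIFoldingFilter folds ä, ö and ü to a, o and u, not to the ae, oe and ue asked for in the question. The sketch below approximates that folding for the question's characters using only the JDK's java.text.Normalizer. It is an illustration under that assumption, not Lucene's implementation, and the class name DiacriticFoldingSketch is hypothetical.

```java
import java.text.Normalizer;

/** Hypothetical helper: strips diacritics, roughly what ASCII folding does to accented Latin letters. */
final class DiacriticFoldingSketch {

    /** Decomposes each character into base letter plus combining marks, then drops the marks. */
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // a/o/u with diaeresis: prints "aou", not "aeoeue"
        System.out.println(stripDiacritics("\u00e4\u00f6\u00fc"));
        // a with grave, e with acute, e with grave: prints "aee"
        System.out.println(stripDiacritics("\u00e0\u00e9\u00e8"));
    }
}
```

If the German-style ae, oe and ue forms matter for matching, an explicit mapping such as the MappingCharFilter approach is the closer fit. If plain accent removal is enough, the folding filter is less work.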
