How to extend Lucene's StandardAnalyzer for custom special character treatment?


I'm using Lucene's StandardAnalyzer for a specific index property.
As special characters like àéèäöü do not get indexed as expected, I want to replace these characters:
à -> a
é -> e
è -> e
ä -> ae
ö -> oe
ü -> ue
What is the best approach to extend the org.apache.lucene.analysis.standard.StandardAnalyzer class?
I was looking for a way to have the standard parser iterate over all tokens (words), so that I can retrieve them word by word and do the replacement there.
Thanks for any hints.
I would suggest using MappingCharFilter, which lets you define a map of strings to be replaced by other strings, so it should fit your requirements well.
Some additional info - https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
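To make the mapping idea concrete, here is a minimal plain-Java sketch of the replacement table from the question, applied character by character before any tokenization. It does not use Lucene, and the names UmlautMapping and fold are made up for illustration. In Lucene, the same pairs would be registered through NormalizeCharMap.Builder and the built map handed to a MappingCharFilter that wraps the reader, for example in an overridden Analyzer.initReader; treat that wiring as an outline to check against the Javadoc of your Lucene version.

```java
import java.util.HashMap;
import java.util.Map;

class UmlautMapping {
    // Replacement table from the question, written with Unicode escapes
    // so the source compiles regardless of file encoding.
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u00E0', "a");  // a with grave
        REPLACEMENTS.put('\u00E9', "e");  // e with acute
        REPLACEMENTS.put('\u00E8', "e");  // e with grave
        REPLACEMENTS.put('\u00E4', "ae"); // a with diaeresis
        REPLACEMENTS.put('\u00F6', "oe"); // o with diaeresis
        REPLACEMENTS.put('\u00FC', "ue"); // u with diaeresis
    }

    // Replaces every mapped character and copies all others unchanged.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("M\u00FCller")); // prints Mueller
    }
}
```

Uppercase variants are not covered by this table; they would need their own entries, unless the replacement runs after lowercasing.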
You can't extend StandardAnalyzer, since Lucene's analyzer implementations are final. The core of an analyzer is its createComponents method, which you would have to override in any case, so extending it would gain you little.
Instead, you can copy the StandardAnalyzer source and modify the createComponents method. For what you are asking for, I recommend adding ASCIIFoldingFilter, which attempts to convert Unicode characters outside the basic ASCII range (such as accented letters) into ASCII equivalents. Note that it folds ä, ö and ü to a, o and u rather than ae, oe and ue. You could create an analyzer something like this:
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        final StandardTokenizer src = new StandardTokenizer();
        src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        TokenStream tok = new StandardFilter(src);
        tok = new LowerCaseFilter(tok);
        // Adding it before the StopFilter would probably be most helpful.
        tok = new ASCIIFoldingFilter(tok);
        tok = new StopFilter(tok, StandardAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) {
                src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
                super.setReader(reader);
            }
        };
    }

    // Apply the same folding to query-time terms so they match the indexed form.
    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        TokenStream result = new StandardFilter(in);
        result = new LowerCaseFilter(result);
        result = new ASCIIFoldingFilter(result);
        return result;
    }
};
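For intuition about what accent folding does to the characters from the question, here is a rough plain-Java approximation using java.text.Normalizer. This is not how ASCIIFoldingFilter is implemented, and the names AccentFoldingSketch and stripAccents are hypothetical. It only illustrates that folding reduces an accented letter to its base letter.

```java
import java.text.Normalizer;

class AccentFoldingSketch {
    // Decomposes each character into base letter plus combining marks (NFD),
    // then drops the combining marks.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("M\u00FCller")); // prints Muller
    }
}
```

With this kind of folding, "Müller" becomes "Muller", not "Mueller", so the ae/oe/ue spellings asked for in the question need an explicit mapping.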
