How to extend Lucene's StandardAnalyzer for custom special character treatment?


I'm using Lucene's StandardAnalyzer for a specific index property.
As special characters like àéèäöü do not get indexed as expected, I want to replace these characters:
à -> a
é -> e
è -> e
ä -> ae
ö -> oe
ü -> ue
What is the best approach to extend the org.apache.lucene.analysis.standard.StandardAnalyzer class?
I was looking for a way where the standard parser iterates over all tokens (words) and I can retrieve word by word and do the magic there.
Thanks for any hints.
I would suggest using MappingCharFilter, which lets you define a map of strings that will be replaced by other strings, so it should fit your requirements well.
More information: https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
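To make the idea of a string-to-string map concrete, here is a minimal plain-Java sketch of the replacement table from the question. It does not use Lucene at all; the class name UmlautMapping and the method fold are hypothetical names chosen for illustration. In Lucene, the same pairs would typically be registered in a NormalizeCharMap (via NormalizeCharMap.Builder) and handed to MappingCharFilter, rather than applied by hand like this. The characters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.HashMap;
import java.util.Map;

public class UmlautMapping {
    // Replacement table taken from the question (hypothetical helper, not a Lucene API).
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u00e0', "a");  // a with grave
        REPLACEMENTS.put('\u00e9', "e");  // e with acute
        REPLACEMENTS.put('\u00e8', "e");  // e with grave
        REPLACEMENTS.put('\u00e4', "ae"); // a with diaeresis
        REPLACEMENTS.put('\u00f6', "oe"); // o with diaeresis
        REPLACEMENTS.put('\u00fc', "ue"); // u with diaeresis
    }

    // Replaces every mapped character and copies all other characters unchanged.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "M\u00fcller caf\u00e9" is "Muller cafe" with u-diaeresis and e-acute.
        System.out.println(fold("M\u00fcller caf\u00e9")); // prints "Mueller cafe"
    }
}
```

A char filter runs on the raw text before tokenization, so with this approach the tokenizer would already see the replaced characters.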
You wouldn't extend StandardAnalyzer, since analyzer implementations are final. The meat of an analyzer implementation is the createComponents method, and you would have to override that in any case, so you wouldn't gain much from extending it.
Instead, you can copy the StandardAnalyzer source and modify the createComponents method. For what you are asking for, I recommend adding ASCIIFoldingFilter, which will attempt to convert non-ASCII Unicode characters (such as accented letters) into their ASCII equivalents. So you could create an analyzer something like this:
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        final StandardTokenizer src = new StandardTokenizer();
        // An anonymous Analyzer has no maxTokenLength field, so use the default.
        src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        TokenStream tok = new StandardFilter(src);
        tok = new LowerCaseFilter(tok);
        // Adding it before the StopFilter would probably be most helpful.
        tok = new ASCIIFoldingFilter(tok);
        tok = new StopFilter(tok, StandardAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) {
                src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
                super.setReader(reader);
            }
        };
    }

    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        TokenStream result = new StandardFilter(in);
        result = new LowerCaseFilter(result);
        // Apply the same folding in the normalization chain.
        result = new ASCIIFoldingFilter(result);
        return result;
    }
};
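One caveat worth checking against the mapping in the question: ASCIIFoldingFilter folds ä, ö and ü to a, o and u, not to ae, oe and ue. The plain-Java sketch below illustrates that kind of accent stripping. It uses java.text.Normalizer as a stand-in and is not Lucene's implementation; the class name AccentFolding and the method stripAccents are hypothetical names for illustration only.

```java
import java.text.Normalizer;

public class AccentFolding {
    // Decomposes each accented character into base letter + combining mark (NFD),
    // then removes the combining marks. This only approximates accent folding.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "M\u00fcller" is "Muller" written with u-diaeresis.
        System.out.println(stripAccents("M\u00fcller")); // prints "Muller"
    }
}
```

If the two-letter German transliteration matters, a mapping-based approach may be the better fit, possibly combined with folding for the remaining accented letters.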
