java


How to extend Lucene's StandardAnalyzer for custom special character treatment?


I'm using Lucene's StandardAnalyzer for a specific index property.
As special characters like àéèäöü do not get indexed as expected, I want to replace these characters:
à -> a
é -> e
è -> e
ä -> ae
ö -> oe
ü -> ue
What is the best approach to extend the org.apache.lucene.analysis.standard.StandardAnalyzer class?
I was looking for a way where the standard parser iterates over all tokens (words) and I can retrieve word by word and do the magic there.
Thanks for any hints.
I would propose using MappingCharFilter, which lets you define a map of strings that will be replaced by other strings, so it fits your requirements exactly.
Some additional info: https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
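A minimal sketch of how that could look, assuming Lucene 5 or later (where StandardTokenizer has a no-argument constructor) and the lucene-analyzers-common module on the classpath. The class name UmlautMappingSketch and its helper methods are made up for illustration; NormalizeCharMap.Builder, MappingCharFilter and Analyzer.initReader are the Lucene pieces doing the work. The mapping runs on the raw character stream, before the tokenizer sees the text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical helper class, not part of Lucene.
public class UmlautMappingSketch {

    // Replacement table from the question; extend it (e.g. with upper-case forms) as needed.
    static NormalizeCharMap buildMap() {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("\u00e0", "a");  // à -> a
        builder.add("\u00e9", "e");  // é -> e
        builder.add("\u00e8", "e");  // è -> e
        builder.add("\u00e4", "ae"); // ä -> ae
        builder.add("\u00f6", "oe"); // ö -> oe
        builder.add("\u00fc", "ue"); // ü -> ue
        return builder.build();
    }

    // Runs raw text through the char filter alone, to show what the tokenizer would receive.
    static String applyMapping(String text) {
        try (Reader filtered = new MappingCharFilter(buildMap(), new StringReader(text))) {
            StringBuilder out = new StringBuilder();
            char[] buffer = new char[256];
            int read;
            while ((read = filtered.read(buffer, 0, buffer.length)) != -1) {
                out.append(buffer, 0, read);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sketch of an analyzer that applies the mapping in initReader, before tokenization.
    static Analyzer newAnalyzer() {
        final NormalizeCharMap map = buildMap();
        return new Analyzer() {
            @Override
            protected Reader initReader(String fieldName, Reader reader) {
                return new MappingCharFilter(map, reader);
            }

            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                return new TokenStreamComponents(new StandardTokenizer());
            }
        };
    }
}
```

Because the table is yours, multi-character replacements such as ü -> ue are possible here; the same analyzer should be used at index time and at query time so both sides see the same mapped text.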
You can't extend StandardAnalyzer, since analyzer implementations are final. The meat of an analyzer implementation is the createComponents method, and you would have to override that anyway, so you wouldn't gain much from extending it.
Instead, you can copy the StandardAnalyzer source and modify the createComponents method. For what you are asking for, I recommend adding ASCIIFoldingFilter, which will attempt to convert Unicode characters (such as accented letters) into ASCII equivalents. Note that it folds ä, ö and ü to a, o and u rather than to ae, oe and ue. So you could create an analyzer something like this:
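// Imports assume Lucene 7.x; package locations differ in other versions.
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        final StandardTokenizer src = new StandardTokenizer();
        src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        TokenStream tok = new StandardFilter(src);
        tok = new LowerCaseFilter(tok);
        tok = new ASCIIFoldingFilter(tok); /* Adding it before the StopFilter would probably be most helpful. */
        tok = new StopFilter(tok, StandardAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) {
                src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
                super.setReader(reader);
            }
        };
    }

    // Apply the same folding to query terms that bypass tokenization (e.g. wildcard queries).
    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        TokenStream result = new StandardFilter(in);
        result = new LowerCaseFilter(result);
        result = new ASCIIFoldingFilter(result);
        return result;
    }
};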
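To check what an analyzer actually emits word by word, which is what the question asks about, the token stream can be consumed directly. The following is a sketch under assumptions: it targets Lucene 5 or later, and the class TokenDumpSketch and its methods are hypothetical names. To stay independent of version-specific package moves, its demo analyzer uses only StandardTokenizer and ASCIIFoldingFilter, not the full chain above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, not part of Lucene.
public class TokenDumpSketch {

    // Reduced demo analyzer: standard tokenization followed by ASCII folding only.
    static Analyzer foldingAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer src = new StandardTokenizer();
                TokenStream tok = new ASCIIFoldingFilter(src);
                return new TokenStreamComponents(src, tok);
            }
        };
    }

    // Collects the terms an analyzer produces for the given text.
    static List<String> tokens(Analyzer analyzer, String text) {
        List<String> result = new ArrayList<>();
        // The field name is arbitrary here; this analyzer ignores it.
        try (TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                  // required before the first incrementToken()
            while (stream.incrementToken()) {
                result.add(term.toString()); // copy: the attribute is reused per token
            }
            stream.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

With this reduced analyzer, "Müller café" should come out as the tokens Muller and cafe, which shows the folding to u rather than ue. If the ae/oe/ue forms are required, a character mapping applied before tokenization is the better fit.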
