java


Removing characters of a specific Unicode range from a string


I have a program that parses tweets in real time from the Twitter streaming API. Before storing them, I encode them as UTF-8. Certain characters end up appearing in the string as ?, ??, or ??? instead of their respective Unicode code points, and they cause problems. Upon further investigation, I found that the problematic characters are from the "Emoticons" block, U+1F600 - U+1F64F, and the "Miscellaneous Symbols And Pictographs" block, U+1F300 - U+1F5FF. I tried removing them, but was unsuccessful: the matcher ended up replacing almost every character in the string, not just those in my desired Unicode range.
String utf8tweet = "";
try {
byte[] utf8Bytes = status.getText().getBytes("UTF-8");
utf8tweet = new String(utf8Bytes, "UTF-8");
}
catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
Pattern unicodeOutliers = Pattern.compile("[\\u1f300-\\u1f64f]", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
What can I do to remove these characters?
Use a negated character class (the ^ operator) in the regex pattern. To keep only ASCII characters, you can use the expression [^\\x00-\\x7F], which matches every character outside the ASCII range; replacing its matches should give the desired result.
import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class UTF8 {
public static void main(String[] args) {
String utf8tweet = "";
try {
byte[] utf8Bytes = "#Hello twitter  How are you?".getBytes("UTF-8");
utf8tweet = new String(utf8Bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
Pattern unicodeOutliers = Pattern.compile("[^\\x00-\\x7F]",
Pattern.UNICODE_CASE | Pattern.CANON_EQ
| Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
System.out.println("Before: " + utf8tweet);
utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
System.out.println("After: " + utf8tweet);
}
}
Results in the following output:
Before: #Hello twitter  How are you?
After: #Hello twitter How are you?
EDIT
To explain further, you could also express the range in the \u form, as [^\\u0000-\\u007F], which matches every character that is not one of the first 128 Unicode characters (the same as before). If you want to extend the range to keep extra characters, you can do so by consulting the Unicode character list.
For example, if you want to keep accented vowels (used in Spanish), extend the range to \u00FF, which gives [^\\u0000-\\u00FF] or [^\\x00-\\xFF]:
Before: #Hello twitter  How are you? á é í ó ú
After: #Hello twitter How are you? á é í ó ú
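As a minimal sketch of the extended range described above, the class below removes every code point above U+00FF. The class name Latin1Filter and the method keepLatin1 are hypothetical names chosen for illustration, and the sample string is made up; matches are replaced with an empty string here rather than a space.

```java
import java.util.regex.Pattern;

class Latin1Filter {
    // Matches any code point above U+00FF, i.e. anything outside
    // Basic Latin and Latin-1 Supplement.
    private static final Pattern OUTSIDE_LATIN1 =
            Pattern.compile("[^\\u0000-\\u00FF]");

    // Hypothetical helper: strips everything the pattern matches.
    static String keepLatin1(String input) {
        return OUTSIDE_LATIN1.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Accented letters survive; the emoji (U+1F600) is removed.
        String tweet = "ma\u00F1ana \uD83D\uDE00 \u00E1 \u00E9";
        System.out.println(keepLatin1(tweet));
    }
}
```

Note that this also drops any other script above U+00FF (Greek, Cyrillic, CJK and so on), so it only suits data that is expected to be Western European text.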
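First of all, the Unicode block concerned is defined in Java (strictly following the standard) as Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS. In a regex, the general category So (Symbol, other), which covers most of these characters along with other symbols, can be used: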
s = s.replaceAll("\\p{So}+", "");
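If only the two blocks named in the question should go, rather than every symbol in category So, one option is to test each code point against Character.UnicodeBlock directly. This is a sketch, not the answerer's method; the class name BlockFilter and the method removeBlocks are hypothetical.

```java
import java.lang.Character.UnicodeBlock;

class BlockFilter {
    // Hypothetical helper: copies every code point except those in the
    // Emoticons and Miscellaneous Symbols And Pictographs blocks.
    static String removeBlocks(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Read a full code point so surrogate pairs are handled as one unit.
            int cp = input.codePointAt(i);
            UnicodeBlock block = UnicodeBlock.of(cp);
            if (block != UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS
                    && block != UnicodeBlock.EMOTICONS) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F30D and U+1F600 are removed; U+2603 (Miscellaneous Symbols) is kept.
        System.out.println(removeBlocks("a\uD83C\uDF0Db\uD83D\uDE00c \u2603"));
    }
}
```

Both block constants are available since Java 7.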
I tried this. The Unicode ranges are taken from the emoji ranges:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Assumption: Strings is Guava's com.google.common.base.Strings.
import com.google.common.base.Strings;

class EmojiEraser {
    // Supplementary ranges are written as surrogate pairs; the last two ranges are in the BMP.
    private static final String EMOJI_RANGE_REGEX =
        "[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]|[\u2700-\u27BF]";
    private static final Pattern PATTERN = Pattern.compile(EMOJI_RANGE_REGEX);

    /**
     * Finds and removes emojis from the input.
     *
     * @param input the input string potentially containing emojis
     * @return the input string with emojis removed
     */
    public String eraseEmojis(String input) {
        if (Strings.isNullOrEmpty(input)) {
            return input;
        }
        Matcher matcher = PATTERN.matcher(input);
        StringBuffer sb = new StringBuffer();
        // Replace each match with the empty string, then copy the remainder.
        while (matcher.find()) {
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        return sb.toString();
    }
}
Assuming status.getText() returns a java.lang.String...
byte[] utf8Bytes = status.getText().getBytes("UTF-8");
utf8tweet = new String(utf8Bytes, "UTF-8");
The above transcoding operation produces the same results as:
utf8tweet = status.getText();
Java strings are implicitly UTF-16. UTF-16 and UTF-8 share the same character set (Unicode) so transforming from one to the other and back results in the original data.
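A small sketch of that round trip, assuming the string is well-formed (no unpaired surrogates). The class name RoundTrip and the method viaUtf8 are hypothetical; StandardCharsets is used instead of the "UTF-8" charset name so no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Hypothetical helper: encodes to UTF-8 bytes and decodes them again.
    static String viaUtf8(String s) {
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1F600 is two chars in UTF-16 and four bytes in UTF-8.
        String text = "Hello \uD83D\uDE00";
        System.out.println(text.equals(viaUtf8(text))); // prints "true"
    }
}
```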
Java regular expressions support the supplementary range using surrogate pairs. You can match them as described in the answers to this question.
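As a sketch of one way to match the supplementary range: since Java 7, Pattern accepts the \x{h...h} form for code points above U+FFFF. The \u escape takes exactly four hex digits, so a pattern such as "[\\u1f300-\\u1f64f]" is likely read as \u1f30, then a range from '0' to \u1f64, then 'f', which would explain why almost every character was replaced. The class name SupplementaryRange and the method strip are hypothetical.

```java
import java.util.regex.Pattern;

class SupplementaryRange {
    // U+1F300..U+1F64F spans Miscellaneous Symbols And Pictographs
    // and Emoticons, the two blocks named in the question.
    private static final Pattern EMOJI =
            Pattern.compile("[\\x{1F300}-\\x{1F64F}]");

    // Hypothetical helper: removes every code point in the range.
    static String strip(String input) {
        return EMOJI.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The emoji is removed; the spaces on either side remain.
        System.out.println(strip("Hi \uD83D\uDE00 there"));
    }
}
```

No flags are passed to Pattern.compile here, since case folding and canonical equivalence are not needed to match a fixed code point range.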
As eee notes in his comment, you most likely have a font issue. Whether a grapheme can be displayed usually depends on the fonts available on the user's system, the chosen font and what form of font substitution the rendering technology supports.
