Removing characters of a specific Unicode range from a string


I have a program that parses tweets in real time from the Twitter streaming API. Before storing them, I encode them as UTF-8. Certain characters end up appearing in the string as ?, ??, or ??? instead of their respective Unicode code points, and they cause problems. Upon further investigation, I found that the problematic characters are from the "Emoticons" block, U+1F600 - U+1F64F, and the "Miscellaneous Symbols And Pictographs" block, U+1F300 - U+1F5FF. I tried removing them, but was unsuccessful: the matcher ended up replacing almost every character in the string, not just my desired Unicode range.
String utf8tweet = "";
try {
    byte[] utf8Bytes = status.getText().getBytes("UTF-8");
    utf8tweet = new String(utf8Bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
Pattern unicodeOutliers = Pattern.compile("[\\u1f300-\\u1f64f]",
        Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
What can I do to remove these characters?
Use a negated character class (the ^ operator) in the regex pattern. To keep only ASCII characters you could use the expression [^\\x00-\\x7F], which matches every character outside the first 128 code points, and you should get the desired result.
import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UTF8 {
    public static void main(String[] args) {
        String utf8tweet = "";
        try {
            byte[] utf8Bytes = "#Hello twitter  How are you?".getBytes("UTF-8");
            utf8tweet = new String(utf8Bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        Pattern unicodeOutliers = Pattern.compile("[^\\x00-\\x7F]",
                Pattern.UNICODE_CASE | Pattern.CANON_EQ
                        | Pattern.CASE_INSENSITIVE);
        Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
        System.out.println("Before: " + utf8tweet);
        utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
        System.out.println("After: " + utf8tweet);
    }
}
Results in the following output:
Before: #Hello twitter  How are you?
After: #Hello twitter How are you?
EDIT
To explain further, you could also express the range with the \u form as [^\\u0000-\\u007F], which matches every character that is not one of the first 128 Unicode characters (the same as before). If you want to extend the range to keep extra characters, you can do so by consulting the Unicode character list.
For example, if you want to keep accented vowels (as used in Spanish), extend the range to \u00FF, which gives [^\\u0000-\\u00FF] or [^\\x00-\\xFF]:
Before: #Hello twitter  How are you? á é í ó ú
After: #Hello twitter How are you? á é í ó ú
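A likely explanation for why the pattern in the question replaced almost every character: in a Java regex, \u consumes exactly four hex digits, so [\\u1f300-\\u1f64f] is read as U+1F30, the range from '0' to U+1F64, and the letter f. The sketch below contrasts that with the \x{...} code point escape (available from Java 7 on), which targets only U+1F300 to U+1F64F and leaves accented letters alone. The class, field, and method names are illustrative and not taken from the posts above.

```java
import java.util.regex.Pattern;

public class EmojiRangeDemo {
    // The pattern from the question. \u reads only four hex digits, so this
    // class contains U+1F30, the range '0'..U+1F64, and the literal 'f'.
    static final Pattern BROKEN = Pattern.compile("[\\u1f300-\\u1f64f]");

    // \x{...} accepts a full code point, so this class covers
    // U+1F300..U+1F64F and nothing else.
    static final Pattern EMOJI = Pattern.compile("[\\x{1F300}-\\x{1F64F}]");

    // Hypothetical helper: removes characters in the two blocks from the question.
    static String stripEmoji(String s) {
        return EMOJI.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // "café", a space, U+1F600 (as a surrogate pair), a space, "ok"
        String tweet = "caf\u00E9 \uD83D\uDE00 ok";
        // Letters fall inside '0'..U+1F64 and are replaced; the emoji is not.
        System.out.println(BROKEN.matcher(tweet).replaceAll("?"));
        // Only the emoji is removed; the accented letter survives.
        System.out.println(stripEmoji(tweet));
    }
}
```

Compared with the negated ASCII or Latin-1 classes, this keeps every script intact and removes only the two blocks that caused trouble.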
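First of all, the Unicode block concerned is defined in Java (strictly following the standard) as Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS. In a regex you can remove these characters through the general category So (Symbol, other), which includes the pictographs but also other symbols such as © and ®: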
s = s.replaceAll("\\p{So}+", "");
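If removing every "other symbol" is too broad, the blocks themselves can be named in the pattern. The following is a minimal sketch, assuming Java 7 or later (where the EMOTICONS and MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS block constants exist); the class and field names are made up for illustration.

```java
import java.util.regex.Pattern;

public class BlockVsCategory {
    // General category "Symbol, other": emoji, but also ©, ®, ° and more.
    static final Pattern OTHER_SYMBOLS = Pattern.compile("\\p{So}+");

    // Only the two blocks named in the question. The In prefix selects a
    // block by any name that Character.UnicodeBlock.forName accepts.
    static final Pattern EMOJI_BLOCKS = Pattern.compile(
            "[\\p{InEMOTICONS}\\p{InMISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS}]+");

    public static void main(String[] args) {
        // ©, some text, U+1F600 and U+1F30D as surrogate pairs, more text
        String s = "\u00A9 2014 \uD83D\uDE00\uD83C\uDF0D done";
        // Removes the copyright sign as well as both emoji.
        System.out.println(OTHER_SYMBOLS.matcher(s).replaceAll(""));
        // Removes both emoji and keeps the copyright sign.
        System.out.println(EMOJI_BLOCKS.matcher(s).replaceAll(""));
    }
}
```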
I tried the following. The Unicode ranges are taken from the emoji ranges.
// Strings.isNullOrEmpty is assumed to be Guava's com.google.common.base.Strings.
import com.google.common.base.Strings;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiEraser {
    // Each alternative is one character class; the \uXXXX pairs are surrogate
    // pairs that the regex engine treats as single supplementary code points.
    private static final String EMOJI_RANGE_REGEX =
            "[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]|[\u2700-\u27BF]";
    private static final Pattern PATTERN = Pattern.compile(EMOJI_RANGE_REGEX);

    /**
     * Finds and removes emojis from {@code input}.
     *
     * @param input the input string, potentially containing emojis
     * @return the input string with emojis removed
     */
    public String eraseEmojis(String input) {
        if (Strings.isNullOrEmpty(input)) {
            return input;
        }
        Matcher matcher = PATTERN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        return sb.toString();
    }
}
Assuming status.getText() returns a java.lang.String...
byte[] utf8Bytes = status.getText().getBytes("UTF-8");
utf8tweet = new String(utf8Bytes, "UTF-8");
The above transcoding operation produces the same results as:
utf8tweet = status.getText();
Java strings are implicitly UTF-16. UTF-16 and UTF-8 share the same character set (Unicode), so transforming a well-formed string from one to the other and back results in the original data.
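The point above can be checked directly. This is a small sketch with an illustrative class and method name; it uses StandardCharsets (Java 7+) instead of the "UTF-8" string so that no checked exception needs handling.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Encodes to UTF-8 bytes and decodes straight back, as the question does.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "tweet ", then U+1F600 written as a surrogate pair
        String original = "tweet \uD83D\uDE00";
        // The round trip leaves a well-formed string unchanged.
        System.out.println(original.equals(roundTrip(original)));
        // The emoji occupies four bytes in UTF-8 but two chars in the String.
        System.out.println(original.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(original.length());
    }
}
```

The encoding only matters at the boundary where bytes are actually written, for example to a file, socket, or database column.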
Java regular expressions support the supplementary range using surrogate pairs, so characters above U+FFFF can be matched in a pattern like any others.
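Supplementary characters can also be handled without a regex by walking the string one code point at a time. The following is a rough sketch, assuming Java 7 or later for the block constants; the class and method names are hypothetical.

```java
public class CodePointFilter {
    // True for code points in the two blocks named in the question.
    // UnicodeBlock.of returns null for unassigned ranges, which == handles.
    static boolean isUnwanted(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.EMOTICONS
                || block == Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS;
    }

    // Copies every code point except the unwanted ones.
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // reads a whole surrogate pair if present
            if (!isUnwanted(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("caf\u00E9 \uD83D\uDE00\uD83C\uDF0D ok"));
    }
}
```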
As eee notes in his comment, you most likely have a font issue. Whether a grapheme can be displayed usually depends on the fonts available on the user's system, the chosen font and what form of font substitution the rendering technology supports.
