java


Removing characters in a specific Unicode range from a string


I have a program that parses tweets in real time from the Twitter streaming API. Before storing them, I encode them as UTF-8. Certain characters end up in the string as ?, ??, or ??? instead of their respective Unicode code points, and this causes problems. On further investigation, I found that the problematic characters come from the "Emoticons" block, U+1F600 - U+1F64F, and the "Miscellaneous Symbols and Pictographs" block, U+1F300 - U+1F5FF. I tried removing them, but was unsuccessful: the matcher ended up replacing almost every character in the string, not just those in my desired Unicode range.
String utf8tweet = "";
try {
    byte[] utf8Bytes = status.getText().getBytes("UTF-8");
    utf8tweet = new String(utf8Bytes, "UTF-8");
}
catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
Pattern unicodeOutliers = Pattern.compile("[\\u1f300-\\u1f64f]", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
What can I do to remove these characters?
Use a negated character class (the ^ operator) in the regex pattern. To keep only ASCII characters, you could use the expression [^\\x00-\\x7F], which matches every character outside the range U+0000 - U+007F. Replacing those matches should give the desired result.
import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UTF8 {
    public static void main(String[] args) {
        String utf8tweet = "";
        try {
            byte[] utf8Bytes = "#Hello twitter  How are you?".getBytes("UTF-8");
            utf8tweet = new String(utf8Bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        Pattern unicodeOutliers = Pattern.compile("[^\\x00-\\x7F]",
                Pattern.UNICODE_CASE | Pattern.CANON_EQ
                        | Pattern.CASE_INSENSITIVE);
        Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
        System.out.println("Before: " + utf8tweet);
        utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
        System.out.println("After: " + utf8tweet);
    }
}
Results in the following output:
Before: #Hello twitter  How are you?
After: #Hello twitter How are you?
EDIT
To explain further, you could also express the range with the \u form, as [^\\u0000-\\u007F], which matches every character outside the first 128 Unicode characters (the same as before). If you want to extend the range to keep additional characters, you can look up their code points in the Unicode character list.
For example, to keep accented vowels (used in Spanish), extend the range to \u00FF, which gives [^\\u0000-\\u00FF] or [^\\x00-\\xFF]:
Before: #Hello twitter  How are you? á é í ó ú
After: #Hello twitter How are you? á é í ó ú
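As a minimal sketch of the extended range, the example below writes the emoji as an explicit surrogate-pair escape so it survives copy and paste. The class name NonLatin1Filter and the method strip are made up for illustration. Note that this approach also removes every other character above U+00FF (for instance CJK text), not only emoji.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class NonLatin1Filter {
    // Matches any code point above U+00FF, i.e. everything outside Latin-1.
    private static final Pattern OUTSIDE_LATIN1 = Pattern.compile("[^\\x00-\\xFF]");

    // Replaces each matched code point with a single space.
    static String strip(String text) {
        return OUTSIDE_LATIN1.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // \uD83D\uDE00 is the surrogate pair for U+1F600; \u00E1 and \u00E9 are accented vowels.
        String tweet = "Hola \uD83D\uDE00 \u00E1\u00E9";
        // The emoji becomes one space; the accented vowels are kept.
        System.out.println(strip(tweet));
    }
}
```

Java's regex engine matches by code point, so the surrogate pair is replaced by one space rather than two.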
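First of all, the Unicode block concerned is defined in Java (strictly following the standard) as Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS. In a regex, you can match the general category So ("Symbol, other"), which covers these characters along with other symbols: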
s = s.replaceAll("\\p{So}+", "");
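If only the two blocks named in the question should go, one option is to test each code point against Character.UnicodeBlock directly instead of using a regex. The following is a sketch under that assumption. It needs Java 8 or later for String.codePoints(), and the names BlockFilter, isUnwanted, and strip are hypothetical.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper, not part of the original answer.
class BlockFilter {
    // True if the code point belongs to one of the two blocks from the question.
    static boolean isUnwanted(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint); // may be null for unassigned ranges
        return block == UnicodeBlock.EMOTICONS
                || block == UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS;
    }

    // Rebuilds the string from every code point that is not in an unwanted block.
    static String strip(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> !isUnwanted(cp))
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 (Emoticons) and U+1F300 (Miscellaneous Symbols and Pictographs)
        System.out.println(strip("a\uD83D\uDE00b\uD83C\uDF00c")); // prints "abc"
    }
}
```

Unlike \p{So}, this leaves symbols from other blocks, such as the copyright sign, untouched.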
I tried the following. The Unicode ranges are taken from the emoji ranges.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.common.base.Strings; // Guava, used for isNullOrEmpty

class EmojiEraser {
    private static final String EMOJI_RANGE_REGEX =
            "[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]|[\u2700-\u27BF]";
    private static final Pattern PATTERN = Pattern.compile(EMOJI_RANGE_REGEX);

    /**
     * Finds and removes emojis from the input.
     *
     * @param input the input string potentially containing emojis (comes as a Unicode string)
     * @return the input string with emojis removed
     */
    public String eraseEmojis(String input) {
        if (Strings.isNullOrEmpty(input)) {
            return input;
        }
        Matcher matcher = PATTERN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // Replace each match with the empty string.
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        return sb.toString();
    }
}
Assuming status.getText() returns a java.lang.String...
byte[] utf8Bytes = status.getText().getBytes("UTF-8");
utf8tweet = new String(utf8Bytes, "UTF-8");
The above transcoding operation produces the same results as:
utf8tweet = status.getText();
Java strings are implicitly UTF-16. UTF-16 and UTF-8 encode the same character set (Unicode), so transforming from one to the other and back yields the original data.
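The round trip can be checked with a small sketch like the one below. The class name RoundTrip and the method viaUtf8 are invented for illustration, and the sample text is assumed to be well-formed (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
class RoundTrip {
    // Encodes to UTF-8 bytes and decodes straight back, as the question's code does.
    static String viaUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // An accented letter plus U+1F600 written as a surrogate pair.
        String text = "caf\u00E9 \uD83D\uDE00";
        System.out.println(viaUtf8(text).equals(text)); // prints "true"
        // 3 (caf) + 2 (e with acute) + 1 (space) + 4 (emoji) bytes
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length); // prints "10"
    }
}
```

Using StandardCharsets.UTF_8 instead of the string "UTF-8" also avoids the checked UnsupportedEncodingException.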
Java regular expressions support the supplementary range using surrogate pairs. You can match them as described in the answers to this question.
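As one possible way to write the range from the question, the sketch below uses the \x{h...h} code point escape, which is available from Java 7 on. It also shows, for contrast, why the original pattern matched so much: \u takes exactly four hex digits, so "\\u1f300-\\u1f64f" is read as \u1f30, then the range '0' to \u1f64, then a literal 'f'. The class name SupplementaryRange and the method strip are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class SupplementaryRange {
    // U+1F300..U+1F64F covers both blocks named in the question.
    private static final Pattern EMOJI = Pattern.compile("[\\x{1F300}-\\x{1F64F}]");

    // The pattern from the question; its class effectively contains U+0030..U+1F64.
    static final Pattern MISPARSED = Pattern.compile("[\\u1f300-\\u1f64f]");

    static String strip(String text) {
        return EMOJI.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // \uD83D\uDE00 is the surrogate pair for U+1F600.
        String tweet = "#Hello twitter \uD83D\uDE00 How are you?";
        System.out.println(strip(tweet)); // only the emoji is removed

        // Every letter of "Hello" falls inside U+0030..U+1F64, so all five are replaced.
        System.out.println("[" + MISPARSED.matcher("Hello").replaceAll(" ") + "]"); // prints "[     ]"
    }
}
```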
As eee notes in his comment, you most likely have a font issue. Whether a grapheme can be displayed usually depends on the fonts available on the user's system, the chosen font and what form of font substitution the rendering technology supports.
