java


Removing characters of a specific Unicode range from a string


I have a program that parses tweets in real time from the Twitter streaming API. Before storing them, I encode them as UTF-8. Certain characters end up appearing in the string as ?, ??, or ??? instead of their respective Unicode code points and cause problems. Upon further investigation, I found that the problematic characters are from the "Emoticons" block, U+1F600 - U+1F64F, and the "Miscellaneous Symbols And Pictographs" block, U+1F300 - U+1F5FF. I tried removing them, but was unsuccessful: the matcher ended up replacing almost every character in the string, not just those in my desired Unicode range.
String utf8tweet = "";
try {
    byte[] utf8Bytes = status.getText().getBytes("UTF-8");
    utf8tweet = new String(utf8Bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
Pattern unicodeOutliers = Pattern.compile("[\\u1f300-\\u1f64f]", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
What can I do to remove these characters?
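Add the negation operator ^ to the character class in the regex pattern. To keep only ASCII characters you could use the expression [^\\x00-\\x7F], which matches every character outside the first 128 code points, and you should get the desired result. Note that this removes all non-ASCII characters, not just the emoji.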
import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UTF8 {
    public static void main(String[] args) {
        String utf8tweet = "";
        try {
            byte[] utf8Bytes = "#Hello twitter  How are you?".getBytes("UTF-8");
            utf8tweet = new String(utf8Bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        Pattern unicodeOutliers = Pattern.compile("[^\\x00-\\x7F]",
                Pattern.UNICODE_CASE | Pattern.CANON_EQ
                        | Pattern.CASE_INSENSITIVE);
        Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
        System.out.println("Before: " + utf8tweet);
        utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
        System.out.println("After: " + utf8tweet);
    }
}
Results in the following output:
Before: #Hello twitter  How are you?
After: #Hello twitter How are you?
EDIT
To explain further, you could also express the range in the \u form as [^\\u0000-\\u007F], which matches every character that is not one of the first 128 Unicode characters (the same as before). If you want to extend the range to keep extra characters, you can do so using the Unicode character list.
For example, if you want to keep accented vowels (used in Spanish), extend the range to \u00FF, which gives [^\\u0000-\\u00FF] or [^\\x00-\\xFF]:
Before: #Hello twitter  How are you? á é í ó ú
After: #Hello twitter How are you? á é í ó ú
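As a rough side-by-side sketch of the two ranges above: the class and method names here (RangeFilterDemo, keepAscii, keepLatin1) are made up for illustration, and the emoticon is written as an escaped surrogate pair so the source file encoding does not matter.

```java
import java.util.regex.Pattern;

class RangeFilterDemo {
    // Matches anything outside 7-bit ASCII (U+0000 to U+007F).
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");
    // Matches anything outside Latin-1 (U+0000 to U+00FF), so accented vowels survive.
    private static final Pattern NON_LATIN1 = Pattern.compile("[^\\x00-\\xFF]");

    static String keepAscii(String s) {
        return NON_ASCII.matcher(s).replaceAll(" ");
    }

    static String keepLatin1(String s) {
        return NON_LATIN1.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        // \uD83D\uDE00 is the surrogate pair for U+1F600; \u00E1 is a-acute.
        String tweet = "Hola \uD83D\uDE00 \u00E1";
        System.out.println(keepAscii(tweet));   // emoticon and accented vowel both replaced
        System.out.println(keepLatin1(tweet));  // only the emoticon replaced
    }
}
```

Java's regex engine matches by code point, so each emoji (one surrogate pair) is replaced by a single space rather than two.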
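First of all, the Unicode block concerned is specified in Java (strictly following the standard) as Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS. In a regex, you can strip these symbols via the general category So (Symbol, other), which covers most characters in that block but also other symbols outside it: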
s = s.replaceAll("\\p{So}+", "");
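If you would rather target only the two blocks named in the question, a block-based pattern is one option. This is a minimal sketch assuming Java 7 or later (where these block names are available to \p{In...}); BlockFilterDemo and stripEmojiBlocks are hypothetical names.

```java
import java.util.regex.Pattern;

class BlockFilterDemo {
    // \p{In<BlockName>} selects a Unicode block by name.
    private static final Pattern EMOJI_BLOCKS = Pattern.compile(
            "[\\p{InEmoticons}\\p{InMiscellaneous_Symbols_And_Pictographs}]+");

    static String stripEmojiBlocks(String s) {
        return EMOJI_BLOCKS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1F600 (Emoticons), U+1F300 (Miscellaneous Symbols and Pictographs), U+00A9 (copyright sign)
        String s = "hi \uD83D\uDE00\uD83C\uDF00 \u00A9 2014";
        System.out.println(stripEmojiBlocks(s));          // block-based: keeps the copyright sign
        System.out.println(s.replaceAll("\\p{So}+", "")); // category-based: removes it as well
    }
}
```

The trade-off is scope: \p{So} is broader and will also drop symbols such as the copyright sign, while the block form only touches the listed blocks.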
I tried this. The Unicode ranges cover the emoji blocks:
import com.google.common.base.Strings; // Guava
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiEraser {
    // U+1F300-U+1F5FF, U+1F600-U+1F64F, U+1F680-U+1F6FF, U+2600-U+26FF, U+2700-U+27BF
    private static final String EMOJI_RANGE_REGEX =
            "[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]|[\u2700-\u27BF]";
    private static final Pattern PATTERN = Pattern.compile(EMOJI_RANGE_REGEX);

    /**
     * Finds and removes emojis from the input.
     *
     * @param input the input string potentially containing emojis
     * @return the input string with emojis removed
     */
    public String eraseEmojis(String input) {
        if (Strings.isNullOrEmpty(input)) {
            return input;
        }
        Matcher matcher = PATTERN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // Replace each match with the empty string
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        return sb.toString();
    }
}
Assuming status.getText() returns a java.lang.String...
byte[] utf8Bytes = status.getText().getBytes("UTF-8");
utf8tweet = new String(utf8Bytes, "UTF-8");
The above transcoding operation produces the same results as:
utf8tweet = status.getText();
Java strings are implicitly UTF-16. UTF-16 and UTF-8 share the same character set (Unicode) so transforming from one to the other and back results in the original data.
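A quick way to convince yourself of this, as a sketch (RoundTripDemo and roundTrip are illustrative names; StandardCharsets assumes Java 7+ and avoids the checked UnsupportedEncodingException):

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode to UTF-8 bytes, then decode straight back to a String.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \uD83D\uDE00 is the surrogate pair for U+1F600.
        String text = "tweet \uD83D\uDE00";
        System.out.println(roundTrip(text).equals(text)); // prints "true"
    }
}
```

The one caveat I am aware of is malformed input: an unpaired surrogate cannot be encoded and comes back as a replacement character, so the round trip is only an identity for well-formed strings.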
Java regular expressions support the supplementary range using surrogate pairs. You can match them as described in the answers to this question.
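For the range in the question, here is a sketch of two ways to do it, assuming Java 7+ for the \x{...} syntax; the class and method names are made up for illustration. As far as I can tell, the original pattern misbehaves because \u in a regex consumes exactly four hex digits, so "[\\u1f300-\\u1f64f]" is read as U+1F30, the range '0' to U+1F64, and the letter 'f', which covers most ordinary text.

```java
import java.util.regex.Pattern;

class SupplementaryRangeDemo {
    // \x{h...h} accepts code points above U+FFFF, unlike the four-digit \\uXXXX form.
    private static final Pattern EMOJI_RANGE = Pattern.compile("[\\x{1F300}-\\x{1F64F}]");

    static String stripRange(String s) {
        return EMOJI_RANGE.matcher(s).replaceAll("");
    }

    // Regex-free alternative: copy every code point that lies outside the range.
    static String stripRangeByCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x1F300 || cp > 0x1F64F) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // 2 chars for supplementary code points
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String tweet = "good morning \uD83D\uDE00 0f"; // contains U+1F600
        System.out.println(stripRange(tweet));
        System.out.println(stripRangeByCodePoint(tweet));
    }
}
```

Both methods leave the digits and letters alone and remove only code points in U+1F300 to U+1F64F.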
As eee notes in his comment, you most likely have a font issue. Whether a grapheme can be displayed usually depends on the fonts available on the user's system, the chosen font and what form of font substitution the rendering technology supports.
