removing characters of a specific unicode range from a string


I have a program that parses tweets in real time from the Twitter Streaming API. Before storing them, I encode them as UTF-8. Certain characters end up appearing in the string as ?, ??, or ??? instead of their respective Unicode code points, and they cause problems. Upon further investigation, I found that the problematic characters are from the "Emoticons" block, U+1F600 - U+1F64F, and the "Miscellaneous Symbols And Pictographs" block, U+1F300 - U+1F5FF. I tried removing them, but was unsuccessful: the matcher ended up replacing almost every character in the string, not just those in my desired Unicode range.
String utf8tweet = "";
try {
byte[] utf8Bytes = status.getText().getBytes("UTF-8");
utf8tweet = new String(utf8Bytes, "UTF-8");
}
catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
Pattern unicodeOutliers = Pattern.compile("[\\u1f300-\\u1f64f]", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
What can I do to remove these characters?
Use a negated character class (the ^ operator) in the regex pattern. To keep only ASCII characters, you could use the expression [^\\x00-\\x7F], which matches everything outside the first 128 code points; replacing those matches should give the desired result.
import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UTF8 {
    public static void main(String[] args) {
        String utf8tweet = "";
        try {
            byte[] utf8Bytes = "#Hello twitter  How are you?".getBytes("UTF-8");
            utf8tweet = new String(utf8Bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        Pattern unicodeOutliers = Pattern.compile("[^\\x00-\\x7F]",
                Pattern.UNICODE_CASE | Pattern.CANON_EQ
                        | Pattern.CASE_INSENSITIVE);
        Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
        System.out.println("Before: " + utf8tweet);
        utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
        System.out.println("After: " + utf8tweet);
    }
}
Results in the following output:
Before: #Hello twitter  How are you?
After: #Hello twitter How are you?
EDIT
To explain further, you can also express the range in \u form as [^\\u0000-\\u007F], which matches every character outside the first 128 Unicode characters (the same as before). If you want to extend the range to keep extra characters, you can do so using the Unicode character list here.
For example, if you want to keep accented vowels (used in Spanish), extend the range to \u00FF, which gives [^\\u0000-\\u00FF] or [^\\x00-\\xFF]:
Before: #Hello twitter  How are you? á é í ó ú
After: #Hello twitter How are you? á é í ó ú
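The two ranges above can be compared side by side. The following is only a sketch: the class name RangeFilter, its constants, and the filter method are hypothetical names chosen for illustration, and the sample text is made up.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class RangeFilter {
    // Matches every code point outside ASCII (U+0000..U+007F).
    static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");
    // Matches every code point outside Latin-1 (U+0000..U+00FF).
    static final Pattern NON_LATIN1 = Pattern.compile("[^\\x00-\\xFF]");

    // Replaces everything the given pattern matches with the replacement text.
    static String filter(Pattern outside, String input, String replacement) {
        return outside.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // "café" followed by a space and one emoji (U+1F600 as a surrogate pair).
        String text = "caf\u00E9 \uD83D\uDE00";
        System.out.println(filter(NON_ASCII, text, ""));  // drops the accent and the emoji
        System.out.println(filter(NON_LATIN1, text, "")); // drops only the emoji
    }
}
```

Widening the upper bound of the range is the only change between the two patterns, so further ranges can be kept the same way.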
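First of all, the Unicode block concerned is specified in Java (strictly following the standard) as Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS. In a regex, you can remove the general category So ("Symbol, other"), which covers most of these pictographs but also other, non-emoji symbols: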
s = s.replaceAll("\\p{So}+", "");
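If only the two blocks named in the question should go, one option is to test each code point against Character.UnicodeBlock directly instead of using a regex. This is a minimal sketch under that assumption; the class BlockFilter and its method names are hypothetical.

```java
// Hypothetical helper, not part of the original answer.
class BlockFilter {
    // True for code points in the two blocks named in the question.
    static boolean isUnwanted(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.EMOTICONS
            || block == Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS;
    }

    // Copies the input code point by code point, skipping the unwanted blocks.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!isUnwanted(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 (Emoticons) and U+1F30D (Miscellaneous Symbols And Pictographs)
        System.out.println(strip("Hi \uD83D\uDE00 there \uD83C\uDF0D!"));
    }
}
```

Both block constants require Java 7 or later. Symbols in other blocks, such as U+2603 in Miscellaneous Symbols, are left untouched by this version.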
I tried this. The Unicode ranges are taken from the emoji ranges:
import com.google.common.base.Strings; // Guava, needed for isNullOrEmpty
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiEraser {
    // Each alternative is a character class over one range;
    // supplementary code points are written as surrogate pairs.
    private static final String EMOJI_RANGE_REGEX =
        "[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]|[\u2700-\u27BF]";
    private static final Pattern PATTERN = Pattern.compile(EMOJI_RANGE_REGEX);

    /**
     * Finds and removes emojis from {@code input}.
     *
     * @param input the input string potentially containing emojis (comes as unicode stringified)
     * @return the input string with emojis removed
     */
    public String eraseEmojis(String input) {
        if (Strings.isNullOrEmpty(input)) {
            return input;
        }
        Matcher matcher = PATTERN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // Replace each match with the empty string.
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        return sb.toString();
    }
}
Assuming status.getText() returns a java.lang.String...
byte[] utf8Bytes = status.getText().getBytes("UTF-8");
utf8tweet = new String(utf8Bytes, "UTF-8");
The above transcoding operation produces the same results as:
utf8tweet = status.getText();
Java strings are implicitly UTF-16. UTF-16 and UTF-8 share the same character set (Unicode) so transforming from one to the other and back results in the original data.
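The round trip can be checked directly. This is a small sketch; the class RoundTrip, the method viaUtf8, and the sample text are hypothetical and stand in for status.getText().

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class RoundTrip {
    // Encodes to UTF-8 bytes and decodes them straight back.
    static String viaUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample text with one emoji (U+1F600), written as a surrogate pair.
        String tweet = "Hello \uD83D\uDE00";
        // For a well-formed string the round trip changes nothing.
        System.out.println(viaUtf8(tweet).equals(tweet)); // prints "true"
        // The emoji takes 4 bytes in UTF-8, so this prints 10 (6 + 4).
        System.out.println(tweet.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Using the StandardCharsets constants (Java 7+) also avoids the checked UnsupportedEncodingException from the string-named overloads.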
Java regular expressions support the supplementary range using surrogate pairs. You can match them as described in the answers to this question.
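As an alternative to writing surrogate pairs, Java 7 and later accept the \x{h...h} form for a full code point. It also points at a likely cause of the over-matching in the question: \u takes exactly four hex digits, so \u1f300 is read as U+1F30 followed by a literal 0, and \u1f64f as U+1F64 followed by a literal f. The class therefore contains the range from '0' (U+0030) to U+1F64, which covers most ordinary text. The sketch below uses hypothetical names (CodePointRange, strip).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class CodePointRange {
    // \x{...} names a whole code point, so this class spans U+1F300..U+1F64F,
    // i.e. both blocks mentioned in the question.
    static final Pattern EMOJI = Pattern.compile("[\\x{1F300}-\\x{1F64F}]");

    // Removes every code point in the range from the input.
    static String strip(String input) {
        return EMOJI.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Hi \uD83D\uDE00 there")); // emoji removed, text kept
    }
}
```

No flags are needed here; CASE_INSENSITIVE and CANON_EQ have no bearing on this range.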
As eee notes in his comment, you most likely have a font issue. Whether a grapheme can be displayed usually depends on the fonts available on the user's system, the chosen font and what form of font substitution the rendering technology supports.
