Loading Customized Transformers in Apache Spark raises a NullPointerException


I have built a couple of customized Transformers and Estimators to meet the needs of my data and pipelines, using Apache Spark 2.0 in Java.
Briefly, one of my customized Transformers manipulates a couple of columns (i.e., pre-processing) based on some statistical techniques. For these columns I defined the parameter inputCols as a StringArrayParam and implemented the interface HasInputCols, plus the interfaces MLWritable and MLReadable to get save/load behavior for my model, for reusability's sake.
My intention was to create a Pipeline that fits a Dataset representing my data, so that the training phase produces a PipelineModel encapsulating all the stages present in my pipeline, each one being a Transformer chained into it.
Apparently there is no error when saving the resulting PipelineModel: I inspected each saved/serialized stage, and one of those stages is exactly my customized Transformer.
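For context, the training job fits and saves the pipeline roughly like this (a minimal sketch; the column names, the extra stages, and the paths are placeholders):

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("train").getOrCreate();
Dataset<Row> trainData = spark.read().parquet("data/train");  // placeholder path

// Placeholder column names; the real ones come from my feature set.
TxPreProcessingTransformer preProcessing =
    new TxPreProcessingTransformer().setInputcols(new String[] {"amount", "balance"});

Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[] {preProcessing /* , other stages */});
PipelineModel model = pipeline.fit(trainData);
model.write().overwrite().save("models/independent_gaussian_pipeline");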
However, when I load the serialized PipelineModel in the Apache Spark job responsible for testing/transforming data, I get the infamous NullPointerException. I suspect this may be related either to the way I'm saving my customized Transformer, or to something deeper in how Java interfaces with Scala through reflection, since Apache Spark is backed by Scala. I admit I have inspected and tested the code for my Transformer and its reader and writer countless times, but I have been unable to solve this issue.
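The failing call is essentially just this (a minimal sketch; AnomalyTxPipeline.java:62 in the stack trace below is the load line, and the paths are placeholders):

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("transform").getOrCreate();
Dataset<Row> testData = spark.read().parquet("data/test");  // placeholder path

// This load throws the NullPointerException below, before any data is transformed.
PipelineModel model = PipelineModel.load("models/independent_gaussian_pipeline");
Dataset<Row> transformed = model.transform(testData);  // never reached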
For those who would like to take a look or even discuss it here, I'm pasting the stack trace of my Spark job when it tries to load the PipelineModel, along with snippets of my customized Transformer and its Reader and Writer classes.
The stack trace containing the error:
17/01/24 06:50:58 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
17/01/24 06:50:58 INFO HadoopRDD: Input split: file:~/models/independent_gaussian_pipeline/stages/0_TxPreProcessingTransformer_fb7714b772b1/metadata/part-00000:0+243
17/01/24 06:50:58 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1164 bytes result sent to driver
17/01/24 06:50:58 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 42 ms on localhost (1/1)
17/01/24 06:50:58 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/01/24 06:50:58 INFO DAGScheduler: ResultStage 1 (first at ReadWrite.scala:391) finished in 0.044 s
17/01/24 06:50:58 INFO DAGScheduler: Job 1 finished: first at ReadWrite.scala:391, took 0.076893 s
Exception in thread "main" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:447)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:267)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:265)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:265)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:341)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:335)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:227)
at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:325)
at org.apache.spark.ml.PipelineModel.load(Pipeline.scala)
at taka.pipelines.AnomalyTxPipeline.main(AnomalyTxPipeline.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
My customized Transformer is called TxPreProcessingTransformer, and here is its code:
import java.io.IOException;

import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.ml.param.StringArrayParam;
import org.apache.spark.ml.param.shared.HasInputCols;
import org.apache.spark.ml.util.Identifiable$;
import org.apache.spark.ml.util.MLReadable;
import org.apache.spark.ml.util.MLReader;
import org.apache.spark.ml.util.MLWritable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class TxPreProcessingTransformer extends Transformer
    implements MLWritable, MLReadable<TxPreProcessingTransformer>, HasInputCols {

  private static final long serialVersionUID = 596263430109672895L;
  private static final String uidStr =
      Identifiable$.MODULE$.randomUID("TxPreProcessingTransformer");

  private StringArrayParam inputCols;

  @Override
  public Dataset transform(Dataset txs) {
    // Register the pre-processing UDF and apply it to every configured input column.
    txs.sqlContext().udf().register("absShiftLog10",
        new FeaturesPreProcessingTransformerUdf(100.0), DataTypes.DoubleType);
    for (String columnName : this.get(inputCols).get()) {
      txs = txs.withColumn(columnName,
          org.apache.spark.sql.functions.callUDF("absShiftLog10", txs.col(columnName)));
    }
    return txs;
  }

  @Override
  public StructType transformSchema(StructType structType) {
    return structType;
  }

  @Override
  public TxPreProcessingTransformer copy(ParamMap paramMap) {
    return defaultCopy(paramMap);
  }

  @Override
  public String uid() {
    return uidStr;
  }

  @Override
  public TxPreProcessingWriter write() {
    return new TxPreProcessingWriter(this);
  }

  @Override
  public void save(String path) throws IOException {
    write().saveImpl(path);
  }

  @Override
  public MLReader<TxPreProcessingTransformer> read() {
    return new TxPreProcessingReader();
  }

  @Override
  public TxPreProcessingTransformer load(String path) {
    return ((TxPreProcessingReader) read()).load(path);
  }

  // Setter required when implementing the Scala trait HasInputCols from Java.
  @Override
  public void org$apache$spark$ml$param$shared$HasInputCols$_setter_$inputCols_$eq(
      StringArrayParam stringArrayParam) {
    this.inputCols = stringArrayParam;
  }

  @Override
  public StringArrayParam inputCols() {
    return new StringArrayParam(this, "inputCols", "Name of columns to be pre-processed");
  }

  @Override
  public String[] getInputCols() {
    return this.get(inputCols).get();
  }

  public TxPreProcessingTransformer setInputcols(String[] value) {
    inputCols = inputCols();
    return (TxPreProcessingTransformer) set(inputCols, value);
  }
}
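(FeaturesPreProcessingTransformerUdf itself is omitted above. For illustration only, a hypothetical sketch matching its name and the 100.0 shift could look like the following; its body shouldn't matter here, because the NPE is thrown while loading, before transform ever runs.)

import org.apache.spark.sql.api.java.UDF1;

// Hypothetical sketch: log10 of the absolute value plus a constant shift.
public class FeaturesPreProcessingTransformerUdf implements UDF1<Double, Double> {
  private final double shift;

  public FeaturesPreProcessingTransformerUdf(double shift) {
    this.shift = shift;
  }

  @Override
  public Double call(Double value) {
    return Math.log10(Math.abs(value) + shift);
  }
}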
Here is the implementation of my transformer's reader:
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.spark.ml.util.DefaultParamsReader;
import org.apache.spark.ml.util.DefaultParamsReader$;
import org.apache.spark.ml.util.MLReader;
import org.apache.spark.sql.Row;

public class TxPreProcessingReader extends MLReader<TxPreProcessingTransformer> {

  private String className = TxPreProcessingTransformer.class.getName();

  public String getClassName() {
    return className;
  }

  public void setClassName(String className) {
    this.className = className;
  }

  @Override
  public TxPreProcessingTransformer load(String path) {
    DefaultParamsReader.Metadata metadata =
        DefaultParamsReader$.MODULE$.loadMetadata(path, sc(), className);

    // Read back the inputCols array written by TxPreProcessingWriter under <path>/data.
    String dataPath = new Path(path, "data").toString();
    Row row = sparkSession().read().parquet(dataPath).select("inputCols").head();
    List<String> listFeatureNames = row.getList(0);
    String[] featureNames = listFeatureNames.toArray(new String[listFeatureNames.size()]);

    TxPreProcessingTransformer transformer =
        new TxPreProcessingTransformer().setInputcols(featureNames);
    DefaultParamsReader$.MODULE$.getAndSetParams(transformer, metadata);
    return transformer;
  }
}
And below is my writer, showing how I'm saving the transformer:
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.spark.ml.util.DefaultParamsWriter$;
import org.apache.spark.ml.util.MLWriter;

public class TxPreProcessingWriter extends MLWriter {

  private TxPreProcessingTransformer instance;

  public TxPreProcessingWriter(TxPreProcessingTransformer instance) {
    this.instance = instance;
  }

  public TxPreProcessingTransformer getInstance() {
    return instance;
  }

  public void setInstance(TxPreProcessingTransformer instance) {
    this.instance = instance;
  }

  @Override
  public void saveImpl(String path) {
    // Write the standard params metadata, then the inputCols array under <path>/data.
    DefaultParamsWriter$.MODULE$.saveMetadata(instance, path, sc(),
        DefaultParamsWriter$.MODULE$.getMetadataToSave$default$3(),
        DefaultParamsWriter$.MODULE$.getMetadataToSave$default$4());

    Data data = new Data();
    data.setInputCols(instance.getInputCols());
    List<Data> listData = new ArrayList<>();
    listData.add(data);

    String dataPath = new Path(path, "data").toString();
    sparkSession().createDataFrame(listData, Data.class).repartition(1).write().parquet(dataPath);
  }

  public static class Data implements Serializable {
    private static final long serialVersionUID = -7753295698381203425L;

    String[] inputCols;

    public String[] getInputCols() {
      return inputCols;
    }

    public void setInputCols(String[] inputCols) {
      this.inputCols = inputCols;
    }
  }
}
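For reference, this is how the serialized stage can be inspected (a sketch; the stage directory name comes from the HadoopRDD input split in the stack trace above, and the base path is a placeholder):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("inspect").getOrCreate();
String stageDir =
    "models/independent_gaussian_pipeline/stages/0_TxPreProcessingTransformer_fb7714b772b1";

// Params metadata written by DefaultParamsWriter.saveMetadata (JSON lines).
spark.read().text(stageDir + "/metadata").show(false);

// The inputCols array written by TxPreProcessingWriter under <path>/data.
spark.read().parquet(stageDir + "/data").show(false);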
