Loading customized Transformers in Apache Spark raises a NullPointerException


I have built a couple of customized Transformers and Estimators to meet the needs of my data and pipelines, using Apache Spark 2.0 in Java.
Briefly, one of my customized Transformers pre-processes a couple of columns based on some statistical techniques. For these columns I defined the parameter inputCols as a StringArrayParam and implemented the HasInputCols interface, plus the MLWritable and MLReadable interfaces so that the model can be saved and loaded for reuse.
My intention with all of this is to create a Pipeline that fits a Dataset representing my data; the training phase produces a PipelineModel encapsulating all the stages of the pipeline, each of which is one chained Transformer.
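For context, the training job assembles and persists the pipeline roughly as sketched below; the column names, the list of stages and the output path are simplified/hypothetical:

TxPreProcessingTransformer preProcessing =
    new TxPreProcessingTransformer().setInputcols(new String[]{"amount", "balance"}); // hypothetical columns
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[]{preProcessing /* , other stages */});
PipelineModel pipelineModel = pipeline.fit(trainingData);                // trainingData is a Dataset<Row>
pipelineModel.write().overwrite().save("models/independent_gaussian_pipeline"); // may throw IOException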
Apparently there is no error when saving the resulting PipelineModel: I inspected each saved/serialized stage, and one of those stages is exactly my customized Transformer.
However, when I load that serialized PipelineModel in the Spark job responsible for testing/transforming the data, I get the infamous NullPointerException. I suspect it is related either to the way I am saving my customized Transformer or to something deeper in how Java interfaces with Scala through reflection, since Apache Spark is implemented in Scala. I have inspected and tested the code of my Transformer and its Reader and Writer countless times, but I have been unable to solve this issue.
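For reference, the failing call (AnomalyTxPipeline.java:62 in the stack trace below) is essentially the standard load, roughly:

PipelineModel pipelineModel =
    PipelineModel.load("models/independent_gaussian_pipeline");   // NullPointerException is thrown here
Dataset<Row> transformed = pipelineModel.transform(testData);     // testData is a Dataset<Row>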
For those who would like to take a look or discuss it, below are the stack trace of my Spark job when it tries to load the PipelineModel, followed by the snippets of my customized Transformer and its Reader and Writer classes.
The stack trace containing the error:
17/01/24 06:50:58 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
17/01/24 06:50:58 INFO HadoopRDD: Input split: file:~/models/independent_gaussian_pipeline/stages/0_TxPreProcessingTransformer_fb7714b772b1/metadata/part-00000:0+243
17/01/24 06:50:58 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1164 bytes result sent to driver
17/01/24 06:50:58 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 42 ms on localhost (1/1)
17/01/24 06:50:58 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/01/24 06:50:58 INFO DAGScheduler: ResultStage 1 (first at ReadWrite.scala:391) finished in 0.044 s
17/01/24 06:50:58 INFO DAGScheduler: Job 1 finished: first at ReadWrite.scala:391, took 0.076893 s
Exception in thread "main" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:447)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:267)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:265)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:265)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:341)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:335)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:227)
at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:325)
at org.apache.spark.ml.PipelineModel.load(Pipeline.scala)
at taka.pipelines.AnomalyTxPipeline.main(AnomalyTxPipeline.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
My customized model is called TxPreProcessingTransformer and here is its code:
public class TxPreProcessingTransformer extends Transformer
    implements MLWritable, MLReadable<TxPreProcessingTransformer>, HasInputCols {

  private static final long serialVersionUID = 596263430109672895L;
  private static final String uidStr =
      Identifiable$.MODULE$.randomUID("TxPreProcessingTransformer");
  private StringArrayParam inputCols;

  @Override
  public Dataset transform(Dataset txs) {
    txs.sqlContext().udf().register("absShiftLog10",
        new FeaturesPreProcessingTransformerUdf(100.0), DataTypes.DoubleType);
    for (String columnName : this.get(inputCols).get()) {
      txs = txs.withColumn(columnName,
          org.apache.spark.sql.functions.callUDF("absShiftLog10", txs.col(columnName)));
    }
    return txs;
  }

  @Override
  public StructType transformSchema(StructType structType) {
    return structType;
  }

  @Override
  public TxPreProcessingTransformer copy(ParamMap paramMap) {
    return defaultCopy(paramMap);
  }

  @Override
  public String uid() {
    return uidStr;
  }

  @Override
  public TxPreProcessingWriter write() {
    return new TxPreProcessingWriter(this);
  }

  @Override
  public void save(String path) throws IOException {
    write().saveImpl(path);
  }

  @Override
  public MLReader<TxPreProcessingTransformer> read() {
    return new TxPreProcessingReader();
  }

  @Override
  public TxPreProcessingTransformer load(String path) {
    return ((TxPreProcessingReader) read()).load(path);
  }

  @Override
  public void org$apache$spark$ml$param$shared$HasInputCols$_setter_$inputCols_$eq(
      StringArrayParam stringArrayParam) {
    this.inputCols = stringArrayParam;
  }

  @Override
  public StringArrayParam inputCols() {
    return new StringArrayParam(this, "inputCols", "Name of columns to be pre-processed");
  }

  @Override
  public String[] getInputCols() {
    return this.get(inputCols).get();
  }

  public TxPreProcessingTransformer setInputcols(String[] value) {
    inputCols = inputCols();
    return (TxPreProcessingTransformer) set(inputCols, value);
  }
}
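For anyone who wants to reproduce the problem outside the full pipeline, a standalone round trip of just this transformer would look roughly like this (path and column name are hypothetical; save declares IOException):

TxPreProcessingTransformer original =
    new TxPreProcessingTransformer().setInputcols(new String[]{"amount"});
original.save("/tmp/tx_preprocessing");                                 // throws IOException
TxPreProcessingTransformer restored =
    new TxPreProcessingTransformer().load("/tmp/tx_preprocessing");     // delegates to TxPreProcessingReader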
Here is the implementation of my transformer's reader:
public class TxPreProcessingReader extends MLReader<TxPreProcessingTransformer> {

  private String className = TxPreProcessingTransformer.class.getName();

  public String getClassName() {
    return className;
  }

  public void setClassName(String className) {
    this.className = className;
  }

  @Override
  public TxPreProcessingTransformer load(String path) {
    DefaultParamsReader.Metadata metadata =
        DefaultParamsReader$.MODULE$.loadMetadata(path, sc(), className);

    String dataPath = new Path(path, "data").toString();
    Row row = sparkSession().read().parquet(dataPath).select("inputCols").head();
    List<String> listFeatureNames = row.getList(0);
    String[] featureNames = new String[listFeatureNames.size()];
    featureNames = listFeatureNames.toArray(featureNames);

    TxPreProcessingTransformer transformer =
        new TxPreProcessingTransformer().setInputcols(featureNames);
    DefaultParamsReader$.MODULE$.getAndSetParams(transformer, metadata);
    return transformer;
  }
}
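The loadMetadata call above reads the JSON that was written under the stage's metadata/ directory (it records, among other things, the class name and the paramMap). That file can be inspected directly; the stage directory name below is copied from the stack trace, and spark stands for the job's SparkSession:

spark.read()
    .text("models/independent_gaussian_pipeline/stages/0_TxPreProcessingTransformer_fb7714b772b1/metadata")
    .show(false);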
And below is my writer, showing how I am saving the transformer:
public class TxPreProcessingWriter extends MLWriter {

  private TxPreProcessingTransformer instance;

  public TxPreProcessingWriter(TxPreProcessingTransformer instance) {
    this.instance = instance;
  }

  public TxPreProcessingTransformer getInstance() {
    return instance;
  }

  public void setInstance(TxPreProcessingTransformer instance) {
    this.instance = instance;
  }

  @Override
  public void saveImpl(String path) {
    DefaultParamsWriter$.MODULE$.saveMetadata(instance, path, sc(),
        DefaultParamsWriter$.MODULE$.getMetadataToSave$default$3(),
        DefaultParamsWriter$.MODULE$.getMetadataToSave$default$4());

    Data data = new Data();
    data.setInputCols(instance.getInputCols());
    List<Data> listData = new ArrayList<>();
    listData.add(data);

    String dataPath = new Path(path, "data").toString();
    sparkSession().createDataFrame(listData, Data.class).repartition(1).write().parquet(dataPath);
  }

  public static class Data implements Serializable {
    private static final long serialVersionUID = -7753295698381203425L;

    String[] inputCols;

    public String[] getInputCols() {
      return inputCols;
    }

    public void setInputCols(String[] inputCols) {
      this.inputCols = inputCols;
    }
  }
}
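When the fitted pipeline is saved, Spark ends up invoking this writer for the stage's own subdirectory. Calling it directly makes it easy to check what actually lands on disk; the path below is hypothetical, transformer is an already-configured TxPreProcessingTransformer, and spark is the job's SparkSession:

new TxPreProcessingWriter(transformer).saveImpl("/tmp/tx_preprocessing");
// expected layout afterwards:
//   /tmp/tx_preprocessing/metadata/part-00000   JSON written by DefaultParamsWriter.saveMetadata
//   /tmp/tx_preprocessing/data/part-*.parquet   single-row DataFrame holding inputCols
spark.read().parquet("/tmp/tx_preprocessing/data").show(false);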
