python - Split RDD for K-fold validation: pyspark


I have a dataset and want to apply Naive Bayes to it, validating with the k-fold technique. The data has 2 classes and is ordered, i.e. if the dataset has 100 rows, the first 50 belong to one class and the next 50 to the second class. Hence, I first want to shuffle the data and then randomly form the k folds. The problem is that when I try randomSplit on the RDD, it creates RDDs of different sizes. My code and an example of the dataset follow:

documentDF = sqlContext.createDataFrame([
    (0, "this cat".lower().split(" "),),
    (0, "this dog".lower().split(" "),),
    (0, "this pig".lower().split(" "),),
    (0, "this mouse".lower().split(" "),),
    (0, "this donkey".lower().split(" "),),
    (0, "this monkey".lower().split(" "),),
    (0, "this horse".lower().split(" "),),
    (0, "this goat".lower().split(" "),),
    (0, "this tiger".lower().split(" "),),
    (0, "this lion".lower().split(" "),),
    (1, "a mouse , pig friends".lower().split(" "),),
    (1, "a pig , dog friends".lower().split(" "),),
    (1, "a mouse , cat friends".lower().split(" "),),
    (1, "a lion , tiger friends".lower().split(" "),),
    (1, "a lion , goat friends".lower().split(" "),),
    (1, "a monkey , goat friends".lower().split(" "),),
    (1, "a monkey , donkey friends".lower().split(" "),),
    (1, "a horse , donkey friends".lower().split(" "),),
    (1, "a horse , tiger friends".lower().split(" "),),
    (1, "a cat , dog friends".lower().split(" "),)
], ["label", "text"])

from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.regression import LabeledPoint

def mapper_vector(x):
    # NOTE: LabeledPoint expects numeric features; in the full pipeline the
    # tokens would first be vectorized (e.g. with the imported CountVectorizer).
    row = x.text
    return LabeledPoint(x.label, row)

splitSize = [0.2] * 5
print("splitSize" + str(splitSize))
print(sum(splitSize))
vect = documentDF.rdd.map(lambda x: mapper_vector(x))
splits = vect.randomSplit(splitSize, seed=0)

print("***********SPLITS**************")
for i in range(len(splits)):
    print("split" + str(i) + ":" + str(len(splits[i].collect())))

This code outputs:

splitSize[0.2, 0.2, 0.2, 0.2, 0.2]
1.0
***********SPLITS**************
split0:1
split1:5
split2:3
split3:5
split4:6

The documentDF has 20 rows. I wanted 5 distinct, mutually exclusive samples of the dataset, all of the same size. However, as can be seen, the splits have different sizes. What am I doing wrong?

EDIT: According to zero323, I am not doing anything wrong. Then, if I want the final result (as described) without using the ML CrossValidator, what do I need to change? Also, why are the numbers different? If each split has equal weightage, aren't they supposed to have an equal number of rows? Also, is there another way to randomize the data?

You're not doing anything wrong. randomSplit simply doesn't provide hard guarantees regarding the data distribution. It uses a BernoulliCellSampler (see How does Sparks RDD.randomSplit actually split the RDD), and the exact fractions can differ from run to run. This is normal behavior and should be perfectly acceptable on any real-sized data set, where the differences should be statistically insignificant.
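If you do need folds of (nearly) identical size, one possible workaround, not part of the original answer, is to shuffle the rows with a random sort key and then deal them into k folds round-robin by position. A minimal sketch, assuming the vect RDD from the question:

import random

k = 5
# Shuffle by sorting on a random key; cache the indexed RDD so that a
# recomputation does not reshuffle the rows differently.
shuffled = vect.sortBy(lambda _: random.random())
indexed = shuffled.zipWithIndex().cache()   # pairs of (row, position)
# Deal rows into k folds round-robin; fold sizes differ by at most one row.
# The i=i default argument pins the fold index, since filter is evaluated lazily.
folds = [indexed.filter(lambda pair, i=i: pair[1] % k == i).keys()
         for i in range(k)]

for i, fold in enumerate(folds):
    print("fold" + str(i) + ":" + str(fold.count()))   # 4 rows per fold for 20 rows

For each fold i, the test set is folds[i] and the training set is the union of the remaining folds.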

On a side note, Spark ML provides a CrossValidator which can be used with ML pipelines (see How to cross validate RandomForest model? for example usage).
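For completeness, a minimal sketch of that route using the DataFrame-based pyspark.ml API, with the documentDF from the question; treat it as illustrative rather than a drop-in solution:

from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Vectorize the tokens and feed the counts to Naive Bayes in one pipeline.
pipeline = Pipeline(stages=[
    CountVectorizer(inputCol="text", outputCol="features"),
    NaiveBayes(labelCol="label", featuresCol="features"),
])

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=ParamGridBuilder().build(),
                          evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                          numFolds=5)   # CrossValidator handles the k-fold splitting itself

cvModel = crossval.fit(documentDF)     # best model, refit on the full data
print(cvModel.avgMetrics)              # average metric (f1 by default) across the folds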

