Zde je ukázkový kód, který můžete použít a který vygeneruje tento výstup:
Kód vypadá takto:
val df1=sc.parallelize(Seq((0,"v1"),(0,"v2"),(1,"v3"),(1,"v1"))).toDF("id","values")
val df2=sc.parallelize(Seq((0,"a1","b1","-"),(1,"a2","-","b2"))).toDF("id","v1","v2","v3")
val joinedDF=df1.join(df2,"id")
val resultDF=joinedDF.rdd.map{row=>
val id=row.getAs[Int]("id")
val values=row.getAs[String]("values")
val feilds=row.getAs[String](values)
(id,values,feilds)
}.toDF("id","values","feilds")
Během testování na konzoli:
scala> val df1=sc.parallelize(Seq((0,"v1"),(0,"v2"),(1,"v3"),(1,"v1"))).toDF("id","values")
df1: org.apache.spark.sql.DataFrame = [id: int, values: string]
scala> df1.show
+---+------+
| id|values|
+---+------+
| 0| v1|
| 0| v2|
| 1| v3|
| 1| v1|
+---+------+
scala> val df2=sc.parallelize(Seq((0,"a1","b1","-"),(1,"a2","-","b2"))).toDF("id","v1","v2","v3")
df2: org.apache.spark.sql.DataFrame = [id: int, v1: string ... 2 more fields]
scala> df2.show
+---+---+---+---+
| id| v1| v2| v3|
+---+---+---+---+
| 0| a1| b1| -|
| 1| a2| -| b2|
+---+---+---+---+
scala> val joinedDF=df1.join(df2,"id")
joinedDF: org.apache.spark.sql.DataFrame = [id: int, values: string ... 3 more fields]
scala> joinedDF.show
+---+------+---+---+---+
| id|values| v1| v2| v3|
+---+------+---+---+---+
| 1| v3| a2| -| b2|
| 1| v1| a2| -| b2|
| 0| v1| a1| b1| -|
| 0| v2| a1| b1| -|
+---+------+---+---+---+
scala> val resultDF=joinedDF.rdd.map{row=>
| val id=row.getAs[Int]("id")
| val values=row.getAs[String]("values")
| val feilds=row.getAs[String](values)
| (id,values,feilds)
| }.toDF("id","values","feilds")
resultDF: org.apache.spark.sql.DataFrame = [id: int, values: string ... 1 more field]
scala>
scala> resultDF.show
+---+------+------+
| id|values|feilds|
+---+------+------+
| 1| v3| b2|
| 1| v1| a2|
| 0| v1| a1|
| 0| v2| b1|
+---+------+------+
Doufám, že to může být váš problém. Dík!