Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

python 3.x - pyspark dataframes: Why can I select some nested fields but not others?

I'm trying to write some code to un-nest JSON into DataFrames using pyspark (3.0.1) on Python 3.9.1.

I have some dummy data with a schema as follows:

data.printSchema()
root
 |-- recordID: string (nullable = true)
 |-- customerDetails: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- dob: string (nullable = true)
 |-- familyMembers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- relationship: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- contactNumbers: struct (nullable = true)
 |    |    |    |-- work: string (nullable = true)
 |    |    |    |-- home: string (nullable = true)
 |    |    |-- addressDetails: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- addressType: string (nullable = true)
 |    |    |    |    |-- address: string (nullable = true)

When I select fields from familyMembers, I get the expected results:

data.select('familyMembers.contactNumbers.work').show(truncate=False)
+------------------------------------------------+
|work                                            |
+------------------------------------------------+
|[(07) 4612 3880, (03) 5855 2377, (07) 4979 1871]|
|[(07) 4612 3880, (03) 5855 2377]                |
+------------------------------------------------+

data.select('familyMembers.name').show(truncate=False)
+------------------------------------+
|name                                |
+------------------------------------+
|[Jane Smith, Bob Smith, Simon Smith]|
|[Jackie Sacamano, Simon Sacamano]   |
+------------------------------------+

Yet when I try to select fields from the addressDetails ArrayType (nested beneath familyMembers), I get an error:

>>> data.select('familyMembers.addressDetails.address').show(truncate=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1421, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/usr/local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/usr/local/lib/python3.9/site-packages/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: cannot resolve '`familyMembers`.`addressDetails`['address']' due to data type mismatch: argument 2 requires integral type, however, ''address'' is of string type.;;
'Project [familyMembers#71.addressDetails[address] AS address#277]
+- LogicalRDD [recordID#69, customerDetails#70, familyMembers#71], false

I'm confused. Both familyMembers and addressDetails are ArrayTypes, yet selecting from one works but not the other. Is there an explanation for this, or something I've missed? Is it because one is nested within the other?

Code to reproduce (with just 1 record):

from pyspark.sql.types import StructType
from pyspark.sql import SparkSession, DataFrame
import json

rawdata = [{"recordID":"abc-123","customerDetails":{"name":"John Smith","dob":"1980-04-23"},"familyMembers":[{"relationship":"mother","name":"Jane Smith","contactNumbers":{"work":"(07) 4612 3880","home":"(08) 8271 1577"},"addressDetails":[{"addressType":"residential","address":"29 Commonwealth St, Clifton, QLD 4361 "},{"addressType":"work","address":"20 A Yeo Ave, Highgate, SA 5063 "}]},{"relationship":"father","name":"Bob Smith","contactNumbers":{"work":"(03) 5855 2377","home":"(03) 9773 2483"},"addressDetails":[{"addressType":"residential","address":"1735 Fenaughty Rd, Kyabram South, VIC 3620"},{"addressType":"work","address":"12 Haldane St, Bonbeach, VIC 3196 "}]},{"relationship":"brother","name":"Simon Smith","contactNumbers":{"work":"(07) 4979 1871","home":"(08) 9862 6017"},"addressDetails":[{"addressType":"residential","address":"6 Darren St, Sun Valley, QLD 4680"},{"addressType":"work","address":"Arthur River, WA 6315"}]}]},]
strschema = '{"fields":[{"metadata":{},"name":"recordID","nullable":true,"type":"string"},{"metadata":{},"name":"customerDetails","nullable":true,"type":{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"dob","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"familyMembers","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"relationship","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"contactNumbers","nullable":true,"type":{"fields":[{"metadata":{},"name":"work","nullable":true,"type":"string"},{"metadata":{},"name":"home","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"addressDetails","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"addressType","nullable":true,"type":"string"},{"metadata":{},"name":"address","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}}],"type":"struct"},"type":"array"}}],"type":"struct"}'

spark = SparkSession.builder.appName("json-un-nester").enableHiveSupport().getOrCreate()
sc = spark.sparkContext

schema = StructType.fromJson(json.loads(strschema))

datardd = sc.parallelize(rawdata)
data = spark.createDataFrame(datardd, schema=schema)

data.show()
data.select('familyMembers.name').show(truncate=False)
data.select('familyMembers.addressDetails.address').show(truncate=False)
Question from: https://stackoverflow.com/questions/66056251/pyspark-dataframes-why-can-i-select-some-nested-fields-but-not-others


1 Answer


To understand this, print the schema of the intermediate selection:

data.select('familyMembers.addressDetails').printSchema()

#root
# |-- familyMembers.addressDetails: array (nullable = true)
# |    |-- element: array (containsNull = true)
# |    |    |-- element: struct (containsNull = true)
# |    |    |    |-- addressType: string (nullable = true)
# |    |    |    |-- address: string (nullable = true)

As you can see, you now have an array of arrays of structs, which differs from the original schema: each element of familyMembers contributes its own addressDetails array, so selecting familyMembers.addressDetails yields one inner array per family member. Projecting a struct field through an array only works across a single array level, which is why familyMembers.name succeeded but this fails. You therefore can't access address directly from the root, but you can select an element of the nested array by index and then access the struct field address:

data.selectExpr("familyMembers.addressDetails[0].address").show(truncate=False)

#+--------------------------------------------------------------------------+
#|familyMembers.addressDetails AS addressDetails#29[0].address              |
#+--------------------------------------------------------------------------+
#|[29 Commonwealth St, Clifton, QLD 4361 , 20 A Yeo Ave, Highgate, SA 5063 ]|
#+--------------------------------------------------------------------------+

Or, equivalently, with the Column API (this assumes `from pyspark.sql import functions as F`):

data.select(F.col('familyMembers.addressDetails').getItem(0).getItem("address")).show(truncate=False)
