
python - pyspark corrupt_record while reading json file

I have a JSON file that can't be read by Spark with spark.read.json("xxx").show():

{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'}

The problem seems to be that None and False are not in quotes, and Spark can't default them to boolean, null, or even string.
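
For reference, here is a minimal sketch of what that failure looks like (the path "xxx" is the placeholder used above, and the file is assumed to hold one such dict per line): since no line parses as valid JSON, Spark's reader in its default PERMISSIVE mode infers a schema consisting only of the special _corrupt_record column and stores each raw line there, which matches the corrupt_record in the title.

# Reproduction sketch: every line is invalid JSON, so the inferred
# schema collapses to the single _corrupt_record column.
spark.read.json("xxx").show(truncate=False)

#+--------------------------------------------------------------------+
#|_corrupt_record                                                     |
#+--------------------------------------------------------------------+
#|{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'} |
#+--------------------------------------------------------------------+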

I tried giving spark.read an explicit schema instead of letting it infer one, forcing those two columns to string, and got the same error.

It feels to me like Spark tries to read the data first, then applies the schema, and fails in the reading step.
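
A sketch of that schema attempt (hypothetical, using the same placeholder path) is consistent with this: the JSON parser still runs first, so with an explicit schema and the default PERMISSIVE mode the malformed lines typically come back as all-null rows rather than as strings.

# Schema attempt sketch: forcing the columns to string doesn't help,
# because each line must parse as JSON before the schema is applied.
from pyspark.sql.types import StructType, StructField, StringType

attempt_schema = StructType([
    StructField("event_date_utc", StringType(), True),
    StructField("deleted", StringType(), True),
])
spark.read.schema(attempt_schema).json("xxx").show()

#+--------------+-------+
#|event_date_utc|deleted|
#+--------------+-------+
#|          null|   null|
#+--------------+-------+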

Is there a way to tell Spark to read those values without modifying the input data? I am using Python.

question from:https://stackoverflow.com/questions/66053285/pyspark-corrupt-record-while-reading-json-file


1 Answer


Your input isn't valid JSON, so you can't read it using spark.read.json. Instead, you can load it as a text DataFrame with spark.read.text and convert each stringified Python dict into JSON using a UDF:

import ast
import json
from pyspark.sql import functions as F
from pyspark.sql.types import *

schema = StructType([
    StructField("event_date_utc", StringType(), True),
    StructField("deleted", BooleanType(), True),
    StructField("cost", IntegerType(), True),
    StructField("name", StringType(), True)
])

# Parse each line as a Python literal and re-serialize it as valid
# JSON (None -> null, False -> false, single -> double quotes).
dict_to_json = F.udf(lambda x: json.dumps(ast.literal_eval(x)))

df = (spark.read.text("xxx")
      .withColumn("value", F.from_json(dict_to_json("value"), schema))
      .select("value.*"))

df.show()

#+--------------+-------+----+----+
#|event_date_utc|deleted|cost|name|
#+--------------+-------+----+----+
#|null          |false  |1   |Mike|
#+--------------+-------+----+----+
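
As a side note, here is a sketch of an alternative that skips the JSON round-trip (assuming, as above, one Python-literal dict per line): parse each line with ast.literal_eval on the RDD and let createDataFrame apply the same schema. ast.literal_eval only evaluates Python literals, so unlike eval it is safe to run on untrusted input.

# Alternative sketch: parse the Python-literal dicts directly and
# apply the schema defined above, without serializing back to JSON.
rows = spark.sparkContext.textFile("xxx").map(ast.literal_eval)
df2 = spark.createDataFrame(rows, schema)
df2.show()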
