Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Add number to a string before the last character in the string using regex in pyspark

I need to insert the same number before the last character of a string (held in a column of a Spark DataFrame) using PySpark. For example, given the strings 2020_week4 or 2021_week5, I need to add a zero in front of the 4 and the 5, like so: 2020_week04 or 2021_week05. The larger context is that the replacement is conditional: it applies only to single-digit weeks. So something along the lines of:

df.withColumn('week', when(len(col("week")) == 10, regexp_replace(week, REGEX_PATTERN, "0")).otherwise(col("week")))

Note that the week column is always 10 characters long for the single-digit strings that need replacing.

Per @thefourthbird's suggestion regarding the regex, I tried the following:

df1.withColumn('week', when(len(col("week")) == 10, regexp_replace(week, "^\d{4}_week(?=\d$)", "$00")).otherwise(col("week")))

The error I'm getting has nothing to do with the regex itself, but rather with how to apply a regex in PySpark in general. Error:

TypeError: object of type 'Column' has no len()

I also tried:

import pyspark.sql.functions as F

df1.withColumn('week', when(F.length("week") == 10, regexp_replace(week, "^\d{4}_week(?=\d$)", "$00")).otherwise(col("week")))

Error:

NameError: name 'week' is not defined

UPDATE:

df10.withColumn('week', when(length(col('week')) == 10, regexp_replace("week", "(?<=k)(?=\d$)", "0")).otherwise(col("week")))
Question from: https://stackoverflow.com/questions/65893437/add-number-to-a-string-before-the-last-character-in-the-string-using-regex-in-py
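
To see why the pattern in the UPDATE inserts a zero only before a single trailing digit, here is a minimal sketch that uses Python's built-in re module instead of Spark. It assumes that re and the Java regex engine behind regexp_replace treat this simple lookbehind/lookahead pair the same way; the function name pad_single_digit_week and the 2020_week12 sample are hypothetical and not part of the question.

```python
import re

# Zero-width match: the position right after a literal "k" and right
# before a single digit that ends the string. Nothing is consumed, so
# substituting "0" inserts the zero without removing any characters.
# Assumption: Spark's Java regex engine behaves the same for this pattern.
PATTERN = re.compile(r"(?<=k)(?=\d$)")


def pad_single_digit_week(value: str) -> str:
    """Insert "0" before the last character when exactly one digit follows "k"."""
    return PATTERN.sub("0", value)


for sample in ["2020_week4", "2021_week5", "2020_week12"]:
    print(sample, "->", pad_single_digit_week(sample))
```

Because the lookahead requires the digit to be the last character, a two-digit week such as 2020_week12 does not match and is left unchanged, so with this particular pattern the length check may be redundant.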


1 Answer


You can use the substring and concat functions, which work for any string (no regex needed):

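from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark` (as in the pyspark shell).
df = spark.createDataFrame([("2020_week4",), ("2021_week5",)], ["value"])

df.withColumn(
    "value",
    F.concat(
        # Everything except the last character
        F.expr("substring(value, 1, length(value)-1)"),
        # The zero to insert
        F.lit('0'),
        # The last character (a negative position counts from the end)
        F.substring("value", -1, 1)
    )
).show()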

#+-----------+
#|      value|
#+-----------+
#|2020_week04|
#|2021_week05|
#+-----------+
