Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Extract numbers, letters, or punctuation from left side of string column in Python

Say I have the following data frame, produced by OCR, whose company_info column contains numbers, letters, or punctuation followed by Chinese characters:

import io

import pandas as pd

data = '''
id,company_info
1, 05B01北京企商联登记注册代理事务所(通合伙)
2, Unit-D 608华夏启商(北京企业管理有限公司)
3, 1004-1005北京中睿智诚商业管理有限公司
4, 17/F(1706)北京美泰德商务咨询有限公司
5, A2006~A2007北京新曙光会计服务有限公司
6, 2906-10中国建筑与室内设计师网'''

# pd.compat.StringIO is not available in recent pandas; io.StringIO does the same job
df = pd.read_csv(io.StringIO(data), sep=',')

I want to extract the numbers, letters, and punctuation from the left side of each string into a new office_name column, and keep the rest in the company_info column. How can I do that in Python? Thanks.

The expected output is like this:

   id   office_name          company_info
0   1         05B01   北京企商联登记注册代理事务所(通合伙)
1   2    Unit-D 608      华夏启商(北京企业管理有限公司)
2   3     1004-1005        北京中睿智诚商业管理有限公司
3   4    17/F(1706)         北京美泰德商务咨询有限公司
4   5   A2006~A2007         北京新曙光会计服务有限公司
5   6       2906-10           中国建筑与室内设计师网

1 Answer

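Use Series.str.extract together with DataFrame.pop to split the column into two:

# group 1: a run of ASCII characters; group 2: everything from the first CJK character onward
pat = r'([\x00-\x7F]+)([\u4e00-\u9fff]+.*$)'
# pop removes the original column; the two capture groups become the new columns
df[['office_name','company_info']] = df.pop('company_info').str.extract(pat)
print(df)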
   id   office_name         company_info
0   1         05B01  北京企商联登记注册代理事务所(通合伙)
1   2    Unit-D 608     华夏启商(北京企业管理有限公司)
2   3     1004-1005       北京中睿智诚商业管理有限公司
3   4    17/F(1706)        北京美泰德商务咨询有限公司
4   5   A2006~A2007        北京新曙光会计服务有限公司
5   6       2906-10          中国建筑与室内设计师网
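
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. A few things in it are my own assumptions rather than part of the answer above: it reads the sample with io.StringIO, passes skipinitialspace=True to read_csv so the blank after each comma is dropped, anchors the pattern with ^, and strips stray whitespace from the captured office name. The variable name parts is just illustrative.

```python
import io

import pandas as pd

data = """id,company_info
1, 05B01北京企商联登记注册代理事务所(通合伙)
2, Unit-D 608华夏启商(北京企业管理有限公司)
3, 1004-1005北京中睿智诚商业管理有限公司
4, 17/F(1706)北京美泰德商务咨询有限公司
5, A2006~A2007北京新曙光会计服务有限公司
6, 2906-10中国建筑与室内设计师网"""

# Assumption: skipinitialspace=True removes the space that follows each comma
df = pd.read_csv(io.StringIO(data), skipinitialspace=True)

# Group 1: leading run of ASCII characters (digits, letters, punctuation, spaces)
# Group 2: the remainder, which must start with a CJK Unified Ideograph
pat = r'^([\x00-\x7F]+)([\u4e00-\u9fff].*)$'

# extract returns a DataFrame with one column per capture group (0 and 1);
# rows that do not match the pattern come back as NaN in both columns
parts = df.pop('company_info').str.extract(pat)

df['office_name'] = parts[0].str.strip()  # trim any stray whitespace
df['company_info'] = parts[1]

print(df)
```

Because the first group only accepts ASCII, it stops at the first Chinese character, so ASCII parentheses later in the company name stay in the second group. If some rows may have no ASCII prefix at all, changing the first group to ([\x00-\x7F]*) would be one way to keep them from turning into NaN, though I have not checked that against real OCR output.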
