Merging multiple columns of data in Python with pandas

Date: 2023-05-04 15:26:47 Views: 232844 Author: 100

I want to merge two columns of the same DataFrame (start_end) into one and drop the null values. Specifically, I want to merge 'Start station' and 'End station' into a single 'station' column and keep 'Duration' aligned with the new column. I have tried pd.merge, pd.concat, and pd.append, but I cannot get it to work.

The start_end DataFrame:

    Duration      End station                         Start station
14      1407              NaN                        14th & V St NW
19       509              NaN                        21st & I St NW
20       638  15th & P St NW.                                   NaN
27      1532              NaN  Massachusetts Ave & Dupont Circle NW
28       759              NaN           Adams Mill & Columbia Rd NW

Expected output:

    Duration                              stations
14      1407                        14th & V St NW
19       509                        21st & I St NW
20       638                        15th & P St NW
27      1532  Massachusetts Ave & Dupont Circle NW
28       759           Adams Mill & Columbia Rd NW

Code I have so far:

# start_end is the DataFrame with columns 'Start station', 'End station', 'Duration'
start_end = pd.concat([df_start, df_end])

This is what I attempted:

station = pd.merge([start_end['Start station'], start_end['End station']])

Solution

>>> df
   Duration      End station                         Start station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW
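
To follow the session locally, the frame has to be built first. A minimal sketch, assuming the values are exactly those in the printed frame above; the constructor call itself is not from the original post.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the printed output.
# The station strings are read off the printout; NaN marks the missing side.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    }
)
print(df)
```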

Give the two columns the same name. (The pattern is a regular expression, so regex=True is passed explicitly; recent pandas versions treat the pattern as a literal string by default.)

>>> df.columns = df.columns.str.replace('.*?station', 'station', regex=True)
>>> df
   Duration          station                               station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW
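
The pattern '.*?station' lazily matches everything up to and including the first 'station', so both 'End station' and 'Start station' collapse to 'station' while 'Duration' is left alone. A minimal sketch of the same substitution with the standard re module; the cols list is only a stand-in for df.columns and is not part of the original answer.

```python
import re

# Stand-in for df.columns in the session above.
cols = ["Duration", "End station", "Start station"]

# Replace the lazy match (anything up to and including 'station') with 'station'.
renamed = [re.sub(r".*?station", "station", c) for c in cols]
print(renamed)
```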

Stack, then unstack.

>>> s = df.stack()
>>> s
0  Duration                                    1407
   station                           14th & V St NW
1  Duration                                     509
   station                           21st & I St NW
2  Duration                                     638
   station                          15th & P St NW.
3  Duration                                    1532
   station     Massachusetts Ave & Dupont Circle NW
4  Duration                                     759
   station              Adams Mill & Columbia Rd NW
dtype: object
>>> df = s.unstack()
>>> df
   Duration                               station
0      1407                        14th & V St NW
1       509                        21st & I St NW
2       638                       15th & P St NW.
3      1532  Massachusetts Ave & Dupont Circle NW
4       759           Adams Mill & Columbia Rd NW

This is how I think it works:

.stack() creates a Series with a MultiIndex and drops the null values for you. The second index level is built from the column names, and because both station columns now share one name, there is only one 'station' label at that level, so unstacking produces a single station column.

That is really just a guess, based on how the MultiIndex differs when you don't change the column names.

>>> # without changing column names
>>> s.index
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 2, 0, 2, 0, 1, 0, 2, 0, 2]])
>>> # column names the same
>>> s.index
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]])

Seems a bit tricky, maybe someone will comment on it.
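
If the stack/unstack behaviour feels too implicit, Series.combine_first gives the same result more directly: it keeps each 'Start station' value and fills the gaps from 'End station'. This is a different technique from the one above, sketched under the assumption that every row has at least one of the two stations. It also does not depend on stack dropping nulls, which, as far as I know, the newer stack implementation in pandas (future_stack=True) no longer does.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the printed output (the constructor is an assumption).
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    }
)

# Keep 'Start station' where present, otherwise take 'End station'.
station = df["Start station"].combine_first(df["End station"])

# Pair the merged column with 'Duration'; both share the same row index.
result = pd.DataFrame({"Duration": df["Duration"], "station": station})
print(result)
```

Unlike the renaming trick, this leaves the original column names untouched, so df can still be used afterwards.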

Alternative: using pd.concat and .dropna

>>> stations = pd.concat([df.iloc[:,1], df.iloc[:,2]]).dropna()
>>> stations.name = 'stations'
>>> stations
2                         15th & P St NW.
0                          14th & V St NW
1                          21st & I St NW
3    Massachusetts Ave & Dupont Circle NW
4             Adams Mill & Columbia Rd NW
Name: stations, dtype: object
>>> df2 = pd.concat([df['Duration'], stations], axis=1)
>>> df2
   Duration                              stations
0      1407                        14th & V St NW
1       509                        21st & I St NW
2       638                       15th & P St NW.
3      1532  Massachusetts Ave & Dupont Circle NW
4       759           Adams Mill & Columbia Rd NW