Pandas: Parse merged header columns from Excel

The data in the Excel sheet is stored as follows:

   Area     |          Product1     |      Product2        |      Product3
            |      sales|sales.Value|   sales |sales.Value |  sales |sales.Value
  Location1 |    20     | 20000     |      25 |  10000     |   200  | 100
  Location2 |    30     | 30000     |      3  | 12300      |   213  | 10

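Each product name is a merged cell spanning the two cells beneath it in row 2, the number of sales and the sales value, and there are about 1,000 areas for a given month. I have a separate file like this for every month of the last 5 years. In addition, products were added or removed in different months, so another month's file may look like this: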

   Area     |          Product1     |      Product4        |      Product3

Can the forum suggest the best way to read this data using pandas? I can't rely on column positions because the product columns differ from month to month.

Ideally, I would like to convert the initial format above into the following:

 Area      | Product1.sales|Product1.sales.Value| Product2.sales |Product2.sales.Value | 
 Location1 | 20            | 20000              | 25             | 10000               |  
 Location2 | 30            | 30000              | 3              | 12300               |

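import pandas as pd

# Skip the first two rows, which are always blank. header=None keeps the
# merged header rows as ordinary rows, matching the output shown below.
xl_file = pd.read_excel("file path", skiprows=2, sheet_name=0, header=None)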


                  0            1        2               3                      4
      0          NaN          NaN      NaN       Auto loan                    NaN
      1  Branch Code  Branch Name   Region  No of accounts  Portfolio Outstanding
      2         3000       Name1  Central               0                      0
      3         3001       Name2  Central               0                      0

I would like to convert it so that the headers are Auto loan.No of accounts and Auto loan.Portfolio Outstanding.

Solution

==============================

1. Suppose your DataFrame is df:

import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([
    (nan, nan, nan, 'Auto loan', nan)
    , ('Branch Code', 'Branch Name', 'Region', 'No of accounts'
       , 'Portfolio Outstanding')
    , (3000, 'Name1', 'Central', 0, 0)
    , (3001, 'Name2', 'Central', 0, 0)
])

So it looks like this:

             0            1        2               3                      4
0          NaN          NaN      NaN       Auto loan                    NaN
1  Branch Code  Branch Name   Region  No of accounts  Portfolio Outstanding
2         3000       Name1  Central               0                      0
3         3001       Name2  Central               0                      0

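Then, as a first pass, forward-fill the NaNs in the first two rows from left to right (this propagates 'Auto loan' across the cells it was merged over):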

df.iloc[0:2] = df.iloc[0:2].ffill(axis=1)

Next, fill the remaining NaNs with empty strings:

df.iloc[0:2] = df.iloc[0:2].fillna('')

Now join the two rows together with a dot, skipping empty pieces, and assign the result as the column names:

df.columns = df.iloc[0:2].apply(lambda x: '.'.join([y for y in x if y]), axis=0)

Finally, remove the first two rows:

df = df.iloc[2:]

This results in:

  Branch Code Branch Name   Region Auto loan.No of accounts  \
2        3000      Name1  Central                        0   
3        3001      Name2  Central                        0   

  Auto loan.Portfolio Outstanding  
2                               0  
3                               0
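The same steps can be wrapped in a small helper and tried on a frame shaped like the question's product sheet. This is only a minimal sketch: `flatten_merged_header`, `n_header_rows`, and `sep` are hypothetical names introduced here, and the sample rows are copied from the question's table (two products only).

```python
import numpy as np
import pandas as pd


def flatten_merged_header(raw, n_header_rows=2, sep="."):
    """Hypothetical helper: collapse the top header rows into dotted names."""
    # Forward-fill along each row so a merged cell's label covers its span,
    # then replace whatever is still missing with an empty string.
    header = raw.iloc[:n_header_rows].ffill(axis=1).fillna("")
    # Join the non-empty pieces of each column from top to bottom.
    names = [
        sep.join(str(piece) for piece in header[col] if piece != "")
        for col in header.columns
    ]
    body = raw.iloc[n_header_rows:].reset_index(drop=True)
    body.columns = names
    return body


# Raw sheet as read with header=None: merged cells show up as NaN.
raw = pd.DataFrame([
    ["Area", "Product1", np.nan, "Product2", np.nan],
    [np.nan, "sales", "sales.Value", "sales", "sales.Value"],
    ["Location1", 20, 20000, 25, 10000],
    ["Location2", 30, 30000, 3, 12300],
])

flat = flatten_merged_header(raw)
print(flat.columns.tolist())
```

With this input the column names should come out as Area, Product1.sales, Product1.sales.Value, Product2.sales, Product2.sales.Value, which is the layout the question asks for.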

Alternatively, instead of building a flat column index, you can create MultiIndex columns:

import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([
    (nan, nan, nan, 'Auto loan', nan)
    , ('Branch Code', 'Branch Name', 'Region', 'No of accounts'
       , 'Portfolio Outstanding')
    , (3000, 'Name1', 'Central', 0, 0)
    , (3001, 'Name2', 'Central', 0, 0)
])
df.iloc[0:2] = df.iloc[0:2].ffill(axis=1)
df.iloc[0:2] = df.iloc[0:2].fillna('Area')

# Pair each top-level label with its sub-header to form the column tuples.
df.columns = pd.MultiIndex.from_tuples(
    list(zip(*df.iloc[0:2].to_records(index=False).tolist())))
df = df.iloc[2:]

Now df looks like this:

         Area                           Auto loan                      
  Branch Code Branch Name   Region No of accounts Portfolio Outstanding
2        3000      Name1  Central              0                     0
3        3001      Name2  Central              0                     0

The columns are a MultiIndex:

In [275]: df.columns
Out[275]: 
MultiIndex(levels=[[u'Area', u'Auto loan'], [u'Branch Code', u'Branch Name', u'No of accounts', u'Portfolio Outstanding', u'Region']],
           labels=[[0, 0, 0, 1, 1], [0, 1, 4, 2, 3]])

The columns have two levels. The first level is [u'Area', u'Auto loan'], and the second level is [u'Branch Code', u'Branch Name', u'No of accounts', u'Portfolio Outstanding', u'Region'].
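To check the two levels directly, a minimal sketch follows. It rebuilds the same columns with `pd.MultiIndex.from_arrays`, which is a swapped-in construction for illustration rather than the `from_tuples` call used above; `top`, `bottom`, and `demo` are names introduced here.

```python
import pandas as pd

# One label per column for each header row.
top = ["Area", "Area", "Area", "Auto loan", "Auto loan"]
bottom = ["Branch Code", "Branch Name", "Region",
          "No of accounts", "Portfolio Outstanding"]

columns = pd.MultiIndex.from_arrays([top, bottom])
demo = pd.DataFrame(
    [[3000, "Name1", "Central", 0, 0],
     [3001, "Name2", "Central", 0, 0]],
    columns=columns,
)

# Number of header levels, then the labels found on each level.
print(demo.columns.nlevels)
print(demo.columns.get_level_values(0).unique().tolist())
print(demo.columns.get_level_values(1).tolist())
```

`get_level_values` returns the labels in column order, whereas the `levels` attribute shown above lists each level's distinct labels in sorted order.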

You can then access a column by specifying a value for both levels:

print(df.loc[:, ('Area', 'Branch Name')])
# 2    Name1
# 3    Name2
# Name: (Area, Branch Name), dtype: object

print(df.loc[:, ('Auto loan', 'No of accounts')])
# 2    0
# 3    0
# Name: (Auto loan, No of accounts), dtype: object

One advantage of using a MultiIndex is that you can easily select all columns that share a particular level value. For example, to select the sub-DataFrame of columns related to Auto loan, you can use:

In [279]: df.loc[:, 'Auto loan']
Out[279]: 
  No of accounts Portfolio Outstanding
2              0                     0
3              0                     0
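Selection by the second level works in a similar way. The sketch below assumes a frame shaped like the question's product sheet, with Area as the row index; `monthly`, `sales_only`, and `flat` are names introduced here. It uses `DataFrame.xs` to pull one sub-header across all products, then flattens the MultiIndex back to dotted names.

```python
import pandas as pd

# Product sheet from the question, with (product, measure) column pairs.
monthly = pd.DataFrame(
    [[20, 20000, 25, 10000],
     [30, 30000, 3, 12300]],
    index=pd.Index(["Location1", "Location2"], name="Area"),
    columns=pd.MultiIndex.from_tuples([
        ("Product1", "sales"), ("Product1", "sales.Value"),
        ("Product2", "sales"), ("Product2", "sales.Value"),
    ]),
)

# Cross-section on the second level: the "sales" column of every product.
sales_only = monthly.xs("sales", axis=1, level=1)
print(sales_only)

# Flatten the two levels into dotted names when a flat header is needed.
flat = monthly.copy()
flat.columns = [".".join(pair) for pair in monthly.columns]
print(flat.columns.tolist())
```

Because the selection goes by label, it does not depend on which products happen to be present in a given file.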

For more on selecting rows and columns from a MultiIndex, see "MultiIndexing using slicers" in the pandas documentation.
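Returning to the question's concern that products differ between monthly files: once each file has named columns, `pd.concat` aligns the frames by column name rather than by position. The following is a sketch under stated assumptions. The month labels "month1" and "month2" and all values in the second frame are made up for illustration; only the first frame's numbers come from the question.

```python
import pandas as pd

# First month: values taken from the question's sample.
month1 = pd.DataFrame({
    "Area": ["Location1", "Location2"],
    "Product1.sales": [20, 30],
    "Product2.sales": [25, 3],
}).set_index("Area")

# Second month: hypothetical values; Product2 is gone, Product4 is new.
month2 = pd.DataFrame({
    "Area": ["Location1", "Location2"],
    "Product1.sales": [22, 31],
    "Product4.sales": [7, 9],
}).set_index("Area")

# The dict keys become an outer row level; columns are matched by name.
combined = pd.concat(
    {"month1": month1, "month2": month2},
    names=["month", "Area"],
    sort=False,
)
print(combined)
```

A product that is missing from a given month shows up as NaN in that month's rows, so no positional bookkeeping is needed when products are added or removed.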

from https://stackoverflow.com/questions/27420263/pandas-parse-merged-header-columns-from-excel by cc-by-sa and MIT license
