How do I create test and train samples from one DataFrame with pandas?

I have a fairly large dataset in the form of a DataFrame, and I was wondering how I could split it into two random samples (80% and 20%) for training and testing.

Thanks!

Solutions

==============================

1. I would just use numpy's rand to build a random boolean mask:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And to check that this worked:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
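
Because each row is assigned independently, the mask gives only an approximate 80/20 split, and the result changes on every run. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator is available (NumPy 1.17 or later); the seed value, column names, and row count are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 1,000 rows, two random columns.
rng = np.random.default_rng(seed=42)  # the seed value is arbitrary
df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=["a", "b"])

# Each row independently lands in train with probability 0.8,
# so the split sizes are close to, but not exactly, 800 and 200.
msk = rng.random(len(df)) < 0.8
train = df[msk]
test = df[~msk]

print(len(train), len(test))
```

Every row ends up in exactly one of the two frames, since `~msk` is the complement of `msk`.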

==============================

2. scikit-learn's train_test_split is a good one:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
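
If the split needs to be repeatable, `train_test_split` accepts a `random_state` argument. A small sketch under assumed data (the DataFrame contents and the seed `0` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical example data.
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) % 2})

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
train, test = train_test_split(df, test_size=0.2, random_state=0)
train_again, test_again = train_test_split(df, test_size=0.2, random_state=0)

print(len(train), len(test))
print(train.equals(train_again))
```

With 100 rows and `test_size=0.2`, the sizes are exact (80 and 20), unlike the random-mask approach.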

==============================
3. pandas' random sample also works:
```
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
```
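
One caveat: `df.drop(train.index)` removes rows by index label, so it relies on the index being unique. A sketch of one way to guard against that, assuming it is acceptable to discard the old index (the example frame is hypothetical):

```python
import pandas as pd

# Hypothetical frame whose index contains duplicate labels,
# e.g. after concatenating two frames without ignore_index=True.
df = pd.DataFrame({"x": range(10)}, index=[0, 1, 2, 3, 4] * 2)

# Resetting the index gives every row a unique label,
# so drop() removes exactly the sampled rows and nothing else.
df = df.reset_index(drop=True)
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

print(len(train), len(test))
```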
==============================
4. Use scikit-learn's own train_test_split and generate the split from the index:
```
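from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20

# Separate the target column from the features
y = df.pop('output')
X = df

# Split the index labels rather than the rows themselves
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.loc[X_train]  # training rows as a DataFrame (X_train holds index labels, so use .loc)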
```
==============================
5. You can use the code below to create test and train samples:
```
from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)
```
test_size can vary depending on the fraction of data you want to put in your test and train datasets.
==============================
6. There are many valid answers. Adding one more to the bunch, using pandas' sample:
```
#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~X.index.isin(X_train.index)]
```

==============================

7. You may also consider a stratified split into training and testing sets. A stratified split also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
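
The same idea can be sketched with pandas alone, assuming pandas 1.1 or later, where `GroupBy.sample` is available. This is an alternative to the function above, not the answer's own method; the column names and class sizes are hypothetical:

```python
import pandas as pd

# Hypothetical data: 70 rows of class "a" and 30 rows of class "b".
df = pd.DataFrame({
    "feature": range(100),
    "label": ["a"] * 70 + ["b"] * 30,
})

# Sample 70% of the rows within each class, then use the rest as the test set.
train = df.groupby("label").sample(frac=0.7, random_state=0)
test = df.drop(train.index)

print(train["label"].value_counts().to_dict())
print(test["label"].value_counts().to_dict())
```

Both frames keep the original 70:30 class ratio, because the sampling fraction is applied per class rather than to the frame as a whole.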

==============================
8. This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).
```
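import random as rnd

def make_sets(data_df, test_portion):
    # Row positions 0..n-1, so the split works for any index
    tot_ix = range(len(data_df))
    # Draw exactly int(test_portion * n) positions without replacement
    test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    # Everything not drawn for the test set goes to training
    train_ix = sorted(set(tot_ix) - set(test_ix))

    # .ix was removed in pandas 1.0; .iloc selects by position
    test_df = data_df.iloc[test_ix]
    train_df = data_df.iloc[train_ix]

    return train_df, test_df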


train_df, test_df = make_sets(data_df, 0.2)
test_df.head()
```
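
An exact-size split can also be sketched with a NumPy permutation of the row positions. The function name `split_exact` and its `seed` parameter are hypothetical, introduced only for this illustration:

```python
import numpy as np
import pandas as pd

def split_exact(data_df, test_portion, seed=None):
    """Split by position so the test set has exactly int(test_portion * n) rows."""
    rng = np.random.default_rng(seed)
    positions = rng.permutation(len(data_df))  # shuffled row positions
    n_test = int(test_portion * len(data_df))
    # Sort each half so rows keep their original relative order.
    test_df = data_df.iloc[np.sort(positions[:n_test])]
    train_df = data_df.iloc[np.sort(positions[n_test:])]
    return train_df, test_df

# Hypothetical data with 101 rows, so 20% does not divide evenly.
data_df = pd.DataFrame({"x": range(101)})
train_df, test_df = split_exact(data_df, 0.2, seed=1)
print(len(train_df), len(test_df))  # 81 20
```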
==============================
9. Just select a range of rows from your df like this:
```
row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]
```
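
Plain slicing keeps rows in their stored order, so if the frame is sorted (for example by label), the test set may end up with only one class. A sketch that shuffles first, assuming a random split is what is wanted; the example data and seed are made up:

```python
import pandas as pd

# Hypothetical frame that is sorted by label.
df = pd.DataFrame({"x": range(50), "label": [0] * 25 + [1] * 25})

# sample(frac=1) returns all rows in random order; reset the index afterwards
# so positional slices line up with the new order.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

split_point = len(shuffled) // 5
test_data, train_data = shuffled[:split_point], shuffled[split_point:]

print(len(test_data), len(train_data))  # 10 40
```

For time-ordered data the unshuffled slice may be exactly what is needed, since it keeps the test rows separate in time from the training rows.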
==============================
10. If your wish is to have one DataFrame in and two DataFrames out (not numpy arrays), this should do the trick:
```
def split_data(df, train_perc = 0.8):

   df['train'] = np.random.rand(len(df)) < train_perc

   train = df[df.train == 1]

   test = df[df.train == 0]

   split_data ={'train': train, 'test': test}

   return split_data
```
==============================
11. If you want to add columns later, make copies of the resulting DataFrames:
```
msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)
```
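
A brief sketch of why the copy matters: a column added to the copied split stays local to it and leaves the original frame untouched. The data and the `x_squared` column are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
df = pd.DataFrame({"x": np.arange(10)})
msk = rng.random(len(df)) < 0.8

# An explicit copy makes train an independent DataFrame.
train = df[msk].copy()
train["x_squared"] = train["x"] ** 2

print("x_squared" in train.columns)  # True
print("x_squared" in df.columns)     # False
```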
==============================
12. You can use df.to_numpy() (formerly df.as_matrix()) to generate a NumPy array and pass it:
```
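from sklearn.model_selection import train_test_split

# label_column is a placeholder for your target column's name;
# the original called df.pop() without naming one
Y = df.pop(label_column)
X = df.to_numpy()  # as_matrix() was removed in pandas 1.0
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
# model is assumed to be any estimator with fit/predict
model.fit(x_train, y_train)
model.predict(x_test)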
```

==============================

13. How about this? df is my DataFrame.

import math

total_size = len(df)

# roughly 2/3 of my dataset goes to training
train_size = math.floor(0.66 * total_size)

# training dataset
train = df.head(train_size)
# test dataset
test = df.tail(len(df) - train_size)

==============================

14. If you need to split your data with respect to the label column in your data set, you can use this:

import pandas as pd

def split_to_train_test(df, label_column, train_frac=0.8):
    train_parts, test_parts = [], []
    labels = df[label_column].unique()
    for lbl in labels:
        # Sample train_frac of the rows for this label; the rest go to test
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
        train_parts.append(lbl_train_df)
        test_parts.append(lbl_test_df)

    # DataFrame.append was removed in pandas 2.0; concatenate once at the end
    return pd.concat(train_parts), pd.concat(test_parts)

Usage:

train, test = split_to_train_test(data, 'class', 0.7)

You can also pass random_state to sample if you want to control the split randomness or use a global random seed.

==============================
15. To split into more than two classes, such as train, test, and validation, you can do:
```
probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs>=0.7) & (probs < 0.85)
validation_mask = probs >= 0.85


df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validation_mask]
```
This puts roughly 70% of the data in training, 15% in test, and 15% in validation.
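
Since the masks are random, the three sizes vary from run to run. If exact sizes matter, one option is to shuffle once and slice by position. A sketch under assumed data, using integer arithmetic for the cut points to avoid floating-point rounding surprises:

```python
import pandas as pd

# Hypothetical data with 200 rows.
df = pd.DataFrame({"x": range(200)})

# Shuffle once, then cut at the 70% and 85% marks.
shuffled = df.sample(frac=1, random_state=0)
n = len(shuffled)
train_end = n * 70 // 100
test_end = n * 85 // 100

df_training = shuffled.iloc[:train_end]
df_test = shuffled.iloc[train_end:test_end]
df_validation = shuffled.iloc[test_end:]

print(len(df_training), len(df_test), len(df_validation))  # 140 30 30
```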

==============================

16.

import pandas as pd

from sklearn.model_selection import train_test_split

datafile_name = 'path_to_data_file'

data = pd.read_csv(datafile_name)

target_attribute = data['column_name']

# test_size is the fraction held out for testing; 0.2 gives the 80/20 split asked for
X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.2)

from https://stackoverflow.com/questions/24147278/how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas by cc-by-sa and MIT license
