Many algorithms require categorical features to be transformed (encoded), that is, a new column is added (or an existing one modified) based on the values of a particular column.

To make the examples concrete, they all use a small example DataFrame, df, with three columns: Sex, Course Name, and Score.
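The original table is not reproduced here, so the following is only a minimal stand-in. It assumes three columns named Sex, Course Name, and Score, in that order (the names and positions are inferred from the code in this article; the row values are invented for illustration).

```python
import pandas as pd

# Hypothetical sample data: only the column names and their order
# come from the examples in this article; the values are made up.
df = pd.DataFrame({
    'Sex': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Course Name': ['Python', 'Java', 'Python', 'SQL', 'Java', 'Python'],
    'Score': [95, 90, 83, 72, 60, 45],
})

print(df)
```

The scores deliberately include boundary values such as 90 and 60, which makes it easier to see how each method treats the edges of a grade band.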

Numerical data

Let's start with converting continuous data: add a new label column based on the value of the Score column. A score greater than 90 is labeled A, a score of 80-90 is labeled B, and so on.

1. Custom function + loop traversal

The first and simplest approach is to write a function and apply it in a loop, in other words a def plus a for.

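df1 = df.copy()

def myfun(x):
    # The checks run top to bottom, so each branch only needs a lower bound
    if x > 90:
        return 'A'
    elif x >= 80:  # 80-90, including exactly 90
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'E'

df1['Score_Label'] = None
# Look the column positions up by name instead of hard-coding 2 and 3
score_col = df1.columns.get_loc('Score')
label_col = df1.columns.get_loc('Score_Label')
for i in range(len(df1)):
    df1.iloc[i, label_col] = myfun(df1.iloc[i, score_col])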

This code is easy for anyone to follow. The approach is simple and intuitive, but it is verbose.

Is there a simpler way? Of course: pandas provides many functions for doing this more efficiently, so read on.

2. Custom function + map

Now you can use map to get rid of the explicit loop (although under the hood it is still essentially a loop).

df2 = df.copy()

def mapfun(x):
    if x > 90:
        return 'A'
    elif x >= 80:  # 80-90, including exactly 90
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'E'

df2['Score_Label'] = df2['Score'].map(mapfun)

The result is the same

3. Anonymous function + apply

If you want to keep the code even more compact, you can combine an anonymous function (lambda) with apply and do away with the named custom function.

df3 = df.copy()
df3['Score_Label'] = df3['Score'].apply(lambda x: 'A' if x > 90 else (
    'B' if x >= 80 else ('C' if x >= 70 else ('D' if x >= 60 else 'E'))))

4. Using pd.cut

Now let's move on to more advanced pandas functions. We again encode the score, this time using pd.cut with explicit bin edges, which assigns each value to its interval directly.

df4 = df.copy()
bins = [0, 59, 70, 80, 100]
df4['Score_Label'] = pd.cut(df4['Score'], bins)


You can also pass the labels parameter to name the groups directly, which is much more convenient.

df4['Score_Label_new'] = pd.cut(df4['Score'], bins, labels=[
                                'low', 'middle', 'good', 'perfect'])
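It can help to see how pd.cut treats values that fall exactly on a bin edge. The sketch below uses the same bins and labels on a few made-up scores. By default each interval is open on the left and closed on the right, so a value equal to the first edge is left unlabeled unless include_lowest=True is passed.

```python
import pandas as pd

scores = pd.Series([0, 59, 60, 70, 85, 100])  # made-up values
bins = [0, 59, 70, 80, 100]
labels = ['low', 'middle', 'good', 'perfect']

# Default intervals: (0, 59], (59, 70], (70, 80], (80, 100]
default_cut = pd.cut(scores, bins, labels=labels)

# include_lowest=True also keeps a score equal to the first edge (0)
inclusive_cut = pd.cut(scores, bins, labels=labels, include_lowest=True)

print(default_cut.tolist())    # the score 0 is NaN here
print(inclusive_cut.tolist())  # the score 0 is 'low' here
```

Whether 70 should count as 'middle' or 'good' is a design choice; passing right=False would close the intervals on the left instead.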


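5. Using sklearn binarization

If you need a new column that indicates whether a grade is passing or not, you can use sklearn's Binarizer; the code is short and easy to follow. Note that Binarizer maps values strictly greater than the threshold to 1, so with threshold = 60 a score of exactly 60 is labeled 0.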

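from sklearn.preprocessing import Binarizer

df5 = df.copy()
# Values > 60 become 1; values <= 60 become 0
binarizer = Binarizer(threshold=60)
# Binarizer expects 2-D input, hence the double brackets
trans = binarizer.fit_transform(df5[['Score']].to_numpy())
df5['Score_Label'] = trans.ravel()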
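For comparison, here is a pandas-only sketch of the same pass/fail flag. It is not the Binarizer approach itself but a plain boolean comparison, shown on made-up scores, and it makes explicit whether a score of exactly 60 counts as passing.

```python
import pandas as pd

scores = pd.DataFrame({'Score': [45, 59, 60, 61, 95]})  # made-up values

# Strictly greater than 60, matching what Binarizer(threshold=60) computes
scores['Above_60'] = (scores['Score'] > 60).astype(int)

# Greater than or equal to 60, if exactly 60 should count as passing
scores['Passed'] = (scores['Score'] >= 60).astype(int)

print(scores)
```

The two columns differ only in the row where the score is exactly 60.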


Text-based data

Next comes the more common case: converting text data into labels. For example, add a new column that tags the genders male and female as 0 and 1 respectively.

6. Using replace

We start with replace, but note that the custom-function methods described above still work here.

df6 = df.copy()
df6['Sex_Label'] = df6['Sex'].replace(['Male','Female'],[0,1])


The example above handles gender, which has only two values, male and female, so 0 and 1 can be assigned by hand. With many categories, you can use value_counts() to generate the labels automatically, for example for the Course Name column. Since value_counts() sorts by frequency, the most frequent category receives 0.

df6 = df.copy()
value = df6['Course Name'].value_counts()
value_map = dict((v, i) for i,v in enumerate(value.index))
df6['Course Name_Label'] = df6.replace({'Course Name':value_map})['Course Name']


7. Using map

It bears repeating: whenever you need to add a new column derived from an existing one, map should come to mind.

df7 = df.copy()
# Note: set() has no guaranteed order, so the codes may differ between runs
mapping = {elem: index for index, elem in enumerate(set(df7['Course Name']))}
df7['Course Name_Label'] = df7['Course Name'].map(mapping)
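One detail worth keeping in mind when mapping with a dictionary: values that have no key in the dictionary become NaN. The sketch below illustrates this with made-up data and a deliberately incomplete mapping; the sentinel value -1 is just one possible choice of fallback.

```python
import pandas as pd

courses = pd.Series(['Python', 'Java', 'SQL'])  # made-up values
mapping = {'Python': 0, 'Java': 1}              # 'SQL' is deliberately missing

mapped = courses.map(mapping)
print(mapped.tolist())  # 'SQL' has no key, so it maps to NaN

# One possible fallback: give unmapped categories a sentinel code
filled = mapped.fillna(-1).astype(int)
print(filled.tolist())
```

Building the dictionary from the column itself, as in the example above, avoids this problem because every value is guaranteed to have a key.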


8. Using astype

Many people may not know this method. It is one more illustration of how the same problem can be solved in many ways: convert the column to the category dtype and read off the category codes.

df8 = df.copy()
value = df8['Course Name'].astype('category')
df8['Course Name_Label'] = value.cat.codes
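To see which number belongs to which category, you can inspect cat.categories alongside cat.codes. The following sketch uses made-up course names; when the categories are inferred from the data like this, pandas sorts them, so the codes follow alphabetical order rather than order of appearance.

```python
import pandas as pd

courses = pd.Series(['Python', 'Java', 'SQL', 'Java'])  # made-up values
as_category = courses.astype('category')

print(list(as_category.cat.categories))  # inferred categories are sorted
print(as_category.cat.codes.tolist())    # each code indexes into the categories

# Map each code back to its original label
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)
```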


9. Using sklearn

As with numerical data, encoding categorical data is a classic machine-learning operation, so sklearn naturally offers a way to do it: LabelEncoder.

from sklearn.preprocessing import LabelEncoder
df9 = df.copy()
le = LabelEncoder()
le.fit(df9['Sex'])
df9['Sex_Label'] = le.transform(df9['Sex'])
le.fit(df9['Course Name'])
df9['Course Name_Label'] = le.transform(df9['Course Name'])


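It is also possible to convert two columns at once, using OrdinalEncoder.

from sklearn.preprocessing import OrdinalEncoder

df9 = df.copy()
oe = OrdinalEncoder()
oe.fit(df9[['Sex','Course Name']])
# Note: OrdinalEncoder returns the codes as floats (0.0, 1.0, ...)
df9[['Sex_Label','Course Name_Label']] = oe.transform(df9[['Sex','Course Name']])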

10. Using factorize

Finally, here is another lesser-known but useful pandas method. Note that in the methods above, the automatically generated Course Name_Label column does assign one number to each language, but the numbers follow no meaningful order. That is the price of generating them automatically instead of writing a custom function or dictionary.

If we want the codes to follow the order of the data, for example Python as 0 and Java as 1, is there an elegant way to do it other than specifying the mapping ourselves? This is where factorize comes in: it encodes values in their order of first appearance.

df10 = df.copy()
df10['Course Name_Label'] = pd.factorize(df10['Course Name'])[0]
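pd.factorize returns two things, the codes and the unique values, which is why the code above takes element [0]. A minimal sketch on made-up data shows both, along with the sort=True option for cases where sorted order is preferred over order of appearance.

```python
import pandas as pd

courses = pd.Series(['Python', 'Java', 'Python', 'SQL'])  # made-up values

codes, uniques = pd.factorize(courses)
print(codes.tolist())  # codes follow the order of first appearance
print(list(uniques))   # uniques[code] recovers the original label

# sort=True numbers the categories in sorted order instead
sorted_codes, sorted_uniques = pd.factorize(courses, sort=True)
print(sorted_codes.tolist())
print(list(sorted_uniques))
```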


Combined with an anonymous function, factorize can encode multiple columns at once, each in order of first appearance.

df10 = df.copy()
cat_columns = df10.select_dtypes(['object']).columns

df10[['Sex_Label', 'Course Name_Label']] = df10[cat_columns].apply(
    lambda x: pd.factorize(x)[0])

