ml_mt_note

August 22, 2024

by Athicha Leksansern

1. Data Preprocessing
2. Distance-based Classification
3. Tree-based Classification

1. Data Preprocessing

Training Dataset

$X_n = \{ X_{n1}, X_{n2}, ..., X_{nD} \}$ is the $n$-th row of the Feature Matrix; $Y = \{ y_1, y_2, ..., y_N \}$ is the Label Vector.

$D$ is the number of features; $N$ is the number of samples (rows) in the dataset.

Feature Engineering

Features fall into two types:

Continuous numeric
Categorical features

Data Preprocessing

For example:

Handling missing values
- Drop the affected rows
- Fill in the mean, an "unknown" placeholder, or the closest plausible value
- Fill in the most frequent value (categorical features)
Outlier Detection / Removal (abnormal values)
- Visually: with a scatter plot
- Statistically: flag values more than 2 or 3 standard deviations (SD) away from the mean
$S.D. = \sqrt{\dfrac{\sum{(x_i - \bar{x})^2}}{n - 1}}$
Scaling / Normalization (Data smoothing)
- Binning: split the values into groups (bins), then replace each value with its bin's mean, boundaries, or median
- Normalization
  - Decimal scaling: divide by $10^j$, using the smallest $j$ for which the largest absolute value divided by $10^j$ is $\le 1$
$v' = \dfrac{v}{10^j}$
  - Min-max normalization (Mapping)
$v' = \dfrac{v - min_{old}}{max_{old} - min_{old}}(max_{new} - min_{new}) + min_{new}$
  - Z-score normalization (Standardization): most values land in roughly $[-3, 3]$
$v' = \dfrac{v - \bar{v}}{S.D.}$
Feature Selection
Class imbalance
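
The preprocessing steps listed above can be sketched in plain Python. This is only a minimal illustration: the function names and the sample values are made up for this note, and `None` is assumed to mark a missing value.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def outliers_by_sd(values, k=2.0):
    """Return the values lying more than k sample SDs away from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD: divides by n - 1
    return [v for v in values if abs(v - mean) > k * sd]

def decimal_scaling(values):
    """Divide by the smallest 10**j that brings max |v| down to at most 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j > 1:
        j += 1
    return [v / 10 ** j for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Map [min, max] of the data linearly onto [new_min, new_max].

    Assumes the values are not all identical (otherwise the range is zero).
    """
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Centre on the mean and divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

ages = [29, 45, 81, 22, 41, 33, 22, None, 33, 45]  # illustrative values
filled = impute_mean(ages)
print(filled[7])                     # mean of the nine observed ages
print(outliers_by_sd(filled, k=2))   # values beyond 2 SD from the mean
print(decimal_scaling([40000, 36000, 98000]))
print(min_max(filled))
```

In practice a library such as scikit-learn offers ready-made scalers; the hand-written versions are here only to mirror the formulas.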

Hot-deck imputation (Closest-fit algorithm)

Replace the missing value with the value taken from the closest (most similar) record.

\begin{align} distance(x, y) & = \sum_{i=1}^{n} \color{red}{distance(a_i(x), a_i(y))} \\ \color{red}{distance(a_i(x), a_i(y))} & = \begin{cases} 0,& \text{if } a_i(x) = a_i(y)\\ 1,& \text{if } a_i(x) \neq a_i(y) \text{ and the attribute is categorical or a value is missing}\\ dist_{eud}(a_i(x), a_i(y)),& \text{if } a_i(x) \text{ and } a_i(y) \text{ are both numeric}\\ \end{cases} \end{align}

Example

Subject	Age	Income	Gender	$distance(x, {\color{red}8})$
1	29	$40,000	M	$1 + \lvert 40000-81000 \rvert + 1 = 41002$
2	45	$36,000	M	$1 + \lvert 36000-81000 \rvert + 1 = 45002$
3	81		M	$1 + \lvert 81000 \rvert + 1 = 81002$
4	22	$16,000		$1 + \lvert 16000-81000 \rvert + 1 = 65002$
5	41	$98,000	M	$1 + \lvert 98000-81000 \rvert + 1 = 17002$
6	33	$60,000	F	$1 + \lvert 60000-81000 \rvert + 0 = 21001$
7	22	$24,000	F	$1 + \lvert 24000-81000 \rvert + 0 = 57001$
8	$\color{blue}45$	$81,000	F	-
9	33	$55,000	F	$1 + \lvert 55000-81000 \rvert + 0 = 26001$
10	45	$80,000	F	$\color{blue} 1 + \lvert 80000-81000 \rvert + 0 = 1001$

Subject 10 has the smallest distance to subject 8, so the missing age of subject 8 is filled with subject 10's age, 45.
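
A rough sketch of the closest-fit search on the table above. It assumes that `None` marks a missing value, that a missing value always contributes a distance of 1, and that only complete records may act as donors (so subjects 3 and 4 are skipped); these choices and all function names are assumptions made for the sketch.

```python
# None marks a missing value; columns are (age, income, gender).
records = {
    1: (29, 40000, "M"),
    2: (45, 36000, "M"),
    3: (81, None, "M"),
    4: (22, 16000, None),
    5: (41, 98000, "M"),
    6: (33, 60000, "F"),
    7: (22, 24000, "F"),
    8: (None, 81000, "F"),   # age is the value to impute
    9: (33, 55000, "F"),
    10: (45, 80000, "F"),
}

def attribute_distance(a, b):
    if a is None or b is None:
        return 1                      # a missing value never counts as a match
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)             # numeric: absolute difference
    return 0 if a == b else 1         # categorical: 0 if equal, else 1

def distance(x, y):
    """Sum the per-attribute distances of two records."""
    return sum(attribute_distance(a, b) for a, b in zip(x, y))

def closest_fit(records, target_id, attr_index):
    """Return (donor id, donated value) for the record nearest to the target."""
    target = records[target_id]
    # Assumption: only complete records may act as donors.
    donors = {sid: row for sid, row in records.items()
              if sid != target_id and None not in row}
    best = min(donors, key=lambda sid: distance(donors[sid], target))
    return best, donors[best][attr_index]

donor_id, imputed_age = closest_fit(records, target_id=8, attr_index=0)
print(donor_id, imputed_age)  # 10 45
```

Because income is not rescaled, it dominates the distance; normalizing numeric attributes first would give age and gender more influence.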

The Curse of Dimensionality

As the number of features grows, the data becomes thinly scattered across the feature space.
There are two ways to address this:
- Feature Selection: keep only the useful features
- Feature Aggregation: combine several features into one

Pearson Correlation (Correlation coefficient)

Measures the linear relationship between two numeric variables.

r = \dfrac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum(X_i - \bar{X})^2\sum(Y_i - \bar{Y})^2}}
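
As a small illustration of the formula, the coefficient can be computed directly from the sums. The function name and the sample lists are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # perfectly increasing relationship
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # perfectly decreasing relationship
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.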

Mutual Information

Measures how strongly a feature is related to the label.
High MI: strongly related
Low MI: weakly related

I(X;Y) = \sum_{y \in Y}\sum_{x \in X}p(x,y)log_2\left(\dfrac{p(x,y)}{p(x)p(y)}\right)

Example. Compute the MI between Outlook and Temperature.

\begin{align} I(\text{Outlook};\text{Temperature}) = & \sum_{y \in \text{Temperature}}\sum_{x \in \text{Outlook}}p(x,y)log_2\left(\dfrac{p(x,y)}{p(x)p(y)}\right) \\ = & \sum_{x \in \text{Outlook}}p(x,\text{\color{red}Hot})log_2\left(\dfrac{p(x,\text{\color{red}Hot})}{p(x)p(\text{\color{red}Hot})}\right) + \\ & \sum_{x \in \text{Outlook}}p(x,\text{\color{green}Cold})log_2\left(\dfrac{p(x,\text{\color{green}Cold})}{p(x)p(\text{\color{green}Cold})}\right) + \\ & \sum_{x \in \text{Outlook}}p(x,\text{\color{blue}Mild})log_2\left(\dfrac{p(x,\text{\color{blue}Mild})}{p(x)p(\text{\color{blue}Mild})}\right) \\ = & \left[ p(\text{\color{orange}Sunny},\text{\color{red}Hot})log_2\left(\dfrac{p(\text{\color{orange}Sunny},\text{\color{red}Hot})}{p(\text{\color{orange}Sunny})p(\text{\color{red}Hot})}\right) + p(\text{\color{purple}Overcast},\text{\color{red}Hot})log_2\left(\dfrac{p(\text{\color{purple}Overcast},\text{\color{red}Hot})}{p(\text{\color{purple}Overcast})p(\text{\color{red}Hot})}\right) + p(\text{\color{lime}Rainy},\text{\color{red}Hot})log_2\left(\dfrac{p(\text{\color{lime}Rainy},\text{\color{red}Hot})}{p(\text{\color{lime}Rainy})p(\text{\color{red}Hot})}\right) \right] + \\ & \left[ p(\text{\color{orange}Sunny},\text{\color{green}Cold})log_2\left(\dfrac{p(\text{\color{orange}Sunny},\text{\color{green}Cold})}{p(\text{\color{orange}Sunny})p(\text{\color{green}Cold})}\right) + p(\text{\color{purple}Overcast},\text{\color{green}Cold})log_2\left(\dfrac{p(\text{\color{purple}Overcast},\text{\color{green}Cold})}{p(\text{\color{purple}Overcast})p(\text{\color{green}Cold})}\right) + p(\text{\color{lime}Rainy},\text{\color{green}Cold})log_2\left(\dfrac{p(\text{\color{lime}Rainy},\text{\color{green}Cold})}{p(\text{\color{lime}Rainy})p(\text{\color{green}Cold})}\right) \right] + \\ & \left[ p(\text{\color{orange}Sunny},\text{\color{blue}Mild})log_2\left(\dfrac{p(\text{\color{orange}Sunny},\text{\color{blue}Mild})}{p(\text{\color{orange}Sunny})p(\text{\color{blue}Mild})}\right) 
+ p(\text{\color{purple}Overcast},\text{\color{blue}Mild})log_2\left(\dfrac{p(\text{\color{purple}Overcast},\text{\color{blue}Mild})}{p(\text{\color{purple}Overcast})p(\text{\color{blue}Mild})}\right) + p(\text{\color{lime}Rainy},\text{\color{blue}Mild})log_2\left(\dfrac{p(\text{\color{lime}Rainy},\text{\color{blue}Mild})}{p(\text{\color{lime}Rainy})p(\text{\color{blue}Mild})}\right) \right] \end{align}
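
The expansion above can be evaluated mechanically by counting pairs. The sketch below estimates every probability as a relative frequency; the observation list is hypothetical (the notes do not include the underlying table), and pairs that never occur are simply skipped, which matches the convention that such terms contribute 0.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) in bits, with probabilities estimated from (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)                 # counts of each (x, y) combination
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

# Hypothetical (Outlook, Temperature) observations, for illustration only.
observations = [
    ("Sunny", "Hot"), ("Sunny", "Hot"), ("Overcast", "Hot"),
    ("Rainy", "Mild"), ("Rainy", "Cold"), ("Overcast", "Cold"),
]
print(mutual_information(observations))
```

Two variables that always change together give the maximum MI, while independent variables give an MI of 0.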

2. Distance-based Classification

Centroid-based Classification

Build a centroid for each class (the mean of all points in that class).

\begin{align} \mu_{-} & = \dfrac{1}{N_{-}} \sum_{y_n = -1} x_n \\ \mu_{+} & = \dfrac{1}{N_{+}} \sum_{y_n = +1} x_n \end{align}

Then check which class centroid the point is closer to:

$f(x) := ||\mu_{-} - x||^2 - ||\mu_{+} - x||^2$
- If $f(x) \ge 0$, the point is at least as close to $\mu_{+}$, so it is Class +; otherwise it is Class -

Example

Student	Height	Weight	Label (Gender)
1	$100.0$	$20.0$	$\color{red}{-1}$
2	$100.0$	$26.0$	$\color{green}{1}$
3	$100.0$	$30.4$	$\color{green}{1}$
4	$100.0$	$32.4$	$\color{green}{1}$
5	$101.6$	$27.0$	$\color{green}{1}$
6	$101.6$	$32.0$	$\color{green}{1}$
7	$102.0$	$21.0$	$\color{red}{-1}$
8	$103.6$	$29.6$	$\color{green}{1}$
9	$104.4$	$30.4$	$\color{green}{1}$
10	$104.9$	$22.0$	$\color{red}{-1}$
11	$105.2$	$20.0$	$\color{red}{-1}$
12	$105.6$	$34.4$	$\color{green}{1}$
13	$107.2$	$32.4$	$\color{green}{1}$
14	$109.9$	$34.9$	$\color{green}{1}$
15	$111.0$	$25.4$	$\color{red}{-1}$
16	$114.2$	$23.5$	$\color{red}{-1}$
17	$115.5$	$36.3$	$\color{green}{1}$
18	$117.8$	$26.9$	$\color{red}{-1}$

$\color{red}-1$ = Female $\color{green}1$ = Male

\begin{align} \text{Centroid of } {-1}; \text{ Height} & = \dfrac{100+102+104.9+105.2+111+114.2+117.8}{7} \\ & = {\color{red}107.871} \\ \text{Weight} & = \dfrac{20+21+22+20+25.4+23.5+26.9}{7} \\ & = {\color{red}22.685} \\ \text{Centroid of } {1}; \text{ Height} & = \dfrac{100+100+100+101.6+101.6+103.6+104.4+105.6+107.2+109.9+115.5}{11} \\ & = {\color{green}104.490} \\ \text{Weight} & = \dfrac{26.0+30.4+32.4+27.0+32.0+29.6+30.4+34.4+32.4+34.9+36.3}{11} \\ & = {\color{green}31.436} \end{align}

If a 19th student arrives with $\text{Height} = 120$ , $\text{Weight} = 28.7$ , which Gender is predicted?

\begin{align} distance({\color{red}-1},x) & = \sqrt{(120-{\color{red}107.871})^2 + (28.7-{\color{red}22.685})^2} \\ & = {\color{red}13.538} \\ distance({\color{green}1},x) & = \sqrt{(120-{\color{green}104.490})^2 + (28.7-{\color{green}31.436})^2} \\ & = {\color{green}15.749} \end{align}

$\therefore \text{Gender} = \text{\color{red}Female}$
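
The worked example can be checked with a short script. This is a minimal sketch using the 18 students from the table; the function names are made up here, and `math.dist` requires Python 3.8 or newer.

```python
import math

# (height, weight, label) for the 18 students; -1 = Female, 1 = Male
students = [
    (100.0, 20.0, -1), (100.0, 26.0, 1), (100.0, 30.4, 1), (100.0, 32.4, 1),
    (101.6, 27.0, 1), (101.6, 32.0, 1), (102.0, 21.0, -1), (103.6, 29.6, 1),
    (104.4, 30.4, 1), (104.9, 22.0, -1), (105.2, 20.0, -1), (105.6, 34.4, 1),
    (107.2, 32.4, 1), (109.9, 34.9, 1), (111.0, 25.4, -1), (114.2, 23.5, -1),
    (115.5, 36.3, 1), (117.8, 26.9, -1),
]

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def classify(x, centroids):
    """Return the label whose centroid is nearest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

centroids = {
    label: centroid([(h, w) for h, w, y in students if y == label])
    for label in (-1, 1)
}
prediction = classify((120.0, 28.7), centroids)
print(centroids)
print(prediction)  # -1, i.e. Female
```

Comparing plain distances gives the same decision as the sign of $f(x)$, since squaring does not change which distance is smaller.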

K-Nearest Neighbor (K-NN) Classification

Look at the $k$ (1, 2, 3, ...) nearest training points and assign the class that most of them share (majority vote).

D(X,Y) = \sqrt{\sum^D_{i=1}(x_i - y_i)^2}

Example

Petal	Sepal	Class	$D(curr, 1)$
$4.9$	$3.0$	$\text{\color{red}Setosa}$	$\sqrt{(4.9-{\color{blue}5.4})^2 + (3.0-{\color{blue}2.9})^2}={\color{red}0.509}$
$4.7$	$3.2$	$\text{\color{red}Setosa}$	$\sqrt{(4.7-{\color{blue}5.4})^2 + (3.2-{\color{blue}2.9})^2}={\color{red}0.761}$
$4.6$	$3.1$	$\text{\color{red}Setosa}$	$\sqrt{(4.6-{\color{blue}5.4})^2 + (3.1-{\color{blue}2.9})^2}=0.824$
$5.0$	$3.6$	$\text{\color{red}Setosa}$	$\sqrt{(5.0-{\color{blue}5.4})^2 + (3.6-{\color{blue}2.9})^2}=0.806$
$6.4$	$3.2$	$\text{\color{green}Versicolor}$	$\sqrt{(6.4-{\color{blue}5.4})^2 + (3.2-{\color{blue}2.9})^2}=1.044$
$6.9$	$3.1$	$\text{\color{green}Versicolor}$	$\sqrt{(6.9-{\color{blue}5.4})^2 + (3.1-{\color{blue}2.9})^2}=1.513$
$5.5$	$2.3$	$\text{\color{green}Versicolor}$	$\sqrt{(5.5-{\color{blue}5.4})^2 + (2.3-{\color{blue}2.9})^2}={\color{red}0.608}$
$6.5$	$2.8$	$\text{\color{green}Versicolor}$	$\sqrt{(6.5-{\color{blue}5.4})^2 + (2.8-{\color{blue}2.9})^2}=1.104$
$6.7$	$3.0$	$\text{\color{blue}Virginica}$	$\sqrt{(6.7-{\color{blue}5.4})^2 + (3.0-{\color{blue}2.9})^2}=1.303$
$6.3$	$2.6$	$\text{\color{blue}Virginica}$	$\sqrt{(6.3-{\color{blue}5.4})^2 + (2.6-{\color{blue}2.9})^2}=0.948$
$6.5$	$3.0$	$\text{\color{blue}Virginica}$	$\sqrt{(6.5-{\color{blue}5.4})^2 + (3.0-{\color{blue}2.9})^2}=1.104$

Data point to classify

No	Petal	Sepal	Class
1	5.4	2.9	??

With $k = 3$ , the three smallest distances are
- $0.509$ , $\text{\color{red}Setosa}$
- $0.608$ , $\text{\color{green}Versicolor}$
- $0.761$ , $\text{\color{red}Setosa}$
$\text{\color{red}Setosa}$ appears most often, so the answer is $\text{\color{red}Setosa}$
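
The same vote can be reproduced with a brute-force search over the 11 training rows. This is a minimal sketch; the function name is hypothetical, and ties in the vote are broken by whichever label was counted first.

```python
import math
from collections import Counter

# ((petal, sepal), class) rows from the table above
train = [
    ((4.9, 3.0), "Setosa"), ((4.7, 3.2), "Setosa"),
    ((4.6, 3.1), "Setosa"), ((5.0, 3.6), "Setosa"),
    ((6.4, 3.2), "Versicolor"), ((6.9, 3.1), "Versicolor"),
    ((5.5, 2.3), "Versicolor"), ((6.5, 2.8), "Versicolor"),
    ((6.7, 3.0), "Virginica"), ((6.3, 2.6), "Virginica"),
    ((6.5, 3.0), "Virginica"),
]

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

prediction = knn_predict(train, (5.4, 2.9), k=3)
print(prediction)  # Setosa
```

Sorting every training point costs time proportional to the dataset size for each query, which is what structures such as the K-d Tree try to reduce.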

K-dimensional Tree (K-d Tree)

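A data structure for storing points so that Nearest Neighbor lookups are fast.
It recursively partitions the space, one dimension at a time.
Construction
1. Sort the data along the x axis
2. Split the data at the median
3. Go back to step 1, this time using the y axis (the axes alternate at each level)
Usage: similar to searching a BST
Example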

X Y
7 2
5 9
9 6
4 7
8 1
7 6

$(7,2), (5,9), (9,6), (4,7), (8,1), (7,6)$ sorted by $x$ gives $({\color{red}4},7), ({\color{red}5},9), ({\color{red}7},2), ({\color{red}7},6), ({\color{red}8},1), ({\color{red}9},6)$ ; the median is $7$ , so the data is split into $\lt 7$ and $\ge 7$

The $\lt 7$ side, $(4,7),(5,9)$ , sorted by $y$ gives $(4,{\color{blue}7}),(5,{\color{blue}9})$ ; the median is $8$ , so it is split into $\lt 8$ and $\ge 8$

The $\ge 7$ side, $(7,2),(7,6),(8,1),(9,6)$ , sorted by $y$ gives $(8,{\color{blue}1}),(7,{\color{blue}2}),(7,{\color{blue}6}),(9,{\color{blue}6})$ ; the median is $4$ , so it is split into $\lt 4$ and $\ge 4$
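
The construction steps can be sketched as a small recursive function. This follows the example's convention of splitting into "less than the median" and "at least the median" and stops after two levels, as the example does; the dictionary layout, the `max_depth` parameter, and the function names are assumptions made for this sketch.

```python
import statistics

def build_kd_tree(points, depth=0, max_depth=2):
    """Split on the median of alternating axes (x at even depth, y at odd depth)."""
    axis = depth % 2
    if depth == max_depth or len(points) <= 1:
        return {"points": points}              # leaf: store the remaining points
    median = statistics.median(p[axis] for p in points)
    left = [p for p in points if p[axis] < median]
    right = [p for p in points if p[axis] >= median]
    if not left:                               # every point is >= median: cannot split
        return {"points": points}
    return {
        "axis": axis,
        "split": median,
        "left": build_kd_tree(left, depth + 1, max_depth),
        "right": build_kd_tree(right, depth + 1, max_depth),
    }

def find_leaf(node, query):
    """Walk down the tree like a BST and return the points in the reached leaf."""
    while "points" not in node:
        branch = "left" if query[node["axis"]] < node["split"] else "right"
        node = node[branch]
    return node["points"]

points = [(7, 2), (5, 9), (9, 6), (4, 7), (8, 1), (7, 6)]
tree = build_kd_tree(points)
print(tree["split"], tree["left"]["split"], tree["right"]["split"])
print(find_leaf(tree, (8, 5)))
```

Reaching a leaf only yields candidate neighbors; an exact nearest-neighbor search also has to backtrack into sibling regions that could contain a closer point.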

💻 Code

Centroid-based classification

from sklearn.neighbors import NearestCentroid

K-neighbors classification

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, algorithm='auto')
# algorithm = 'auto', 'kd_tree', 'brute'

Create random dataset

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000,n_features=4,n_classes=2)

Train test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Hyper-parameter tuning (GridSearchCV)

from sklearn.model_selection import GridSearchCV

params = {'n_neighbors': list(range(1, 10))} # 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9]
clf = GridSearchCV(model, params)
clf.fit(X_train, y_train)
clf.best_params_
# {'n_neighbors': 9}
clf.best_estimator_
# Retrieve the best model found by the search

Saving / Import model

import pickle

# Use context managers so the files are closed after writing and reading
with open('my_model.sav', 'wb') as f:
    pickle.dump(model, f)
with open('my_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

3. Tree-based Classification

Decision Tree

It consists of:

Internal Node: a decision point (a test on one feature)
Branch (Edge): an outcome of that decision
Leaf Node: the final result (Label)

Entropy measurement (measuring uncertainty)

Measures randomness, i.e. uncertainty.
$H(S) \in [0, 1]$ for two classes (in general $[0, \log_2 |C|]$ )
$C$ is the set of labels, $S$ is the set of samples

H(S) = -\sum_{c \in C} p_c \cdot log_2 \left( p_c \right)

The Decision Tree split is chosen so that the resulting entropy is lowest.
- The higher the value, the more evenly the samples are spread across the classes
- The lower the value, the more uneven the class counts (the set is purer)
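
A minimal sketch of the entropy formula, with class probabilities estimated as relative frequencies of the labels. The function name and the label lists are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    # p * log2(1 / p) is the same as -p * log2(p); absent classes contribute 0.
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))    # evenly split, two classes
print(entropy(["yes", "yes", "yes", "yes"]))  # a single class
print(entropy(["a", "b", "c"]))               # evenly split, three classes
```

With more than two classes the maximum is $\log_2 |C|$ rather than 1, which the three-class call illustrates.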
Example

Gini Index

Measures the inequality / impurity of the class distribution.
$Gini(S) \in [0, 0.5]$ for two classes
The Decision Tree split is chosen so that the resulting Gini value is lowest.
- If $Gini(S) = 0$ , only one class is present
- If $Gini(S) = 0.5$ , both classes have the same number of samples

Gini(S) = 1 - \sum_{c \in C} p_c^2
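
The Gini formula translates almost directly into code. As before, this is only a sketch with an assumed function name and made-up labels.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes", "yes", "yes", "yes"]))  # one class only
print(gini(["yes", "yes", "no", "no"]))    # two classes, evenly split
print(gini(["a", "b", "c"]))               # three classes, evenly split
```

For $k$ evenly split classes the value is $1 - 1/k$, so the upper bound of 0.5 applies to the two-class case.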

Example

Information Gain

Used to decide where the Decision Tree should split.
$|S_f|$ : number of samples whose feature $F$ takes the value $f$
$|S|$ : total number of samples
$H(S_f)$ : entropy of the label within the subset $S_f$

IG(S,F) = H(S) - \sum_{f \in F} \dfrac{| S_f |}{| S |}H(S_f)

Example

Name	Hair	Height	Weight	Lotion	Result
Sarah	Blonde	Average	Light	No	Sunburned
Dana	Blonde	Tall	Average	Yes	None
Alex	Brown	Short	Average	Yes	None
Annie	Blonde	Short	Average	No	Sunburned
Emily	Red	Average	Heavy	No	Sunburned
Pete	Brown	Tall	Heavy	No	None
John	Brown	Average	Heavy	No	None
Katie	Blonde	Short	Light	Yes	None

Compute the entropy of the label

\begin{align} H(\text{Result}) & = - \left[P(\text{\color{red}Sunburned})\log_2 P(\text{\color{red}Sunburned}) + P(\text{\color{blue}None})\log_2 P(\text{\color{blue}None})\right] \\ & = - \left[ {\color{red}\dfrac{3}{8}\log_2\dfrac{3}{8}} + {\color{blue}\dfrac{5}{8}\log_2\dfrac{5}{8}} \right] \\ & = 0.9544340029 \end{align}

Compute the entropy of the label after splitting on the Hair feature

Name	Hair	Height	Weight	Lotion	Result
Sarah	Blonde	Average	Light	No	Sunburned
Dana	Blonde	Tall	Average	Yes	None
Annie	Blonde	Short	Average	No	Sunburned
Katie	Blonde	Short	Light	Yes	None

\begin{align} H(\text{Result}|\text{Hair}=\text{Blonde}) & = - \left[P(\text{\color{red}Sunburned})\log_2 P(\text{\color{red}Sunburned}) + P(\text{\color{blue}None})\log_2 P(\text{\color{blue}None})\right] \\ & = - \left( {\color{red}\dfrac{2}{4}\log_2\dfrac{2}{4}} + {\color{blue}\dfrac{2}{4}\log_2\dfrac{2}{4}} \right) \\ & = 1 \end{align}

Name	Hair	Height	Weight	Lotion	Result
Emily	Red	Average	Heavy	No	Sunburned

\begin{align} H(\text{Result}|\text{Hair}=\text{Red}) & = - \left[P(\text{\color{red}Sunburned})\log_2 P(\text{\color{red}Sunburned}) + P(\text{\color{blue}None})\log_2 P(\text{\color{blue}None})\right] \\ & = - \left( {\color{red}\dfrac{1}{1}\log_2\dfrac{1}{1}} + {\color{blue}\dfrac{0}{1}\log_2\dfrac{0}{1}} \right) \\ & = 0 \end{align}

(By convention, $0 \log_2 0$ is taken to be $0$ .)

Name	Hair	Height	Weight	Lotion	Result
Alex	Brown	Short	Average	Yes	None
Pete	Brown	Tall	Heavy	No	None
John	Brown	Average	Heavy	No	None

\begin{align} H(\text{Result}|\text{Hair}=\text{Brown}) & = - \left[P(\text{\color{red}Sunburned})\log_2 P(\text{\color{red}Sunburned}) + P(\text{\color{blue}None})\log_2 P(\text{\color{blue}None})\right] \\ & = - \left( {\color{red}\dfrac{0}{3}\log_2\dfrac{0}{3}} + {\color{blue}\dfrac{3}{3}\log_2\dfrac{3}{3}} \right) \\ & = 0 \end{align}

The Information Gain when splitting on Hair is therefore (Blonde, Red, Brown in that order)

\begin{align} IG(\text{Hair}) & = H(\text{Result}) - \left[ \dfrac{|S_{Blonde}|}{|S|}H(Blonde) + \dfrac{|S_{Red}|}{|S|}H(Red) + \dfrac{|S_{Brown}|}{|S|}H(Brown) \right] \\ & = 0.954 - \left[ \dfrac{4}{8}(1) + \dfrac{1}{8}(0) + \dfrac{3}{8}(0) \right] \\ & = 0.454 \end{align}

Information Gain when splitting on Weight (Light, Average, Heavy)

\begin{align} IG(\text{Weight}) & = 0.954 - \left[ \dfrac{2}{8}(1) + \dfrac{3}{8}(0.9182958341) + \dfrac{3}{8}(0.9182958341) \right] \\ & = 0.015 \end{align}

Information Gain when splitting on Lotion (Yes, No)

\begin{align} IG(\text{Lotion}) & = 0.954 - \left[ \dfrac{3}{8}(0) + \dfrac{5}{8}(0.9709505945) \right] \\ & = 0.347 \end{align}

$\therefore$ Comparing the IG values for Hair, Weight, and Lotion, the highest IG is obtained by splitting on Hair
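
The whole comparison can be reproduced in a few lines. The sketch below encodes the eight rows of the table as tuples (names dropped) and also evaluates Height, which the worked example skips; function and variable names are assumptions of this sketch, and small differences in the last digit come from rounding in the hand calculation.

```python
import math
from collections import Counter, defaultdict

FEATURES = ("Hair", "Height", "Weight", "Lotion")
rows = [  # (Hair, Height, Weight, Lotion, Result)
    ("Blonde", "Average", "Light", "No", "Sunburned"),   # Sarah
    ("Blonde", "Tall", "Average", "Yes", "None"),        # Dana
    ("Brown", "Short", "Average", "Yes", "None"),        # Alex
    ("Blonde", "Short", "Average", "No", "Sunburned"),   # Annie
    ("Red", "Average", "Heavy", "No", "Sunburned"),      # Emily
    ("Brown", "Tall", "Heavy", "No", "None"),            # Pete
    ("Brown", "Average", "Heavy", "No", "None"),         # John
    ("Blonde", "Short", "Light", "Yes", "None"),         # Katie
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(rows, feature_index):
    """H(S) minus the size-weighted entropy of each subset S_f."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_index]].append(row[-1])   # group labels by feature value
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - weighted

gains = {name: information_gain(rows, i) for i, name in enumerate(FEATURES)}
best_feature = max(gains, key=gains.get)
print(gains)
print(best_feature)  # Hair
```

A tree builder would split on the best feature and then repeat the same computation inside each resulting subset.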

Tree Ensemble

Combines several Decision Trees so that they make the decision together.
There are two approaches:
- Bagging: draw several subsets of the training data, train one model per subset, and at prediction time return the answer given by the most models (majority vote)
- Boosting: build Decision Trees in sequence so that each new tree corrects the answers of the previous ones
  - Usually built from weak trees only 1-2 levels deep
  - Each new tree is fit using the output of the earlier trees, to reduce the remaining error
  - e.g. XGBoost
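
To make the bagging idea concrete, here is a toy sketch. It does not use real decision trees: the weak learner is a hypothetical one-feature "stump" whose threshold sits halfway between the two class means, and it assumes class 1 has the larger feature values. The data, names, and seed are all made up for illustration.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

def fit_stump(sample):
    """Hypothetical weak learner: threshold halfway between the class means.

    Returns None when the sample happens to contain only one class.
    """
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        return None
    return (mean(xs0) + mean(xs1)) / 2

def fit_bagging(data, n_models=25, seed=0):
    """Train n_models stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    thresholds = []
    while len(thresholds) < n_models:
        threshold = fit_stump(bootstrap_sample(data, rng))
        if threshold is not None:      # redraw if a class was missing
            thresholds.append(threshold)
    return thresholds

def predict(thresholds, x):
    """Each stump votes 1 if x is above its threshold; the majority wins."""
    votes = [1 if x > t else 0 for t in thresholds]
    return Counter(votes).most_common(1)[0][0]

data = [(x, 0) for x in (1, 2, 3, 4, 5)] + [(x, 1) for x in (11, 12, 13, 14, 15)]
ensemble = fit_bagging(data)
print(predict(ensemble, 0), predict(ensemble, 20))
```

Random Forest follows the same pattern with full decision trees as the base models and, in addition, a random subset of features considered at each split.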

💻 Code

Create a Decision Tree

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=5, criterion='entropy')
model.fit(X_train, y_train)

Plot the Decision Tree

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
# Assumes X_train is a pandas DataFrame, so its columns supply the feature names
plot_tree(model, filled=True, feature_names=list(X_train.columns), class_names=['<=50K', '>50K'])
plt.show()

Convert categorical data (strings or other values) into numbers (Label Encode)

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['f1'] = label_encoder.fit_transform(df['f1'])

Build a Random Forest (Tree Ensemble)

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10, criterion='entropy', max_depth=5)
model.fit(X_train, y_train)

Plot the Random Forest

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(10, 2), dpi=900)
# Draw the first three trees of the forest side by side
for idx in range(3):
  plot_tree(model.estimators_[idx], filled=True,
    feature_names=list(X_train.columns),
    class_names=['<=50K', '>50K'],
    ax=axes.ravel()[idx]
  )
plt.show()

Create an XGBoost model

import xgboost as xgb

# multi:softmax with num_class=3 assumes the labels are the integers 0, 1, 2
model = xgb.XGBClassifier(objective='multi:softmax', n_estimators=10, max_depth=5, num_class=3)
model.fit(X_train, y_train)