{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"# Machine Learning Starting with Decision Trees"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"hidden-input",
"hidden-output",
"remove-input",
"remove-output"
]
},
"outputs": [],
"source": [
"try:\n",
"    import category_encoders\n",
"    import graphviz\n",
"except ImportError:\n",
"    !pip install graphviz\n",
"    !pip install category_encoders"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Library for working with tabular data\n",
"import pandas as pd\n",
"\n",
"# scikit-learn (sklearn), the machine learning library\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.tree import export_graphviz\n",
"\n",
"# Other utilities\n",
"import category_encoders\n",
"\n",
"# Plotting and graph-rendering libraries\n",
"from graphviz import Source\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"## Quiz"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"Running the code below loads data from a United States census conducted in a certain year and stores it in `income_df`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" age workclass fnlwgt education education-num \\\n",
"0 39 State-gov 77516 Bachelors 13 \n",
"1 50 Self-emp-not-inc 83311 Bachelors 13 \n",
"2 38 Private 215646 HS-grad 9 \n",
"3 53 Private 234721 11th 7 \n",
"4 28 Private 338409 Bachelors 13 \n",
"\n",
" marital-status occupation relationship race sex \\\n",
"0 Never-married Adm-clerical Not-in-family White Male \n",
"1 Married-civ-spouse Exec-managerial Husband White Male \n",
"2 Divorced Handlers-cleaners Not-in-family White Male \n",
"3 Married-civ-spouse Handlers-cleaners Husband Black Male \n",
"4 Married-civ-spouse Prof-specialty Wife Black Female \n",
"\n",
" capital-gain capital-loss hours-per-week native-country income \n",
"0 2174 0 40 United-States <=50K \n",
"1 0 0 13 United-States <=50K \n",
"2 0 0 40 United-States <=50K \n",
"3 0 0 40 United-States <=50K \n",
"4 0 0 40 Cuba <=50K "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the data\n",
"income_df = pd.read_table(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data\", sep=',', header=None)\n",
"\n",
"# Assign names to the columns (features)\n",
"income_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation',\n",
"                     'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']\n",
"\n",
"# Display the data (first five rows)\n",
"income_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
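},
"source": [
"The columns (features) in the data have the following meanings:\n",
"\n",
"* age: age (integer)\n",
"* workclass: type of employment (government employee, private-sector employee, etc.)\n",
"* fnlwgt: not used\n",
"* education: educational attainment\n",
"* education-num: not used\n",
"* marital-status: marital status\n",
"* occupation: occupation\n",
"* relationship: role within the family\n",
"* race: race\n",
"* sex: sex\n",
"* capital-gain: not used\n",
"* capital-loss: not used\n",
"* hours-per-week: hours worked per week (integer)\n",
"* native-country: country of origin\n",
"* income: annual income (binary: `<=50K` for 50K dollars or less, `>50K` for more than 50K dollars)\n",
"\n",
"We want to apply the decision tree algorithm to this data and build a machine learning model that classifies whether a person's annual income exceeds 50K dollars."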
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"(L2-Q1)=\n",
"### Q1: Histogram\n",
"Before building a machine learning model, we want to know the age distribution of the respondents in the `income_df` data.\n",
"Create a histogram of age with 10 bins.\n",
"\n",
"Hint: the `pandas.Series.hist` method is a convenient way to create a histogram ([reference](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAWH0lEQVR4nO3dfYxc1X3G8e8TnIDjTb12ICvXdmsqLBBhi8Er2yhptIsbYyCKUUUQkRUW5Nb/OClUroppRZ3wojoqhIKUoFqxG5MXNtQJxXIIxDVetVTCgMOL36DegAleGTvBxumCQ7r01z/uWTJs1t6Z3dnZGc7zkVY799xz5/7m7bl3zty5o4jAzMzy8IHxLsDMzGrHoW9mlhGHvplZRhz6ZmYZceibmWVkwngXcDKnn356zJo1q2bre/PNN5k0aVLN1jdSjVBnI9QIrrPaGqHORqgRRlfnjh07fhkRZww5MyLq9m/u3LlRS9u2bavp+kaqEepshBojXGe1NUKdjVBjxOjqBJ6OE+Sqh3fMzDLi0Dczy4hD38wsIw59M7OMOPTNzDLi0Dczy4hD38wsIw59M7OMOPTNzDJS16dhsMYxa9WP3jO9srWfawe1jYX9ay4f83WYvZ94T9/MLCNlhb6kZkkbJb0gaa+kiyRNlbRF0r70f0rqK0n3SOqR9LykC0uupzP13yepc6xulJmZDa3cPf27gUci4hzgfGAvsArYGhGzga1pGuBSYHb6Ww7cCyBpKrAamA/MA1YPbCjMzKw2hh3TlzQZ+BRwLUBE/Ab4jaQlQHvqtgHoBm4ElgD3pTO9PZHeJUxLfbdExJF0vVuAxcD91bs5eRs8rm5mNpiKbD5JB2kOsBbYQ7GXvwO4HuiNiObUR8DRiGiWtBlYExGPp3lbKTYG7cBpEXFbar8ZOB4Rdwxa33KKdwi0tLTM7erqqsoNLUdfXx9NTU01W99InajOnb3HxqGaobVMhEPHx349rdMnj2r5Rn/M600j1NkINcLo6uzo6NgREW1DzSvn6J0JwIXAlyJiu6S7+e1QDgAREZJOvvUoU0SspdjI0NbWFu3t7dW42rJ0d3dTy/WN1InqrMXRMuVa2drPnTvH/uCw/UvbR7V8oz/m9aYR6myEGmHs6ixnTP8AcCAitqfpjRQbgUNp2Ib0/3Ca3wvMLFl+Rmo7UbuZmdXIsKEfEa8Br0o6OzUtpBjq2QQMHIHTCTyULm8CrklH8SwAjkXEQeBRYJGkKekD3EWpzczMaqTc999fAr4r6UPAS8B1FBuMByQtA14Brkp9HwYuA3qAt1JfIuKIpFuBp1K/WwY+1H2/GesPVGv1xScze/8pK/Qj4llgqA8FFg7RN4AVJ7ie9cD6CuozM7Mq8jdyzcwy4tA3M8uIQ9/MLCMOfTOzjDj0zcwy4tA3M8uIQ9/MLCMOfTOzjDj0zcwy8r7+jdxKT4fg0xuY2fud9/TNzDLi0Dczy4hD38wsIw59M7OMOPTNzDLi0Dczy4hD38wsIw59M7OMOPTNzDLi0Dczy4hD38wsIw59M7OMOPTNzDLi0Dczy4hD38wsI2WFvqT9knZKelbS06ltqqQtkval/1NSuyTdI6lH0vOSLiy5ns7Uf5+kzrG5SWZmdiKV7Ol3RMSciGhL06uArRExG9iapgEuBWanv+XAvVBsJIDVwHxgHrB6YENhZma1MZrhnSXAhnR5A3BFSft9UXgCaJY0DbgE2BIRRyLiKLAFWDyK9ZuZWYUUEcN3kl4GjgIB/HNErJX0RkQ0p/kCjkZEs6TNwJqIeDzN2wrcCLQDp0XEban9ZuB4RNwxaF3LKd4h0NLSMrerq2vEN25n77GK+rdMhEPHR7y6mmmEOmtVY+v0yaNavq+vj6ampipVM3ZcZ/U0Qo0wujo7Ojp2lIzKvEe5v5H7yYjolfQxYIukF0pnRkRIGn7rUYaIWAusBWhra4v29vYRX1elv3e7srWfO3fW/88GN0Kdta
px/9L2US3f3d3NaJ5jteI6q6cRaoSxq7Os4Z2I6E3/DwMPUozJH0rDNqT/h1P3XmBmyeIzUtuJ2s3MrEaGDX1JkyR9ZOAysAjYBWwCBo7A6QQeSpc3Adeko3gWAMci4iDwKLBI0pT0Ae6i1GZmZjVSzvvvFuDBYtieCcD3IuIRSU8BD0haBrwCXJX6PwxcBvQAbwHXAUTEEUm3Ak+lfrdExJGq3RIzMxvWsKEfES8B5w/R/jqwcIj2AFac4LrWA+srL9PMzKrB38g1M8uIQ9/MLCMOfTOzjDj0zcwy4tA3M8uIQ9/MLCMOfTOzjDj0zcwy4tA3M8uIQ9/MLCMOfTOzjDj0zcwy4tA3M8uIQ9/MLCMOfTOzjDj0zcwy4tA3M8uIQ9/MLCMOfTOzjDj0zcwy4tA3M8uIQ9/MLCMOfTOzjDj0zcwyUnboSzpF0jOSNqfpMyVtl9Qj6fuSPpTaT03TPWn+rJLruCm1vyjpkqrfGjMzO6lK9vSvB/aWTH8VuCsizgKOAstS+zLgaGq/K/VD0rnA1cDHgcXANySdMrryzcysEmWFvqQZwOXAN9O0gIuBjanLBuCKdHlJmibNX5j6LwG6IuLtiHgZ6AHmVeE2mJlZmRQRw3eSNgL/AHwE+GvgWuCJtDePpJnAjyPiPEm7gMURcSDN+xkwH/hyWuY7qX1dWmbjoHUtB5YDtLS0zO3q6hrxjdvZe6yi/i0T4dDxEa+uZhqhzlrV2Dp98qiW7+vro6mpqUrVjB3XWT2NUCOMrs6Ojo4dEdE21LwJwy0s6TPA4YjYIal9RBVUICLWAmsB2traor195Ku8dtWPKuq/srWfO3cOe5eMu0aos1Y17l/aPqrlu7u7Gc1zrFZcZ/U0Qo0wdnWW86r8BPBZSZcBpwG/B9wNNEuaEBH9wAygN/XvBWYCByRNACYDr5e0DyhdxmxEZlW4YR9sZWt/xTsHA/avuXxU6zYbD8OO6UfETRExIyJmUXwQ+1hELAW2AVembp3AQ+nypjRNmv9YFGNIm4Cr09E9ZwKzgSerdkvMzGxYo3n/fSPQJek24BlgXWpfB3xbUg9whGJDQUTslvQAsAfoB1ZExDujWL+ZmVWootCPiG6gO11+iSGOvomIXwOfO8HytwO3V1qkmZlVh7+Ra2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWkWFDX9Jpkp6U9Jyk3ZK+ktrPlLRdUo+k70v6UGo/NU33pPmzSq7rptT+oqRLxuxWmZnZkMrZ038buDgizgfmAIslLQC+CtwVEWcBR4Flqf8y4Ghqvyv1Q9K5wNXAx4HFwDcknVLF22JmZsMYNvSj0JcmP5j+ArgY2JjaNwBXpMtL0jRp/kJJSu1dEfF2RLwM9ADzqnEjzMysPIqI4TsVe+Q7gLOArwP/CDyR9uaRNBP4cUScJ2kXsDgiDqR5PwPmA19Oy3wnta9Ly2wctK7lwHKAlpaWuV1dXSO+cTt7j1XUv2UiHDo+4tXVTCPU2Qg1wujqbJ0+ubrFnERfXx9NTU01W99INUKdjVAjjK7Ojo6OHRHRNtS8CeVcQUS8A8yR1Aw8CJwzokrKW9daYC1AW1tbtLe3j/i6rl31o4r6r2zt586dZd0l46oR6myEGmF0de5f2l7dYk6iu7ub0bwWaqUR6myEGmHs6qzo6J2IeAPYBlwENEsaeLXMAHrT5V5gJkCaPxl4vbR9iGXMzKwGyjl654y0h4+kicCngb0U4X9l6tYJPJQub0rTpPmPRTGGtAm4Oh3dcyYwG3iySrfDzMzKUM772mnAhjSu/wHggYjYLGkP0CXpNuAZYF3qvw74tqQe4AjFETtExG5JDwB7gH5gRRo2MjOzGh
k29CPieeCCIdpfYoijbyLi18DnTnBdtwO3V16mmZlVg7+Ra2aWEYe+mVlGHPpmZhlx6JuZZcShb2aWEYe+mVlGHPpmZhlx6JuZZaT+z4hlVqdmVXhCv9FY2dr/7gkE96+5vGbrtfcf7+mbmWXEoW9mlhGHvplZRhz6ZmYZceibmWXEoW9mlhGHvplZRhz6ZmYZceibmWXEoW9mlhGHvplZRhz6ZmYZceibmWXEoW9mlhGHvplZRoYNfUkzJW2TtEfSbknXp/apkrZI2pf+T0ntknSPpB5Jz0u6sOS6OlP/fZI6x+5mmZnZUMrZ0+8HVkbEucACYIWkc4FVwNaImA1sTdMAlwKz099y4F4oNhLAamA+MA9YPbChMDOz2hg29CPiYET8NF3+H2AvMB1YAmxI3TYAV6TLS4D7ovAE0CxpGnAJsCUijkTEUWALsLiaN8bMzE6uojF9SbOAC4DtQEtEHEyzXgNa0uXpwKslix1IbSdqNzOzGin7N3IlNQE/AG6IiF9JendeRISkqEZBkpZTDAvR0tJCd3f3iK9rZWt/Rf1bJla+zHhohDoboUZozDpH85oYa319fXVdHzRGjTB2dZYV+pI+SBH4342IH6bmQ5KmRcTBNHxzOLX3AjNLFp+R2nqB9kHt3YPXFRFrgbUAbW1t0d7ePrhL2a6t8IerV7b2c+fO+v+t+EaosxFqhMasc//S9vEt5iS6u7sZzWu2FhqhRhi7Oss5ekfAOmBvRHytZNYmYOAInE7goZL2a9JRPAuAY2kY6FFgkaQp6QPcRanNzMxqpJxdnE8AXwB2Sno2tf0tsAZ4QNIy4BXgqjTvYeAyoAd4C7gOICKOSLoVeCr1uyUijlTjRpiZWXmGDf2IeBzQCWYvHKJ/ACtOcF3rgfWVFGhmZtXjb+SamWXEoW9mlhGHvplZRhz6ZmYZceibmWXEoW9mlhGHvplZRhz6ZmYZqf+TjpjZe8yq8JxS1bR/zeXjtm6rDu/pm5llxKFvZpYRh76ZWUYc+mZmGXHom5llxKFvZpYRh76ZWUYc+mZmGXHom5llxKFvZpYRh76ZWUYc+mZmGXHom5llxKFvZpYRh76ZWUYc+mZmGRk29CWtl3RY0q6StqmStkjal/5PSe2SdI+kHknPS7qwZJnO1H+fpM6xuTlmZnYy5ezpfwtYPKhtFbA1ImYDW9M0wKXA7PS3HLgXio0EsBqYD8wDVg9sKMzMrHaGDf2I+A/gyKDmJcCGdHkDcEVJ+31ReAJoljQNuATYEhFHIuIosIXf3ZCYmdkYU0QM30maBWyOiPPS9BsR0ZwuCzgaEc2SNgNrIuLxNG8rcCPQDpwWEbel9puB4xFxxxDrWk7xLoGWlpa5XV1dI75xO3uPVdS/ZSIcOj7i1dVMI9TZCDWC66xU6/TJJ53f19dHU1NTjaoZmUaoEUZXZ0dHx46IaBtq3qh/GD0iQtLwW47yr28tsBagra0t2tvbR3xd11b4A9IrW/u5c2f9/1Z8I9TZCDWC66zU/qXtJ53f3d3NaF6ztdAINcLY1TnSo3cOpWEb0v/Dqb0XmFnSb0ZqO1G7mZnV0EhDfxMwcAROJ/BQSfs16SieBcCxiDgIPAoskjQlfYC7KLWZmVkNDft+UdL9FGPyp0s6QHEUzhrgAUnLgFeAq1L3h4HLgB7gLeA6gIg4IulW4KnU75aIGPzhsJmZjbFhQz8iPn+CWQuH6BvAihNcz3pgfUXVmZlZVfkbuWZmGXHom5llxKFvZpYRh76ZWUYc+mZmGXHom5llxKFvZpYRh76ZWUYc+mZmGXHom5llxKFvZpaR8T9Bt5k1jFnD/EbFytb+in/Hohz711xe9evMlff0zcwy4tA3M8uIQ9/MLCMOfTOzjDj0zcwy4tA3M8uIQ9/MLCMOfTOzjDj0zcwy4tA3M8uIT8NgZnVvuNM/VKKSU0W8H0//4D19M7OMOPTNzDJS89CXtFjSi5J6JK2q9frNzHJW0zF9SacAXw
c+DRwAnpK0KSL21LIOM7NyVPOzhEp9a/GkMbneWu/pzwN6IuKliPgN0AUsqXENZmbZUkTUbmXSlcDiiPjzNP0FYH5EfLGkz3JgeZo8G3ixZgXC6cAva7i+kWqEOhuhRnCd1dYIdTZCjTC6Ov8wIs4YakbdHbIZEWuBteOxbklPR0TbeKy7Eo1QZyPUCK6z2hqhzkaoEcauzloP7/QCM0umZ6Q2MzOrgVqH/lPAbElnSvoQcDWwqcY1mJllq6bDOxHRL+mLwKPAKcD6iNhdyxqGMS7DSiPQCHU2Qo3gOqutEepshBphjOqs6Qe5ZmY2vvyNXDOzjDj0zcwykmXoS5opaZukPZJ2S7o+tU+VtEXSvvR/yjjXeZqkJyU9l+r8Smo/U9L2dCqL76cPxcedpFMkPSNpc5quuzol7Ze0U9Kzkp5ObfX2uDdL2ijpBUl7JV1UhzWene7Dgb9fSbqh3upMtf5Vev3sknR/el3V1XNT0vWpvt2SbkhtY3JfZhn6QD+wMiLOBRYAKySdC6wCtkbEbGBrmh5PbwMXR8T5wBxgsaQFwFeBuyLiLOAosGz8SnyP64G9JdP1WmdHRMwpOQa63h73u4FHIuIc4HyK+7SuaoyIF9N9OAeYC7wFPEid1SlpOvCXQFtEnEdxAMnV1NFzU9J5wF9QnLHgfOAzks5irO7LiMj+D3iI4nxALwLTUts04MXxrq2kxg8DPwXmU3xLb0Jqvwh4tA7qm5GemBcDmwHVaZ37gdMHtdXN4w5MBl4mHWRRjzUOUfMi4L/qsU5gOvAqMJXiaMXNwCX19NwEPgesK5m+Gfibsbovc93Tf5ekWcAFwHagJSIOplmvAS3jVdeANGTyLHAY2AL8DHgjIvpTlwMUT+zx9k8UT9T/S9MfpT7rDOAnknakU35AfT3uZwK/AP4lDZV9U9Ik6qvGwa4G7k+X66rOiOgF7gB+DhwEjgE7qK/n5i7gTyR9VNKHgcsovsQ6Jvdl1qEvqQn4AXBDRPyqdF4Um9dxP541It6J4i30DIq3f+eMb0W/S9JngMMRsWO8aynDJyPiQuBSimG9T5XOrIPHfQJwIXBvRFwAvMmgt/V1UOO70lj4Z4F/HTyvHupM4+BLKDamvw9MAhaPZ02DRcReiuGmnwCPAM8C7wzqU7X7MtvQl/RBisD/bkT8MDUfkjQtzZ9GsXddFyLiDWAbxVvRZkkDX6yrh1NZfAL4rKT9FGdOvZhiXLre6hzY8yMiDlOMQc+jvh73A8CBiNiepjdSbATqqcZSlwI/jYhDabre6vxT4OWI+EVE/C/wQ4rna109NyNiXUTMjYhPUXzG8N+M0X2ZZehLErAO2BsRXyuZtQnoTJc7Kcb6x42kMyQ1p8sTKT532EsR/lembuNeZ0TcFBEzImIWxVv9xyJiKXVWp6RJkj4ycJliLHoXdfS4R8RrwKuSzk5NC4E91FGNg3ye3w7tQP3V+XNggaQPp9f9wP1Zb8/Nj6X/fwD8GfA9xuq+HM8PWcbxg5NPUrxVep7irdSzFONoH6X4MHIf8O/A1HGu84+BZ1Kdu4C/T+1/BDwJ9FC8rT51vO/Tkprbgc31WGeq57n0txv4u9Reb4/7HODp9Lj/GzCl3mpMdU4CXgcml7TVY51fAV5Ir6FvA6fW4XPzPyk2Rs8BC8fyvvRpGMzMMpLl8I6ZWa4c+mZmGXHom5llxKFvZpYRh76ZWUYc+mZmGXHom5ll5P8B0zpbKcr8gKEAAAAASUVORK5CYII=\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"income_df['age'].hist(bins=10)"
]
},
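{
"cell_type": "markdown",
"metadata": {},
"source": [
"(L2-Q2)=\n",
"### Q2: Frequency counts\n",
"Before building a machine learning model, we want to know the distributions of sex and annual income in the `income_df` data.\n",
"For sex (male, female) and annual income (`<=50K`, `>50K`), find the number of people for each attribute value.\n",
"\n",
"Hint: the `pandas.Series.value_counts` method is a convenient way to count how often each value occurs ([reference](https://note.nkmk.me/python-pandas-value-counts/))"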
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Male 21790\n",
" Female 10771\n",
"Name: sex, dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"income_df['sex'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" <=50K 24720\n",
" >50K 7841\n",
"Name: income, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"income_df['income'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
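},
"source": [
"(L2-Q3)=\n",
"### Q3: Aggregating the data\n",
"Aggregate the `income_df` data and find the breakdown (as proportions) of annual income classes for each education level.\n",
"\n",
"Hint: use the pandas [crosstab](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html) function (it was also used in the Titanic example, so take another look at it)"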
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"income <=50K >50K\n",
"education \n",
" 10th 0.933548 0.066452\n",
" 11th 0.948936 0.051064\n",
" 12th 0.923788 0.076212\n",
" 1st-4th 0.964286 0.035714\n",
" 5th-6th 0.951952 0.048048\n",
" 7th-8th 0.938080 0.061920\n",
" 9th 0.947471 0.052529\n",
" Assoc-acdm 0.751640 0.248360\n",
" Assoc-voc 0.738784 0.261216\n",
" Bachelors 0.585247 0.414753\n",
" Doctorate 0.259080 0.740920\n",
" HS-grad 0.840491 0.159509\n",
" Masters 0.443413 0.556587\n",
" Preschool 1.000000 0.000000\n",
" Prof-school 0.265625 0.734375\n",
" Some-college 0.809765 0.190235"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(income_df['education'], income_df['income'], normalize='index')"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
},
"source": [
"(L2-Q4)=\n",
"### Q4: Splitting the data for training\n",
"To apply the decision tree algorithm to the `income_df` data, split the data 7:3, using 70% as training data (`income_train_df`) and 30% as evaluation data (`income_test_df`)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Split the data into training (70%) and evaluation (30%) sets\n",
"income_train_df, income_test_df = train_test_split(\n",
" income_df,\n",
" test_size=0.3,\n",
" random_state=1,\n",
" stratify=income_df.income)"
]
},
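{
"cell_type": "markdown",
"metadata": {},
"source": [
"(L2-Q5)=\n",
"### Q5: Building a decision tree\n",
"\n",
"The code below builds a decision tree that predicts the income category from the `income_df` data, using the attributes age, type of employment, education, marital status, occupation, role within the family, race, sex, hours worked per week, and country of origin.\n",
"Complete the code by filling in the part between the `# ----------` lines."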
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(criterion='entropy', max_depth=5)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Features of interest\n",
"target_features = ['age', 'workclass', 'education', 'marital-status', 'occupation',\n",
"                   'relationship', 'race', 'sex', 'hours-per-week', 'native-country']\n",
"\n",
"# Categorical variables to convert to numeric values\n",
"encoded_features = ['education', 'workclass', 'marital-status', 'relationship', 'occupation', 'native-country', 'race', 'sex']\n",
"\n",
"# Encoder that converts categorical variables to numeric (one-hot) values\n",
"encoder = category_encoders.OneHotEncoder(cols=encoded_features, use_cat_names=True)\n",
"encoder.fit(income_train_df[target_features])\n",
"\n",
"# ---------------------\n",
"# Fill in the required code from here\n",
"\n",
"# X_train holds the features used for prediction (everything except income)\n",
"X_train = income_train_df[target_features]\n",
"\n",
"# Convert the categorical variables to numeric values\n",
"X_train = encoder.transform(X_train)\n",
"\n",
"# y_train holds the income class labels\n",
"y_train = income_train_df.income\n",
"\n",
"# Define the learner (a decision tree)\n",
"model = DecisionTreeClassifier(criterion='entropy',\n",
"                               max_depth=5)\n",
"\n",
"# Fill in the required code up to here\n",
"# ---------------------\n",
"\n",
"# Train on the training data\n",
"model.fit(X_train, y_train)"
]
},
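{
"cell_type": "markdown",
"metadata": {},
"source": [
"(L2-Q6)=\n",
"### Q6: Feature importance in the decision tree\n",
"Using the decision tree model you built (`model`), display the importance of each attribute (column) for classifying annual income (`income`).\n",
"Attributes with zero importance need not be displayed."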
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"age\t0.0894604682456178\n",
"workclass_ State-gov\t0.000640985686209342\n",
"education_ Bachelors\t0.06141563902135156\n",
"education_ Some-college\t0.006168600381132655\n",
"education_ Prof-school\t0.003625407861733974\n",
"education_ HS-grad\t0.00881490359411402\n",
"education_ Masters\t0.010774717032624306\n",
"marital-status_ Married-civ-spouse\t0.5943231227829802\n",
"occupation_ Prof-specialty\t0.07941724495727506\n",
"occupation_ Exec-managerial\t0.0643690097067649\n",
"occupation_ Protective-serv\t0.0011395499411790816\n",
"sex_ Male\t0.0020402074457221034\n",
"sex_ Female\t0.0020803940555122083\n",
"hours-per-week\t0.07499207892598281\n",
"native-country_ Portugal\t0.0007376703617999955\n"
]
}
],
"source": [
"for feature, importance in zip(X_train.columns, model.feature_importances_):\n",
"    if importance > 0:\n",
"        print(\"{}\\t{}\".format(feature, importance))"
]
},
{
"cell_type": "markdown",
"metadata": {
"user_expressions": []
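},
"source": [
"(L2-Q7)=\n",
"### Q7: Rebuilding the decision tree\n",
"Based on the results of Q6, identify the features (at most five) that contribute to income classification, and rebuild the decision tree model using only those features.\n",
"When doing so, keep the tree from growing too deep so that the model stays as simple as possible."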
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(criterion='entropy', max_depth=3)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"target_features = ['age', 'education', 'marital-status', 'occupation', 'hours-per-week']\n",
"encoded_features = ['education', 'marital-status', 'occupation']\n",
"\n",
"encoder = category_encoders.OneHotEncoder(cols=encoded_features, use_cat_names=True)\n",
"encoder.fit(income_train_df[target_features])\n",
"\n",
"X_train = income_train_df[target_features]\n",
"X_train = encoder.transform(X_train)\n",
"y_train = income_train_df.income\n",
"\n",
"# Train a shallower tree on the reduced feature set\n",
"model = DecisionTreeClassifier(criterion='entropy', max_depth=3)\n",
"model.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Source(export_graphviz(\n",
"    model, out_file=None,\n",
"    feature_names=list(X_train.columns),\n",
"    class_names=['<=50K', '>50K'],\n",
"    proportion=True,\n",
"    filled=True, rounded=True  # cosmetic adjustments\n",
"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}