{ "cells": [ { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "# Machine Learning Starting with Decision Trees" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "hidden-input", "hidden-output", "remove-input", "remove-output" ] }, "outputs": [], "source": [
"try:\n",
"    import category_encoders\n",
"    import graphviz\n",
"except ImportError:\n",
"    !pip install graphviz\n",
"    !pip install category_encoders" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [
"# Library for working with tabular data\n",
"import pandas as pd\n",
"\n",
"# scikit-learn, the machine learning library\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.tree import export_graphviz\n",
"\n",
"# Other utilities\n",
"import category_encoders\n",
"\n",
"# Plotting and graph-drawing libraries\n",
"from graphviz import Source\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline" ] },
{ "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "\n", "---" ] },
{ "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "## Quiz" ] },
{ "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "Running the code below stores data in `income_df`. The data comes from a census conducted in the United States in a certain year." ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [
"<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>age</th><th>workclass</th><th>fnlwgt</th><th>education</th><th>education-num</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>sex</th><th>capital-gain</th><th>capital-loss</th><th>hours-per-week</th><th>native-country</th><th>income</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>39</td><td>State-gov</td><td>77516</td><td>Bachelors</td><td>13</td><td>Never-married</td><td>Adm-clerical</td><td>Not-in-family</td><td>White</td><td>Male</td><td>2174</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
"    <tr><th>1</th><td>50</td><td>Self-emp-not-inc</td><td>83311</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Exec-managerial</td><td>Husband</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>13</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
"    <tr><th>2</th><td>38</td><td>Private</td><td>215646</td><td>HS-grad</td><td>9</td><td>Divorced</td><td>Handlers-cleaners</td><td>Not-in-family</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
"    <tr><th>3</th><td>53</td><td>Private</td><td>234721</td><td>11th</td><td>7</td><td>Married-civ-spouse</td><td>Handlers-cleaners</td><td>Husband</td><td>Black</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
"    <tr><th>4</th><td>28</td><td>Private</td><td>338409</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Prof-specialty</td><td>Wife</td><td>Black</td><td>Female</td><td>0</td><td>0</td><td>40</td><td>Cuba</td><td>&lt;=50K</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [
"   age         workclass  fnlwgt  education  education-num  \\\n",
"0   39         State-gov   77516  Bachelors             13\n",
"1   50  Self-emp-not-inc   83311  Bachelors             13\n",
"2   38           Private  215646    HS-grad              9\n",
"3   53           Private  234721       11th              7\n",
"4   28           Private  338409  Bachelors             13\n",
"\n",
"       marital-status         occupation   relationship   race     sex  \\\n",
"0       Never-married       Adm-clerical  Not-in-family  White    Male\n",
"1  Married-civ-spouse    Exec-managerial        Husband  White    Male\n",
"2            Divorced  Handlers-cleaners  Not-in-family  White    Male\n",
"3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male\n",
"4  Married-civ-spouse     Prof-specialty           Wife  Black  Female\n",
"\n",
"   capital-gain  capital-loss  hours-per-week  native-country  income\n",
"0          2174             0              40   United-States   <=50K\n",
"1             0             0              13   United-States   <=50K\n",
"2             0             0              40   United-States   <=50K\n",
"3             0             0              40   United-States   <=50K\n",
"4             0             0              40            Cuba   <=50K" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [
"# Load the data\n",
"income_df = pd.read_table(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data\", sep=',', header=None)\n",
"\n",
"# Name the columns (features)\n",
"income_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation',\n",
"                     'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']\n",
"\n",
"# Show the first 5 rows\n",
"income_df.head()" ] },
{ "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [
"The column names (features) in the data have the following meanings:\n",
"\n",
"* age: age (integer)\n",
"* workclass: type of employment (government employee, private-sector employee, etc.)\n",
"* fnlwgt: not used\n",
"* education: educational attainment\n",
"* education-num: not used\n",
"* marital-status: marital status\n",
"* occupation: occupation\n",
"* relationship: role in the family\n",
"* race: race\n",
"* sex: sex\n",
"* capital-gain: not used\n",
"* capital-loss: not used\n",
"* hours-per-week: hours worked per week (integer)\n",
"* native-country: country of origin\n",
"* income: annual income (binary: more than 50K dollars, or 50K dollars or less)\n",
"\n",
"We want to apply the decision tree algorithm to this data and build a machine learning model that classifies whether a person's annual income exceeds 50K dollars." ] },
{ "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [
"(L2-Q1)=\n",
"### Q1: Histogram\n",
"Before building a machine learning model, we want to know the age distribution of the respondents in the `income_df` data.\n",
"Create a histogram of age with 10 bins.\n",
"\n",
"Hint: the `pandas.Series.hist` method is a convenient way to create a histogram ([reference](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html))" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[Figure: histogram of age with 10 bins]" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [
"income_df['age'].hist(bins=10)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"(L2-Q2)=\n",
"### Q2: Frequency counts\n",
"Before building a machine learning model, we want to know the distributions of sex and income in the `income_df` data.\n",
"For sex (male, female) and income (more than 50K, 50K or less), find the number of people for each attribute value.\n",
"\n",
"Hint: the `pandas.Series.value_counts` method is a convenient way to count how often each value occurs ([reference](https://note.nkmk.me/python-pandas-value-counts/))" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [
" Male      21790\n",
" Female    10771\n",
"Name: sex, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [
"income_df['sex'].value_counts()" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [
" <=50K    24720\n",
" >50K      7841\n",
"Name: income, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [
"income_df['income'].value_counts()" ] },
{ "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [
"(L2-Q3)=\n",
"### Q3: Aggregating the data\n",
"Aggregate the `income_df` data and find the breakdown (as proportions) of income classes for each education level.\n",
"\n",
"Hint: use the pandas [crosstab](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html) function (it was also used in the Titanic example, so go back and check it)" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [
"<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th>income</th><th>&lt;=50K</th><th>&gt;50K</th></tr>\n",
"    <tr><th>education</th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>10th</th><td>0.933548</td><td>0.066452</td></tr>\n",
"    <tr><th>11th</th><td>0.948936</td><td>0.051064</td></tr>\n",
"    <tr><th>12th</th><td>0.923788</td><td>0.076212</td></tr>\n",
"    <tr><th>1st-4th</th><td>0.964286</td><td>0.035714</td></tr>\n",
"    <tr><th>5th-6th</th><td>0.951952</td><td>0.048048</td></tr>\n",
"    <tr><th>7th-8th</th><td>0.938080</td><td>0.061920</td></tr>\n",
"    <tr><th>9th</th><td>0.947471</td><td>0.052529</td></tr>\n",
"    <tr><th>Assoc-acdm</th><td>0.751640</td><td>0.248360</td></tr>\n",
"    <tr><th>Assoc-voc</th><td>0.738784</td><td>0.261216</td></tr>\n",
"    <tr><th>Bachelors</th><td>0.585247</td><td>0.414753</td></tr>\n",
"    <tr><th>Doctorate</th><td>0.259080</td><td>0.740920</td></tr>\n",
"    <tr><th>HS-grad</th><td>0.840491</td><td>0.159509</td></tr>\n",
"    <tr><th>Masters</th><td>0.443413</td><td>0.556587</td></tr>\n",
"    <tr><th>Preschool</th><td>1.000000</td><td>0.000000</td></tr>\n",
"    <tr><th>Prof-school</th><td>0.265625</td><td>0.734375</td></tr>\n",
"    <tr><th>Some-college</th><td>0.809765</td><td>0.190235</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [
"income           <=50K      >50K\n",
"education\n",
"10th          0.933548  0.066452\n",
"11th          0.948936  0.051064\n",
"12th          0.923788  0.076212\n",
"1st-4th       0.964286  0.035714\n",
"5th-6th       0.951952  0.048048\n",
"7th-8th       0.938080  0.061920\n",
"9th           0.947471  0.052529\n",
"Assoc-acdm    0.751640  0.248360\n",
"Assoc-voc     0.738784  0.261216\n",
"Bachelors     0.585247  0.414753\n",
"Doctorate     0.259080  0.740920\n",
"HS-grad       0.840491  0.159509\n",
"Masters       0.443413  0.556587\n",
"Preschool     1.000000  0.000000\n",
"Prof-school   0.265625  0.734375\n",
"Some-college  0.809765  0.190235" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [
"pd.crosstab(income_df['education'], income_df['income'], normalize='index')" ] },
{ "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [
"(L2-Q4)=\n",
"### Q4: Splitting the data for training\n",
"To apply the decision tree algorithm to the `income_df` data, split the data 7:3, using 70% of it as training data (`income_train_df`) and 30% as evaluation data (`income_test_df`)." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [
"# Split the data into training (70%) and evaluation (30%) sets\n",
"income_train_df, income_test_df = train_test_split(\n",
"    income_df,\n",
"    test_size=0.3,\n",
"    random_state=1,\n",
"    stratify=income_df.income)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"(L2-Q5)=\n",
"### Q5: Building a decision tree\n",
"\n",
"The code below builds a decision tree that predicts the income category from the `income_df` data, using the attributes age, work class, education, marital status, occupation, role in the family, race, sex, hours worked per week, and native country.\n",
"Complete the code by filling in the part between the `# ----------` lines."
] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(criterion='entropy', max_depth=5)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [
"# Features to use\n",
"target_features = ['age', 'workclass', 'education', 'marital-status', 'occupation',\n",
"                   'relationship', 'race', 'sex', 'hours-per-week', 'native-country']\n",
"\n",
"# Categorical variables to convert to numbers\n",
"encoded_features = ['education', 'workclass', 'marital-status', 'relationship', 'occupation', 'native-country', 'race', 'sex']\n",
"\n",
"# Fit a one-hot encoder that converts the categorical variables to numeric columns\n",
"encoder = category_encoders.OneHotEncoder(cols=encoded_features, use_cat_names=True)\n",
"encoder.fit(income_train_df[target_features])\n",
"\n",
"# ---------------------\n",
"# Fill in the required code from here\n",
"\n",
"# X_train holds the features used for prediction (income itself is excluded)\n",
"X_train = income_train_df[target_features]\n",
"\n",
"# Convert the categorical variables to numeric columns\n",
"X_train = encoder.transform(X_train)\n",
"\n",
"# y_train is the income class label\n",
"y_train = income_train_df.income\n",
"\n",
"# Define the learner (a decision tree)\n",
"model = DecisionTreeClassifier(criterion='entropy',\n",
"                               max_depth=5)\n",
"\n",
"# Fill in the required code up to here\n",
"# ---------------------\n",
"\n",
"# Train on the training data\n",
"model.fit(X_train, y_train)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"(L2-Q6)=\n",
"### Q6: Feature importances in the decision tree\n",
"Using the decision tree model you built (`model`), display the importance of each attribute (column) for classifying income (`income`).\n",
"Attributes whose importance is zero need not be displayed."
] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
"age\t0.0894604682456178\n",
"workclass_ State-gov\t0.000640985686209342\n",
"education_ Bachelors\t0.06141563902135156\n",
"education_ Some-college\t0.006168600381132655\n",
"education_ Prof-school\t0.003625407861733974\n",
"education_ HS-grad\t0.00881490359411402\n",
"education_ Masters\t0.010774717032624306\n",
"marital-status_ Married-civ-spouse\t0.5943231227829802\n",
"occupation_ Prof-specialty\t0.07941724495727506\n",
"occupation_ Exec-managerial\t0.0643690097067649\n",
"occupation_ Protective-serv\t0.0011395499411790816\n",
"sex_ Male\t0.0020402074457221034\n",
"sex_ Female\t0.0020803940555122083\n",
"hours-per-week\t0.07499207892598281\n",
"native-country_ Portugal\t0.0007376703617999955\n" ] } ], "source": [
"for feature, importance in zip(X_train.columns, model.feature_importances_):\n",
"    if importance > 0:\n",
"        print(\"{}\\t{}\".format(feature, importance))" ] },
{ "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [
"(L2-Q7)=\n",
"### Q7: Rebuilding the decision tree\n",
"Based on the result of Q6, identify the features (at most five) that contribute to income classification, and rebuild the decision tree model using only those features.\n",
"When doing so, limit the depth of the tree so that the model is as simple as possible."
] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(criterion='entropy', max_depth=3)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [
"target_features = ['age', 'education', 'marital-status', 'occupation', 'hours-per-week']\n",
"encoded_features = ['education', 'marital-status', 'occupation']\n",
"\n",
"encoder = category_encoders.OneHotEncoder(cols=encoded_features, use_cat_names=True)\n",
"encoder.fit(income_train_df[target_features])\n",
"\n",
"X_train = income_train_df[target_features]\n",
"X_train = encoder.transform(X_train)\n",
"y_train = income_train_df.income\n",
"\n",
"# Train the model\n",
"model = DecisionTreeClassifier(criterion='entropy', max_depth=3)\n",
"model.fit(X_train, y_train)" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [
"[Figure: rendered decision tree; nodes listed as text, children indented under their parent]\n",
"node 0: marital-status_ Married-civ-spouse <= 0.5 | entropy = 0.796 | samples = 100.0% | value = [0.759, 0.241] | class = <=50K\n",
"  node 1 (True): hours-per-week <= 42.5 | entropy = 0.353 | samples = 54.2% | value = [0.933, 0.067] | class = <=50K\n",
"    node 2: age <= 28.5 | entropy = 0.229 | samples = 42.7% | value = [0.963, 0.037] | class = <=50K\n",
"      node 3: entropy = 0.062 | samples = 19.5% | value = [0.993, 0.007] | class = <=50K\n",
"      node 4: entropy = 0.336 | samples = 23.2% | value = [0.938, 0.062] | class = <=50K\n",
"    node 5: age <= 27.5 | entropy = 0.671 | samples = 11.5% | value = [0.824, 0.176] | class = <=50K\n",
"      node 6: entropy = 0.209 | samples = 2.8% | value = [0.967, 0.033] | class = <=50K\n",
"      node 7: entropy = 0.764 | samples = 8.7% | value = [0.778, 0.222] | class = <=50K\n",
"  node 8 (False): education_ Bachelors <= 0.5 | entropy = 0.992 | samples = 45.8% | value = [0.553, 0.447] | class = <=50K\n",
"    node 9: occupation_ Prof-specialty <= 0.5 | entropy = 0.968 | samples = 37.4% | value = [0.604, 0.396] | class = <=50K\n",
"      node 10: entropy = 0.937 | samples = 33.1% | value = [0.647, 0.353] | class = <=50K\n",
"      node 11: entropy = 0.847 | samples = 4.3% | value = [0.274, 0.726] | class = >50K\n",
"    node 12: occupation_ Exec-managerial <= 0.5 | entropy = 0.912 | samples = 8.5% | value = [0.327, 0.673] | class = >50K\n",
"      node 13: entropy = 0.956 | samples = 5.8% | value = [0.377, 0.623] | class = >50K\n",
"      node 14: entropy = 0.751 | samples = 2.6% | value = [0.215, 0.785] | class = >50K" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [
"Source(export_graphviz(\n",
"    model, out_file=None,\n",
"    feature_names=X_train.columns,\n",
"    class_names=['<=50K', '>50K'],\n",
"    proportion=True,\n",
"    filled=True, rounded=True  # styling\n",
"))" ] } ],
"metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)",
"language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 4 }