A Complete Introduction to Machine Learning: How AI Works and How to Implement It, Explained for Beginners

September 2, 2025

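"I keep hearing about machine learning and AI, but how does it actually work?" "Can a programming beginner understand machine learning?" "The theory looks hard, so I'd rather get something running first."

Many people have questions like these. Machine learning does rest on fairly involved mathematics, and it can feel like a high barrier for beginners.

However, **the basic ideas and mechanics of machine learning are well within reach even if you are not comfortable with math.** What matters is not memorizing complicated formulas but a practical understanding of which problems it can be applied to and how it can be used to create business value.

This article explains machine learning systematically and in beginner-friendly terms, from the basic concepts through to working Python code. By running the code yourself rather than only reading theory, you should get a feel for what makes machine learning interesting and what it can do.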

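What is machine learning?

How it differs from traditional programming

In traditional programming, a person writes explicit rules that tell the computer what to do, for example "if the temperature is 30 degrees or higher, display 'hot'".

In machine learning, by contrast, the system learns patterns from data and makes predictions or decisions based on those patterns. Its defining feature is that a person does not have to spell out the rules explicitly.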

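# Traditional programming: every threshold is written by hand
def classify_temperature(temp):
    if temp >= 30:
        return "hot"
    elif temp >= 20:
        return "warm"
    elif temp >= 10:
        return "cool"
    else:
        return "cold"

# Machine learning approach
# Learn the relationship between temperature and how people describe it from data
# -> classify new temperature readings automatically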
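The machine-learning approach outlined in the comments above can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed; the temperature readings and labels are made-up data, and the decision tree is just one convenient model choice for the sketch.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled examples: temperature (°C) and how a person described it
temps = [[2], [5], [8], [12], [15], [18], [21], [24], [27], [31], [33], [36]]
labels = ["cold", "cold", "cold", "cool", "cool", "cool",
          "warm", "warm", "warm", "hot", "hot", "hot"]

# No thresholds are written by hand; the tree infers them from the examples
clf = DecisionTreeClassifier(random_state=0)
clf.fit(temps, labels)

# Classify temperatures the model has not seen before
predictions = clf.predict([[3], [16], [25], [34]])
print(list(predictions))
```

The boundaries the tree finds depend entirely on the examples it is given, so different labeled data would yield different thresholds without any change to the code.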

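This difference makes the following possible with machine learning:

Recognizing complex patterns: it can handle problems where writing explicit rules is impractical for a person, such as finding a cat in an image or transcribing speech to text.

Discovering new insights from data: it can surface hidden patterns in data that people had not noticed.

Continuous improvement: as new data becomes available, the model can be retrained on it to improve its performance.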

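Everyday examples of machine learning

Machine learning is already deeply embedded in daily life. You may not notice it, but it is used in services such as the following:

E-commerce and entertainment

Amazon product recommendations: "Customers who bought this item also bought..."
Netflix recommendations: suggesting titles you are likely to enjoy based on viewing history
YouTube related videos: predicting what you will want to watch next from viewing patterns

Social media and communication

Facebook and Instagram news feeds: prioritizing posts you are likely to be interested in
Spam filters: automatically flagging spam based on message content
Google Translate: producing natural translations that take the meaning of the sentence into account

Transportation and navigation

Google Maps traffic prediction: suggesting the best route based on traffic data
Autonomous driving: assessing the surroundings from sensor data

As these examples show, machine learning is widely used as a technology that reproduces or supports human judgment and perception with computers.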

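Types of machine learning and their characteristics

Supervised learning: learning from labeled data

Supervised learning trains a model on data that comes with the correct answers (labels). By showing the model many examples of "for this input, the correct answer is this", it learns the relationship between inputs and outputs.

Classification: predicting a category

Example: deciding whether an email is spam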

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

# Create sample data (real projects use far more data)
emails = [
    ("Great deal! Click now!", "spam"),
    ("Here are the materials for tomorrow's meeting", "normal"),
    ("Limited sale! 50% off!", "spam"),
    ("Project progress report", "normal"),
    ("Urgent! Please check immediately", "spam"),
    ("About next week's schedule", "normal"),
    ("How to earn money for free", "spam"),
    ("Data analysis results", "normal"),
    ("Special price, today only", "spam"),
    ("Meeting room reservation confirmation", "normal")
]

# Vocabulary used to generate more sample data
normal_phrases = [
    "meeting", "materials", "report", "confirmation", "schedule", "project",
    "data", "analysis", "results", "progress", "reservation", "contact"
]
spam_phrases = [
    "deal", "limited", "special", "free", "sale", "click",
    "urgent", "today only", "earn", "bargain", "discount", "chance"
]

# Generate additional data (fixed seed so the output is reproducible)
np.random.seed(0)
for i in range(50):
    if i % 2 == 0:
        normal_email = f"{np.random.choice(normal_phrases)} {np.random.choice(normal_phrases)}"
        emails.append((normal_email, "normal"))
    else:
        spam_email = f"{np.random.choice(spam_phrases)}! {np.random.choice(spam_phrases)}"
        emails.append((spam_email, "spam"))

# Convert to a DataFrame
df = pd.DataFrame(emails, columns=['text', 'label'])
print("Data check:")
print(df.head(10))
print(f"\nNormal emails: {len(df[df['label'] == 'normal'])}")
print(f"Spam emails: {len(df[df['label'] == 'spam'])}")

In a classification problem like this, the model learns which words make an email more likely to be spam. For example, it discovers on its own that words such as "deal", "limited", and "click" raise the probability of spam.

# Convert the text data to numeric features
vectorizer = TfidfVectorizer(max_features=100)
X = vectorizer.fit_transform(df['text'])
y = df['label']

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"\n=== Classification model performance ===")
print(f"Accuracy: {accuracy:.2%}")
print("\nDetailed evaluation:")
print(classification_report(y_test, y_pred))

# Classify new emails
new_emails = [
    "About tomorrow's meeting",
    "Limited sale! Get a great deal now!"
]

for email in new_emails:
    email_vector = vectorizer.transform([email])
    prediction = model.predict(email_vector)[0]
    # predict_proba columns follow model.classes_ (alphabetical: 'normal', 'spam')
    probability = model.predict_proba(email_vector)[0]
    
    print(f"\nEmail: '{email}'")
    print(f"Prediction: {prediction}")
    print(f"Probability: normal {probability[0]:.2%}, spam {probability[1]:.2%}")

Regression: predicting a number

Example: predicting house prices

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data for house price prediction
# Real data would use features such as location, floor area, and building age
np.random.seed(42)
n_samples = 200

# Generate features (floor area, building age, distance from station)
area = np.random.normal(80, 30, n_samples)  # floor area (m²)
age = np.random.uniform(0, 30, n_samples)   # building age (years)
distance = np.random.uniform(0.5, 10, n_samples)  # distance from station (km)

# Compute the price (simulating a plausible relationship)
price = (area * 5 + (30 - age) * 3 - distance * 10 + 
         np.random.normal(0, 50, n_samples)) * 10000

# Prevent negative prices
price = np.maximum(price, 100000)

# Build the DataFrame
house_df = pd.DataFrame({
    'area': area,
    'age': age, 
    'distance': distance,
    'price': price
})

print("=== Housing data check ===")
print(house_df.head())
print(f"\nAverage price: {house_df['price'].mean():,.0f} yen")
print(f"Price range: {house_df['price'].min():,.0f} yen to {house_df['price'].max():,.0f} yen")

# Split into features and target
X = house_df[['area', 'age', 'distance']]
y = house_df['price']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a linear regression model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = reg_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"\n=== Regression model performance ===")
print(f"Mean squared error: {mse:,.0f}")
print(f"Coefficient of determination (R²): {r2:.3f}")
print(f"Mean absolute error: {np.mean(np.abs(y_test - y_pred)):,.0f} yen")

# Regression coefficients (change in price per unit change of each feature)
feature_importance = pd.DataFrame({
    'feature': ['Area (m²)', 'Building age (years)', 'Distance to station (km)'],
    'coefficient': reg_model.coef_
})
print(f"\n=== Effect of each feature ===")
print(feature_importance)

# Predict prices for new properties
new_properties = [
    [100, 5, 2],   # 100 m², 5 years old, 2 km from the station
    [60, 15, 0.5], # 60 m², 15 years old, 0.5 km from the station
    [120, 0, 3]    # 120 m², newly built, 3 km from the station
]

print(f"\n=== Price predictions for new properties ===")
for i, prop in enumerate(new_properties):
    # Use the training column names so scikit-learn does not warn about missing feature names
    prop_df = pd.DataFrame([prop], columns=X.columns)
    predicted_price = reg_model.predict(prop_df)[0]
    print(f"Property {i+1} (area: {prop[0]} m², age: {prop[1]} years, station: {prop[2]} km): {predicted_price:,.0f} yen")

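In a regression problem, the model learns relationships such as "the larger the floor area, the higher the price" and "the older the building, the lower the price" directly from the data.

Unsupervised learning: discovering hidden structure in data

Unsupervised learning finds patterns and structure in data without any labels. It is used for exploratory questions such as "is there an interesting pattern in this data?"

Clustering: grouping similar items together

Example: customer segmentation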

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import seaborn as sns

# Generate customer data
np.random.seed(123)
n_customers = 300

# Simulate different customer groups
# Younger group (low income, high purchase frequency)
young_income = np.random.normal(300, 50, 100)
young_frequency = np.random.normal(20, 5, 100)

# Middle-aged group (high income, medium purchase frequency)
middle_income = np.random.normal(600, 100, 100)
middle_frequency = np.random.normal(10, 3, 100)

# Senior group (medium income, low purchase frequency)
senior_income = np.random.normal(450, 80, 100)
senior_frequency = np.random.normal(5, 2, 100)

# Combine the data
income = np.concatenate([young_income, middle_income, senior_income])
frequency = np.concatenate([young_frequency, middle_frequency, senior_frequency])

customer_df = pd.DataFrame({
    'monthly_income': income,
    'purchase_frequency': frequency
})

print("=== Customer data check ===")
print(customer_df.describe())

# Standardize the data
scaler = StandardScaler()
customer_scaled = scaler.fit_transform(customer_df)

# K-means clustering (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(customer_scaled)

# Add the result to the DataFrame
customer_df['cluster'] = clusters

# Per-cluster statistics
print(f"\n=== Per-cluster statistics ===")
cluster_stats = customer_df.groupby('cluster').agg({
    'monthly_income': ['mean', 'std'],
    'purchase_frequency': ['mean', 'std'],
}).round(1)
print(cluster_stats)

# Visualization
plt.figure(figsize=(12, 5))

# Original data
plt.subplot(1, 2, 1)
plt.scatter(customer_df['monthly_income'], customer_df['purchase_frequency'], 
           alpha=0.6, s=50)
plt.xlabel('Monthly income (10,000 yen)')
plt.ylabel('Purchases per month')
plt.title('Customer data (before clustering)')
plt.grid(True, alpha=0.3)

# Clustering result
plt.subplot(1, 2, 2)
colors = ['red', 'blue', 'green']
for i in range(3):
    cluster_data = customer_df[customer_df['cluster'] == i]
    plt.scatter(cluster_data['monthly_income'], cluster_data['purchase_frequency'],
               c=colors[i], label=f'Cluster {i}', alpha=0.6, s=50)

plt.xlabel('Monthly income (10,000 yen)')
plt.ylabel('Purchases per month')
plt.title('Customer segmentation result')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Business interpretation
print(f"\n=== Business interpretation ===")
# K-means numbers its clusters arbitrarily, so name each cluster from its mean profile
cluster_means = customer_df.groupby('cluster')[['monthly_income', 'purchase_frequency']].mean()
interpretations = {
    cluster_means['purchase_frequency'].idxmax():
        "High-frequency buyers (young/student group): low income but frequent purchases",
    cluster_means['monthly_income'].idxmax():
        "High-value customers (middle-aged group): high income and steady purchases",
    cluster_means['purchase_frequency'].idxmin():
        "Light customers (senior group): moderate income and infrequent purchases"
}

for i in sorted(interpretations):
    cluster_size = len(customer_df[customer_df['cluster'] == i])
    print(f"Cluster {i} ({cluster_size} customers): {interpretations[i]}")

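With this clustering analysis, a company can divide its customers into three segments and plan marketing suited to each. Because K-means assigns cluster numbers arbitrarily, the segments are identified here by their profile rather than by cluster ID. For example:

High-frequency buyers: encourage purchases with discount coupons and loyalty point programs
High-value customers: raise satisfaction with premium products and VIP treatment
Light customers: increase purchase frequency with trial products and simple usage suggestions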

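Reinforcement learning: learning by trial and error

Reinforcement learning is a method for learning the best actions through interaction with an environment. It is used in areas such as games and robot control.

The basic loop:

Observe the current situation
Choose an action
Receive a reward (or penalty) as a result of the action
Learn better actions from this experience

Well-known examples include AlphaGo, which defeated professional human Go players, and work on autonomous vehicles that learn to drive safely.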
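The observe, act, reward, learn loop described above can be sketched with tabular Q-learning, one common reinforcement learning algorithm. The corridor environment, the reward of 1 at the goal, and the hyperparameter values below are all assumptions made up for this illustration.

```python
import numpy as np

# Hypothetical toy environment: a corridor of 5 cells; the goal is the rightmost cell
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                     # index 0: move left, index 1: move right
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount factor, exploration rate

rng = np.random.default_rng(0)
q_table = np.zeros((N_STATES, len(ACTIONS)))

for episode in range(200):
    state = 0                          # 1. observe the current situation
    while state != GOAL:
        # 2. choose an action: explore with probability epsilon, otherwise act greedily
        if rng.random() < epsilon:
            action = int(rng.integers(len(ACTIONS)))
        else:
            best = np.flatnonzero(q_table[state] == q_table[state].max())
            action = int(rng.choice(best))   # break ties at random
        # 3. receive a reward as a result of the action
        next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
        reward = 1.0 if next_state == GOAL else 0.0
        # 4. learn from the experience (Q-learning update)
        target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state

greedy_policy = ["left" if np.argmax(q_table[s]) == 0 else "right" for s in range(GOAL)]
print(greedy_policy)
```

Nothing in the code tells the agent that moving right is correct; that preference emerges only from the rewards it collects, which is the key difference from supervised learning.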

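The machine learning workflow

Data preparation: where 80% of success is decided

Data preparation is often said to account for roughly 80% of the total effort in a machine learning project. Without good data, even the best algorithm will not produce good results.

Data collection and quality checks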

# A practical data preprocessing example
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Sample data (contains missing values and an outlier)
np.random.seed(456)
raw_data = {
    'age': [25, 30, np.nan, 35, 28, 45, 999, 32],  # 999 is a data-entry error
    'income': [300, 450, 520, np.nan, 380, 720, 650, 580],
    'education': ['high_school', 'university', 'university', 'graduate', 
                 'high_school', 'graduate', np.nan, 'university'],
    'purchase_amount': [50000, 120000, 98000, 150000, 75000, 200000, 180000, 110000]
}

df = pd.DataFrame(raw_data)
print("=== Raw data ===")
print(df)
print(f"\nData shape: {df.shape}")
print(f"\nMissing values per column:")
print(df.isnull().sum())

Data cleaning

# 1. Detect and handle outliers
print(f"\n=== Outlier check ===")
print(f"Age statistics: {df['age'].describe()}")

# Convert the implausible age (999) to a missing value
df.loc[df['age'] > 100, 'age'] = np.nan
print(f"Age statistics after outlier handling: {df['age'].describe()}")

# 2. Handle missing values
print(f"\n=== Missing value handling ===")

# Numeric data: impute with the mean
numeric_imputer = SimpleImputer(strategy='mean')
df[['age', 'income']] = numeric_imputer.fit_transform(df[['age', 'income']])

# Categorical data: impute with the most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[['education']] = categorical_imputer.fit_transform(df[['education']])

print("After missing value handling:")
print(df.isnull().sum())
print(f"\nProcessed data:")
print(df)

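Feature engineering

# 3. Feature engineering
print(f"\n=== Feature engineering ===")

# Encode the categorical variable as numbers
# LabelEncoder assigns codes alphabetically (graduate=0, high_school=1, university=2),
# so the codes do not reflect the order of education levels
le = LabelEncoder()
df['education_encoded'] = le.fit_transform(df['education'])

# Create new features
df['income_per_age'] = df['income'] / df['age']  # income per year of age
df['purchase_ratio'] = df['purchase_amount'] / df['income']  # purchases relative to income

# Create an age category
def categorize_age(age):
    if age < 30:
        return 'young'
    elif age < 45:
        return 'middle'
    else:
        return 'senior'

df['age_category'] = df['age'].apply(categorize_age)

print("After feature engineering:")
print(df.head())

# 4. Standardize the data
print(f"\n=== Standardization ===")
numeric_features = ['age', 'income', 'income_per_age', 'purchase_ratio']
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[numeric_features] = scaler.fit_transform(df[numeric_features])

# describe() reports the sample standard deviation, so on 8 rows it shows about 1.07 rather than exactly 1
print("Statistics after standardization (mean ≈ 0, standard deviation ≈ 1):")
print(df_scaled[numeric_features].describe().round(3))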
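Standardization itself is a one-line formula, z = (x - mean) / std. The short sketch below, using made-up income values, checks that formula against `StandardScaler` and shows why a sample standard deviation (the kind pandas `describe()` reports) comes out slightly above 1 on a small dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

incomes = np.array([[300.0], [450.0], [520.0], [720.0]])  # illustrative values only

# StandardScaler uses the population standard deviation (ddof=0)
manual = (incomes - incomes.mean(axis=0)) / incomes.std(axis=0)
scaled = StandardScaler().fit_transform(incomes)

print(np.allclose(manual, scaled))       # the two computations agree
print(scaled.std(axis=0, ddof=0))        # population std of the scaled column
print(scaled.std(axis=0, ddof=1))        # sample std, larger by a factor of sqrt(n / (n - 1))
```

The gap between the two standard deviations shrinks as the number of rows grows, so it is only noticeable on very small tables.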

Model selection and training

# A practical model comparison example
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error

# Prepare the target and the explanatory variables
X = df_scaled[['age', 'income', 'education_encoded', 'income_per_age']]
y = df['purchase_amount']

# Compare several models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Support Vector Regression': SVR(kernel='rbf')
}

print("=== Model comparison (cross-validation) ===")
results = {}

for name, model in models.items():
    # 5-fold cross-validation scored by mean absolute error (MAE).
    # With only 8 rows some folds hold a single sample, where R² is undefined,
    # so MAE is used for the comparison instead.
    cv_scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
    results[name] = {
        'mean_mae': cv_scores.mean(),
        'std_mae': cv_scores.std()
    }
    print(f"{name}:")
    print(f"  MAE: {cv_scores.mean():,.0f} (±{cv_scores.std():,.0f})")

# Detailed analysis with the best model (lowest MAE)
best_model_name = min(results.keys(), key=lambda k: results[k]['mean_mae'])
best_model = models[best_model_name]

print(f"\n=== Best model: {best_model_name} ===")

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
best_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = best_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Performance on the test data:")
print(f"  Mean absolute error: {mae:,.0f} yen")
print(f"  R² score: {r2:.3f}")

# Feature importance (available for tree-based models such as Random Forest)
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': ['Age', 'Income', 'Education level', 'Income per year of age'],
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print(f"\nFeature importance:")
    print(feature_importance)

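Model evaluation and improvement

Evaluating a machine learning model's performance correctly is critical before putting it into practical use.

Evaluation metrics for classification

# Detailed evaluation example for a classification problem
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve, auc
from sklearn.model_selection import learning_curve

# Re-create the spam classifier from earlier: df, model, y_test, and y_pred
# were overwritten by the purchase-amount regression example
print("=== Detailed evaluation of the classification model ===")
email_df = pd.DataFrame(emails, columns=['text', 'label'])
X_mail = vectorizer.transform(email_df['text'])
X_mail_train, X_mail_test, y_mail_train, y_test = train_test_split(
    X_mail, email_df['label'], test_size=0.3, random_state=42)
spam_model = MultinomialNB().fit(X_mail_train, y_mail_train)
y_pred = spam_model.predict(X_mail_test)

# Build the confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=['normal', 'spam'])
print("Confusion matrix:")
print(pd.DataFrame(cm, 
                  index=['Actual: Normal', 'Actual: Spam'],
                  columns=['Predicted: Normal', 'Predicted: Spam']))

# Compute each metric
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred, pos_label='spam')
recall = recall_score(y_test, y_pred, pos_label='spam')
f1 = f1_score(y_test, y_pred, pos_label='spam')

print(f"\nPerformance metrics:")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 score: {f1:.3f}")

print(f"\nBusiness interpretation:")
print(f"- Precision: {precision:.1%} of the emails flagged as spam are actually spam")
print(f"- Recall: {recall:.1%} of the actual spam emails were detected")
print(f"- F1 score: a single measure balancing precision and recall")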
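Precision, recall, and F1 all derive from the four counts in the confusion matrix. As a worked check, the sketch below computes them by hand on a small set of made-up labels and compares the results with scikit-learn's functions; `y_true` and `y_hat` are illustrative names, not variables from the example above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels for illustration: 4 spam and 6 normal emails
y_true = ["spam", "spam", "spam", "spam",
          "normal", "normal", "normal", "normal", "normal", "normal"]
y_hat = ["spam", "spam", "spam", "normal",
         "spam", "normal", "normal", "normal", "normal", "normal"]

# With labels=['normal', 'spam'], rows are actual classes and columns are predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=["normal", "spam"]).ravel()

precision_manual = tp / (tp + fp)   # of everything flagged as spam, how much was spam
recall_manual = tp / (tp + fn)      # of all real spam, how much was caught
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"precision={precision_manual:.2f}, recall={recall_manual:.2f}, f1={f1_manual:.2f}")
print(precision_score(y_true, y_hat, pos_label="spam"),
      recall_score(y_true, y_hat, pos_label="spam"),
      f1_score(y_true, y_hat, pos_label="spam"))
```

Which metric to prioritize depends on the cost of each error: for a spam filter, a false positive hides a legitimate email, so precision often matters more than recall.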

Detecting overfitting with learning curves

# Plot a learning curve to check for overfitting
def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, n_jobs=-1, 
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='r2'
    )
    
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, 
                     alpha=0.1, color='blue')
    
    plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation score')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, 
                     alpha=0.1, color='red')
    
    plt.xlabel('Training set size')
    plt.ylabel('R² score')
    plt.title(f'Learning curve: {title}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    
    # Check for overfitting (the 0.1 gap threshold is a rough rule of thumb)
    final_train_score = train_mean[-1]
    final_val_score = val_mean[-1]
    gap = final_train_score - final_val_score
    
    print(f"Final training score: {final_train_score:.3f}")
    print(f"Final validation score: {final_val_score:.3f}")
    print(f"Score gap: {gap:.3f}")
    
    if gap > 0.1:
        print("-> Signs of overfitting. More data or regularization is needed")
    else:
        print("-> The model appears to be fitting appropriately")

# Plot the learning curve on the 200-row housing data from the regression example;
# the 8-row purchase data is too small for 5-fold R² scores to be defined
plot_learning_curve(RandomForestRegressor(n_estimators=100, random_state=42), 
                   house_df[['area', 'age', 'distance']], house_df['price'], "Random Forest")

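Applying machine learning to real business problems

Building a sales forecasting system

Sales forecasting is one of the most common business applications of machine learning, so let's look at it in detail.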

# A practical sales forecasting system
from datetime import datetime, timedelta
import pandas as pd

# Generate time-series sales data
np.random.seed(789)
start_date = datetime(2022, 1, 1)
dates = [start_date + timedelta(days=x) for x in range(365*2)]

# Simulate the factors that affect sales
seasonal_effect = [200 * np.sin(2 * np.pi * i / 365) for i in range(730)]
trend_effect = [i * 0.5 for i in range(730)]
day_of_week_effect = []
promotional_effect = []

for i, date in enumerate(dates):
    # Day-of-week effect (sales are 150 higher on weekends)
    if date.weekday() >= 5:  # Saturday or Sunday
        day_of_week_effect.append(150)
    else:
        day_of_week_effect.append(0)
    
    # Promotion effect (start and end of the month)
    if date.day <= 5 or date.day >= 25:
        promotional_effect.append(100)
    else:
        promotional_effect.append(0)

# Compute total sales
base_sales = 1000
noise = np.random.normal(0, 50, 730)
daily_sales = (base_sales + 
               np.array(seasonal_effect) + 
               np.array(trend_effect) + 
               np.array(day_of_week_effect) + 
               np.array(promotional_effect) + 
               noise)

# Build the DataFrame
sales_df = pd.DataFrame({
    'date': dates,
    'sales': daily_sales,
    'day_of_week': [d.weekday() for d in dates],
    'month': [d.month for d in dates],
    'day_of_month': [d.day for d in dates],
    'is_weekend': [1 if d.weekday() >= 5 else 0 for d in dates],
    'is_promotion_period': [1 if d.day <= 5 or d.day >= 25 else 0 for d in dates]
})

# Add moving averages and lag features.
# shift(1) comes first so each window covers only previous days; without it the
# rolling mean would include the current day's sales and leak the target.
sales_df['sales_7day_avg'] = sales_df['sales'].shift(1).rolling(window=7).mean()
sales_df['sales_30day_avg'] = sales_df['sales'].shift(1).rolling(window=30).mean()
sales_df['sales_lag_1'] = sales_df['sales'].shift(1)  # previous day's sales

# Drop rows with missing values
sales_df = sales_df.dropna()

print("=== Sales data check ===")
print(sales_df.head(10))
print(f"\nSales statistics:")
print(sales_df['sales'].describe())

# Prepare features and target
feature_cols = ['day_of_week', 'month', 'day_of_month', 'is_weekend', 
               'is_promotion_period', 'sales_7day_avg', 'sales_30day_avg', 'sales_lag_1']
X = sales_df[feature_cols]
y = sales_df['sales']

# This is time-series data, so split in chronological order
split_date = sales_df['date'].quantile(0.8)
train_mask = sales_df['date'] <= split_date
test_mask = sales_df['date'] > split_date

X_train, X_test = X[train_mask], X[test_mask]
y_train, y_test = y[train_mask], y[test_mask]

print(f"\nTraining period: {sales_df[train_mask]['date'].min()} ~ {sales_df[train_mask]['date'].max()}")
print(f"Test period: {sales_df[test_mask]['date'].min()} ~ {sales_df[test_mask]['date'].max()}")

# Forecast sales with several models
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

print("=== Sales forecasting model comparison ===")
model_results = {}

for name, model in models.items():
    # Train
    model.fit(X_train, y_train)
    
    # Predict
    y_pred = model.predict(X_test)
    
    # Evaluate
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
    
    model_results[name] = {
        'MAE': mae,
        'RMSE': rmse,
        'MAPE': mape,
        'model': model,
        'predictions': y_pred
    }
    
    print(f"\n{name}:")
    print(f"  Mean absolute error (MAE): {mae:.1f}")
    print(f"  Root mean squared error (RMSE): {rmse:.1f}")
    print(f"  Mean absolute percentage error (MAPE): {mape:.1f}%")

# Select the best model (lowest MAPE)
best_model_name = min(model_results.keys(), key=lambda k: model_results[k]['MAPE'])
best_model = model_results[best_model_name]['model']
best_predictions = model_results[best_model_name]['predictions']

print(f"\nBest model: {best_model_name} (MAPE: {model_results[best_model_name]['MAPE']:.1f}%)")

# Visualize the results
plt.figure(figsize=(15, 10))

# Actual vs. predicted sales
plt.subplot(2, 2, 1)
plt.scatter(y_test, best_predictions, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual sales')
plt.ylabel('Predicted sales')
plt.title(f'Prediction accuracy: {best_model_name}')
plt.grid(True, alpha=0.3)

# Comparison over time
plt.subplot(2, 2, 2)
test_dates = sales_df[test_mask]['date'].values
plt.plot(test_dates, y_test.values, label='Actual sales', alpha=0.7)
plt.plot(test_dates, best_predictions, label='Predicted sales', alpha=0.7)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Time-series forecast')
plt.legend()
plt.xticks(rotation=45)

# Feature importance (tree-based models only)
if hasattr(best_model, 'feature_importances_'):
    plt.subplot(2, 2, 3)
    importance_df = pd.DataFrame({
        'feature': feature_cols,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=True)
    
    plt.barh(importance_df['feature'], importance_df['importance'])
    plt.xlabel('Importance')
    plt.title('Feature importance')

# Distribution of prediction errors
plt.subplot(2, 2, 4)
errors = y_test - best_predictions
plt.hist(errors, bins=20, alpha=0.7, edgecolor='black')
plt.xlabel('Prediction error')
plt.ylabel('Frequency')
plt.title('Distribution of prediction errors')
plt.axvline(x=0, color='red', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

# Business insights
print(f"\n=== Business insights ===")
if hasattr(best_model, 'feature_importances_'):
    top_features = importance_df.tail(3)
    print("Factors with the largest influence on sales:")
    for _, row in top_features.iterrows():
        print(f"  {row['feature']}: {row['importance']:.3f}")

# The MAPE bands below are rough guidelines, not fixed standards
print(f"\nForecast accuracy assessment:")
mape = model_results[best_model_name]['MAPE']
if mape < 5:
    print("-> Very high accuracy. Ready for practical use")
elif mape < 10:
    print("-> High accuracy. Usable for business decisions")
elif mape < 20:
    print("-> Moderate accuracy. Use as a reference indicator")
else:
    print("-> Low accuracy. The model needs improvement")
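Two details matter when building features and splits for forecasting: rolling features should only look at past days, and validation folds should always train on earlier data than they test on. The sketch below illustrates both on a toy series; the series values and the use of scikit-learn's `TimeSeriesSplit` are assumptions for illustration, not part of the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

sales = pd.Series(np.arange(1.0, 11.0))  # toy daily sales: 1, 2, ..., 10

# A plain rolling mean includes the current day, i.e. the value being predicted
leaky = sales.rolling(window=3).mean()
# Shifting by one day first restricts the window to previous days only
safe = sales.shift(1).rolling(window=3).mean()

print(leaky.iloc[3], safe.iloc[3])  # day 4: mean of days 2-4 vs. mean of days 1-3

# TimeSeriesSplit yields folds whose training indices all precede the test indices
tscv = TimeSeriesSplit(n_splits=3)
folds = [(int(train_idx.max()), int(test_idx.min()))
         for train_idx, test_idx in tscv.split(sales)]
print(folds)  # (last training index, first test index) per fold
```

Compared with a single chronological cut, evaluating over several such folds gives a steadier estimate of how the model behaves as more history accumulates.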

Customer churn prediction system

As another important business application, let's build a system that predicts customer churn.

# A customer churn prediction example
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

# Generate customer churn data
np.random.seed(999)
n_customers = 1000

# Generate customer features.
# Values are clipped to realistic minimums here, before the churn probability is
# computed, so the labels are consistent with the stored features.
customer_data = {
    'tenure_months': np.maximum(np.random.normal(24, 12, n_customers), 1),  # contract length
    'monthly_charges': np.maximum(np.random.normal(70, 20, n_customers), 20),  # monthly fee
    'total_charges': np.maximum(np.random.normal(1600, 800, n_customers), 100),  # total paid
    'support_calls': np.random.poisson(2, n_customers),  # number of support inquiries
    'contract_type': np.random.choice([0, 1, 2], n_customers, p=[0.4, 0.3, 0.3]),  # contract type
    'payment_method': np.random.choice([0, 1, 2], n_customers, p=[0.5, 0.3, 0.2]),  # payment method
}

# Effects on the log-odds of churn (simulating a plausible relationship)
tenure_effect = -0.05 * customer_data['tenure_months']  # longer tenure, less likely to churn
charges_effect = 0.01 * customer_data['monthly_charges']  # higher fee, more likely to churn
support_effect = 0.3 * customer_data['support_calls']  # more support inquiries, more likely to churn

# Compute the churn probability with the logistic function
linear_combination = -2 + tenure_effect + charges_effect + support_effect
churn_probability = 1 / (1 + np.exp(-linear_combination))

# Simulate whether each customer actually churns
customer_data['churned'] = np.random.binomial(1, churn_probability, n_customers)

# Convert to a DataFrame
churn_df = pd.DataFrame(customer_data)

print("=== Customer churn data check ===")
print(churn_df.head())
print(f"\nOverall churn rate: {churn_df['churned'].mean():.1%}")
print(f"Churned customers: {churn_df['churned'].sum()}")
print(f"Retained customers: {len(churn_df) - churn_df['churned'].sum()}")

# Analyze churn factors
print(f"\n=== Churn factor analysis ===")
churn_stats = churn_df.groupby('churned').agg({
    'tenure_months': 'mean',
    'monthly_charges': 'mean',
    'support_calls': 'mean'
}).round(1)
churn_stats.index = ['Retained', 'Churned']
print(churn_stats)

# Build the churn prediction models
from sklearn.model_selection import StratifiedKFold

# Prepare features and target
feature_cols = ['tenure_months', 'monthly_charges', 'total_charges', 
               'support_calls', 'contract_type', 'payment_method']
X = churn_df[feature_cols]
y = churn_df['churned']

# Split the data (stratified so both sets keep the same churn rate)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                   random_state=42, stratify=y)

# Compare several models
# max_iter is raised because the unscaled features can keep the default 100 iterations from converging
churn_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

print("=== Churn prediction model comparison ===")
churn_results = {}

for name, model in churn_models.items():
    # Train
    model.fit(X_train, y_train)
    
    # Predict
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    churn_results[name] = {
        'accuracy': accuracy,
        'roc_auc': roc_auc,
        'model': model,
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }
    
    print(f"\n{name}:")
    print(f"  Accuracy: {accuracy:.3f}")
    print(f"  ROC-AUC: {roc_auc:.3f}")

# Detailed evaluation of the best model (highest ROC-AUC)
best_churn_model = max(churn_results.keys(), key=lambda k: churn_results[k]['roc_auc'])
best_model = churn_results[best_churn_model]['model']
best_pred_proba = churn_results[best_churn_model]['probabilities']

print(f"\n=== Best model: {best_churn_model} ===")
print("\nClassification report:")
print(classification_report(y_test, churn_results[best_churn_model]['predictions']))

# Plot the ROC curve
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
fpr, tpr, thresholds = roc_curve(y_test, best_pred_proba)
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc_score(y_test, best_pred_proba):.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend()
plt.grid(True, alpha=0.3)

# Distribution of predicted churn probabilities
plt.subplot(2, 2, 2)
plt.hist(best_pred_proba[y_test.values == 0], bins=20, alpha=0.7, label='Retained customers', density=True)
plt.hist(best_pred_proba[y_test.values == 1], bins=20, alpha=0.7, label='Churned customers', density=True)
plt.xlabel('Churn probability')
plt.ylabel('Density')
plt.title('Distribution of churn probability')
plt.legend()

# Feature importance (tree-based models only)
if hasattr(best_model, 'feature_importances_'):
    plt.subplot(2, 2, 3)
    importance_df = pd.DataFrame({
        'feature': feature_cols,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=True)
    
    plt.barh(importance_df['feature'], importance_df['importance'])
    plt.xlabel('Importance')
    plt.title('Feature importance')

# Confusion matrix
plt.subplot(2, 2, 4)
cm = confusion_matrix(y_test, churn_results[best_churn_model]['predictions'])
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
           xticklabels=['Retained', 'Churned'], yticklabels=['Retained', 'Churned'])
plt.title('Confusion matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')

plt.tight_layout()
plt.show()

# Business application: identifying high-risk customers
print(f"\n=== Business application: identifying high-risk customers ===")

# Identify customers with a high churn probability
high_risk_threshold = 0.7
test_customers = X_test.copy()
test_customers['churn_probability'] = best_pred_proba
test_customers['actual_churn'] = y_test.values

high_risk_customers = test_customers[test_customers['churn_probability'] > high_risk_threshold]
print(f"High-risk customers with churn probability above {high_risk_threshold:.0%}: {len(high_risk_customers)}")

if len(high_risk_customers) > 0:
    print(f"Profile of high-risk customers:")
    print(high_risk_customers[feature_cols].mean().round(1))
    
    # Actual churn rate
    actual_churn_rate = high_risk_customers['actual_churn'].mean()
    print(f"Actual churn rate among high-risk customers: {actual_churn_rate:.1%}")
    
    # Suggested actions
    print(f"\nRecommended actions:")
    if high_risk_customers['support_calls'].mean() > 3:
        print("- Improve support quality (customers with many inquiries are more likely to churn)")
    if high_risk_customers['monthly_charges'].mean() > 80:
        print("- Offer special discounts or premium services to high-paying customers")
    if high_risk_customers['tenure_months'].mean() < 12:
        print("- Strengthen the onboarding program for new customers")
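The choice of risk threshold controls a trade-off: a lower threshold flags more customers and catches more churners, at the cost of more false alarms. The sketch below makes that visible with made-up probabilities and outcomes; the three thresholds compared are arbitrary illustrative values.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up predicted churn probabilities and actual outcomes (1 = churned)
proba = np.array([0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05])
actual = np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 0])

rows = []
for threshold in (0.3, 0.5, 0.7):
    flagged = (proba > threshold).astype(int)   # customers treated as high risk
    rows.append((threshold,
                 int(flagged.sum()),
                 precision_score(actual, flagged),
                 recall_score(actual, flagged)))

for threshold, n_flagged, precision, recall in rows:
    print(f"threshold={threshold:.1f}: flagged={n_flagged}, "
          f"precision={precision:.2f}, recall={recall:.2f}")
```

In practice the threshold is usually set from business constraints, such as how many customers a retention team can contact and how costly a missed churner is compared with an unnecessary offer.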