AI 第4天：深入數據分析方法與更高階應用

發布於 2024/11/04

作者 Ned Hsu 5 分鐘閱讀

以下介紹幾個深入數據分析的技巧與應用，幫助你處理更複雜的數據，並進行進階的數據洞察。

1. 進階數據探索與統計分析

假設檢定：T 檢定與卡方檢定

T 檢定：檢查數值型特徵與目標變數的顯著性差異。
卡方檢定：檢查類別型特徵與目標變數的相關性。
```python from scipy.stats import ttest_ind, chi2_contingency

T 檢定範例

group1 = df[df[‘Category’] == ‘A’][‘Numeric_Feature’] group2 = df[df[‘Category’] == ‘B’][‘Numeric_Feature’] t_stat, p_value = ttest_ind(group1, group2) print(f”T檢定結果: t值={t_stat}, p值={p_value}”)

卡方檢定範例

contingency_table = pd.crosstab(df[‘Category_Column’], df[‘Target_Column’]) chi2, p, dof, expected = chi2_contingency(contingency_table) print(f”卡方檢定結果: chi2={chi2}, p值={p}”)

---

## **2. 特徵工程的進階技術**

### **分箱與分組處理**  
分箱是將連續數據轉換為類別數據的方法，用於提高模型表現或處理非線性關係。  
```python
# 使用 pandas 分箱
df['Income_Bin'] = pd.cut(df['Income'], bins=[0, 50000, 100000, 150000], labels=['Low', 'Medium', 'High'])

# 使用分組統計創建特徵
df['Group_Average'] = df.groupby('Category_Column')['Numeric_Column'].transform('mean')

時間序列特徵提取

如果數據包含時間戳，可提取日期相關特徵（如月、日、工作日等）。

  
# 提取日期特徵
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day_of_Week'] = df['Date'].dt.dayofweek

3. 高階數據清理技術

處理多重共線性

共線性會影響模型穩定性，可以用 VIF（方差膨脹因子）檢測並消除高共線性的特徵。

  
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 計算 VIF
X = df[['Feature1', 'Feature2', 'Feature3']]
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

處理不平衡數據

使用過採樣（如 SMOTE）或欠採樣技術解決類別不平衡問題。

  
from imblearn.over_sampling import SMOTE

# SMOTE 過採樣
X, y = df.drop(columns=['Target']), df['Target']
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)

4. 進階資料視覺化

多變量關係分析：Pairplot

用於視覺化多個數值特徵與目標變數的關係。

  
sns.pairplot(df, hue='Target_Column', diag_kind='kde')
plt.show()

時間序列分析視覺化

使用滾動平均線或移動標準差分析數據趨勢與波動。

  
df['Rolling_Mean'] = df['Value'].rolling(window=7).mean()
df['Rolling_Std'] = df['Value'].rolling(window=7).std()

plt.plot(df['Date'], df['Value'], label='Original')
plt.plot(df['Date'], df['Rolling_Mean'], label='Rolling Mean')
plt.plot(df['Date'], df['Rolling_Std'], label='Rolling Std')
plt.legend()
plt.show()

5. 高階應用案例

案例 1：營銷數據分析

目標：分析產品銷售數據，找出最佳營銷策略。

  
# 聚合分析
sales_summary = df.groupby('Product_Category')['Sales'].agg(['sum', 'mean', 'count']).reset_index()

# 繪製條形圖
sns.barplot(x='Product_Category', y='sum', data=sales_summary)
plt.title("Total Sales by Product Category")
plt.show()

案例 2：客戶分群分析（K-Means）

目標：基於消費者數據進行分群，找出不同客戶群體的特徵。

  
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 標準化數據
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[['Feature1', 'Feature2', 'Feature3']])

# K-Means 分群
kmeans = KMeans(n_clusters=3, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_data)

# 可視化分群結果
sns.scatterplot(x='Feature1', y='Feature2', hue='Cluster', data=df, palette='viridis')
plt.title("Customer Segmentation")
plt.show()

6. 數據分析進階工具與資源

工具推薦

Tableau 或 Power BI：專業數據視覺化工具，用於商業報表和互動式數據分析。
SQL：用於大型數據集的查詢和預處理。
PyCaret：一個用於快速數據分析和建模的開源庫。

課後挑戰

分析股票數據，進行趨勢與波動的時間序列分析。
使用公開數據集（如 Kaggle 的信用卡欺詐檢測），清洗並處理不平衡問題後建模預測。

Software, AI

AI Python

本文章以 CC BY 4.0 授權