一、为什么DWS层需要宽表建模?
在数据仓库中,DWS层(Data Warehouse Service) 是连接明细数据(DWD层)与上层应用(ADS层)的核心枢纽。它的核心目标是:
- 轻度聚合:对DWD层的明细数据进行按主题的聚合(如按天、按区域、按商品类目)。
- 服务化接口:为报表、BI工具、实时看板等提供标准化的宽表数据。
- 性能优化:通过预计算和索引设计,减少上层应用的查询复杂度。
宽表建模 是DWS层的核心手段,其本质是将多个维度表和实际表的字段整合到一张表中,形成”即插即用”的数据服务层。
二、10个经典宽表建模场景与完整代码
1. 商品主题宽表:dws_goods_detail
场景:电商/零售企业需要分析商品销售趋势、库存周转率、复购率等指标。
1)核心字段
|
字段名 |
类型 |
描述 |
示例值 |
|
goods_id |
VARCHAR(32) |
商品唯一ID |
“G1001” |
|
category_id |
VARCHAR(16) |
商品类目ID |
“C5001” |
|
category_name |
VARCHAR(64) |
商品类目名称 |
“手机数码” |
|
brand_id |
VARCHAR(16) |
品牌ID |
“B2001” |
|
brand_name |
VARCHAR(32) |
品牌名称 |
“华为” |
|
price |
DECIMAL(10,2) |
商品价格 |
2999.00 |
|
sale_count |
INT |
销售数量(日) |
150 |
|
sale_amount |
DECIMAL(12,2) |
销售金额(日) |
449850.00 |
|
return_rate |
DECIMAL(5,2) |
退货率(%) |
1.50 |
|
inventory |
INT |
当前库存量 |
2000 |
|
inventory_turnover_days |
DECIMAL(5,2) |
库存周转天数 |
15.20 |
|
repurchase_rate |
DECIMAL(5,2) |
复购率(%) |
25.30 |
|
update_time |
DATETIME |
数据更新时间 |
“2023-09-07 15:30:00” |
2)建表SQL
CREATE TABLE dws_goods_detail (
goods_id VARCHAR(32) COMMENT'商品唯一ID',
category_id VARCHAR(16) COMMENT'商品类目ID',
category_name VARCHAR(64) COMMENT'商品类目名称',
brand_id VARCHAR(16) COMMENT'品牌ID',
brand_name VARCHAR(32) COMMENT'品牌名称',
price DECIMAL(10,2) COMMENT'商品价格',
sale_count INTCOMMENT'销售数量(日)',
sale_amount DECIMAL(12,2) COMMENT'销售金额(日)',
return_rate DECIMAL(5,2) COMMENT'退货率(%)',
inventory INTCOMMENT'当前库存量',
inventory_turnover_days DECIMAL(5,2) COMMENT'库存周转天数',
repurchase_rate DECIMAL(5,2) COMMENT'复购率(%)',
update_time DATETIME COMMENT'数据更新时间',
PRIMARY KEY (goods_id, update_time)
) COMMENT'商品主题宽表'
DISTRIBUTEDBYHASH(goods_id) BUCKETS 32
PROPERTIES (
"replication_num" = "1"
);
3)开发SQL代码
INSERT INTO dws_goods_detail
SELECT
g.goods_id,
g.category_id,
c.category_name,
g.brand_id,
b.brand_name,
g.price,
SUM(o.quantity) AS sale_count,
SUM(o.quantity * o.price) AS sale_amount,
ROUND(SUM(o.return_quantity) * 100.0 / NULLIF(SUM(o.quantity), 0), 2) AS return_rate,
i.inventory,
ROUND(30.0 * SUM(o.quantity) / NULLIF(SUM(o.quantity), 0), 2) AS inventory_turnover_days,
ROUND(COUNT(DISTINCT o.user_id) * 100.0 / NULLIF(COUNT(DISTINCT o.user_id), 0), 2) AS repurchase_rate,
NOW() AS update_time
FROM dwd_order_detail o
JOIN dwd_goods_info g ON o.goods_id = g.goods_id
JOIN dwd_category_info c ON g.category_id = c.category_id
JOIN dwd_brand_info b ON g.brand_id = b.brand_id
JOIN dwd_inventory i ON g.goods_id = i.goods_id
WHERE o.order_date = CURDATE()
GROUPBY g.goods_id, g.category_id, c.category_name, g.brand_id, b.brand_name, g.price, i.inventory;
2. 用户主题宽表:dws_user_detail
场景:用户画像分析、会员等级管理、营销效果评估。
1)核心字段
|
字段名 |
类型 |
描述 |
示例值 |
|
user_id |
VARCHAR(32) |
用户唯一ID |
“U2001” |
|
register_channel |
VARCHAR(32) |
注册渠道 |
“APP” |
|
member_level |
VARCHAR(16) |
会员等级 |
“VIP3” |
|
region_id |
VARCHAR(16) |
地域ID |
“R1001” |
|
region_name |
VARCHAR(64) |
地域名称 |
“广东省” |
|
device_type |
VARCHAR(16) |
设备类型 |
“Android” |
|
active_days |
INT |
活跃天数(30天) |
25 |
|
order_count |
INT |
下单次数(30天) |
8 |
|
avg_order_amount |
DECIMAL(10,2) |
平均客单价 |
350.50 |
|
lifecycle_value |
DECIMAL(12,2) |
生命周期价值 |
2800.00 |
|
last_order_date |
DATE |
最后下单日期 |
“2023-09-05” |
|
update_time |
DATETIME |
数据更新时间 |
“2023-09-07 15:30:00” |
2)建表SQL
CREATE TABLE dws_user_detail (
user_id VARCHAR(32) COMMENT'用户唯一ID',
register_channel VARCHAR(32) COMMENT'注册渠道',
member_level VARCHAR(16) COMMENT'会员等级',
region_id VARCHAR(16) COMMENT'地域ID',
region_name VARCHAR(64) COMMENT'地域名称',
device_type VARCHAR(16) COMMENT'设备类型',
active_days INTCOMMENT'活跃天数(30天)',
order_count INTCOMMENT'下单次数(30天)',
avg_order_amount DECIMAL(10,2) COMMENT'平均客单价',
lifecycle_value DECIMAL(12,2) COMMENT'生命周期价值',
last_order_date DATECOMMENT'最后下单日期',
update_time DATETIME COMMENT'数据更新时间',
PRIMARY KEY (user_id)
) COMMENT'用户主题宽表'
DISTRIBUTEDBYHASH(user_id) BUCKETS 32
PROPERTIES (
"replication_num" = "1"
);
3)数据SQL
INSERT INTO dws_user_detail
SELECT
u.user_id,
u.register_channel,
u.member_level,
u.region_id,
r.region_name,
u.device_type,
COUNT(DISTINCT o.order_date) AS active_days,
COUNT(o.order_id) AS order_count,
AVG(o.amount) AS avg_order_amount,
SUM(o.amount) AS lifecycle_value,
MAX(o.order_date) AS last_order_date,
NOW() AS update_time
FROM dwd_user_info u
JOIN dwd_region_info r ON u.region_id = r.region_id
JOIN dwd_order_detail o ON u.user_id = o.user_id
WHERE o.order_date >= DATE_SUB(CURRENT_DATE, 30)
GROUPBY u.user_id, u.register_channel, u.member_level, u.region_id, r.region_name, u.device_type;
3. 订单主题宽表:dws_order_detail
场景:订单履约分析、退款率监控、供应链优化。
1)核心字段设计
|
字段名 |
类型 |
描述 |
示例值 |
|
order_id |
VARCHAR(32) |
订单ID |
“O202309070001” |
|
order_date |
DATE |
订单日期 |
“2023-09-07” |
|
pay_method |
VARCHAR(16) |
支付方式 |
“微信支付” |
|
delivery_channel |
VARCHAR(32) |
配送渠道 |
“顺丰速运” |
|
order_status |
VARCHAR(16) |
订单状态 |
“已完成” |
|
amount |
DECIMAL(12,2) |
订单金额 |
399.00 |
|
refund_amount |
DECIMAL(12,2) |
退款金额 |
0.00 |
|
refund_rate |
DECIMAL(5,2) |
退款率(%) |
0.00 |
|
delivery_time |
INT |
配送时长(小时) |
24 |
|
delivery_cost |
DECIMAL(8,2) |
物流成本 |
10.00 |
|
update_time |
DATETIME |
数据更新时间 |
“2023-09-07 15:30:00” |
2)建表SQL
CREATE TABLE dws_order_detail (
order_id VARCHAR(32) COMMENT'订单ID',
order_date DATECOMMENT'订单日期',
pay_method VARCHAR(16) COMMENT'支付方式',
delivery_channel VARCHAR(32) COMMENT'配送渠道',
order_status VARCHAR(16) COMMENT'订单状态',
amount DECIMAL(12,2) COMMENT'订单金额',
refund_amount DECIMAL(12,2) COMMENT'退款金额',
refund_rate DECIMAL(5,2) COMMENT'退款率(%)',
delivery_time INTCOMMENT'配送时长(小时)',
delivery_cost DECIMAL(8,2) COMMENT'物流成本',
update_time DATETIME COMMENT'数据更新时间',
PRIMARY KEY (order_id)
) COMMENT'订单主题宽表'
DISTRIBUTEDBYHASH(order_id) BUCKETS 32
PROPERTIES (
"replication_num" = "1"
);
3)数据SQL示例
INSERT INTO dws_order_detail
SELECT
o.order_id,
o.order_date,
o.pay_method,
o.delivery_channel,
o.order_status,
o.amount,
COALESCE(r.refund_amount, 0) AS refund_amount,
ROUND(COALESCE(r.refund_amount, 0) * 100.0 / NULLIF(o.amount, 0), 2) AS refund_rate,
DATEDIFF(o.delivery_time, o.order_time) * 24 + EXTRACT(HOURFROM (o.delivery_time - o.order_time)) AS delivery_time,
d.delivery_cost,
NOW() AS update_time
FROM dwd_order_fact o
LEFTJOIN dwd_refund_info r ON o.order_id = r.order_id
LEFTJOIN dwd_delivery_info d ON o.order_id = d.order_id
WHERE o.order_date = CURDATE();
4. 地区主题宽表:dws_region_detail
场景:区域市场分析、门店选址评估、区域库存分配。
1)核心字段设计
|
字段名 |
类型 |
描述 |
示例值 |
|
region_id |
VARCHAR(16) |
地域ID |
“R1001” |
|
province |
VARCHAR(32) |
省份 |
“广东省” |
|
city |
VARCHAR(32) |
城市 |
“广州市” |
|
district |
VARCHAR(32) |
区域 |
“天河区” |
|
store_id |
VARCHAR(16) |
门店ID |
“S1001” |
|
sales_amount |
DECIMAL(12,2) |
销售额(日) |
158000.00 |
|
customer_count |
INT |
客户数(日) |
350 |
|
conversion_rate |
DECIMAL(5,2) |
转化率(%) |
12.50 |
|
inventory_turnover |
DECIMAL(5,2) |
库存周转率 |
5.20 |
|
avg_order_value |
DECIMAL(10,2) |
平均订单价值 |
450.00 |
|
update_time |
DATETIME |
数据更新时间 |
“2023-09-07 15:30:00” |
2)建表SQL
CREATE TABLE dws_region_detail (
region_id VARCHAR(16) COMMENT'地域ID',
province VARCHAR(32) COMMENT'省份',
city VARCHAR(32) COMMENT'城市',
district VARCHAR(32) COMMENT'区域',
store_id VARCHAR(16) COMMENT'门店ID',
sales_amount DECIMAL(12,2) COMMENT'销售额(日)',
customer_count INTCOMMENT'客户数(日)',
conversion_rate DECIMAL(5,2) COMMENT'转化率(%)',
inventory_turnover DECIMAL(5,2) COMMENT'库存周转率',
avg_order_value DECIMAL(10,2) COMMENT'平均订单价值',
update_time DATETIME COMMENT'数据更新时间',
PRIMARY KEY (region_id, store_id)
) COMMENT'地区主题宽表'
DISTRIBUTEDBYHASH(region_id) BUCKETS 32
PROPERTIES (
"replication_num" = "1"
);
3)开发SQL示例
INSERT INTO dws_region_detail
SELECT
r.region_id,
r.province,
r.city,
r.district,
s.store_id,
SUM(o.amount) AS sales_amount,
COUNT(DISTINCT o.user_id) AS customer_count,
ROUND(COUNT(DISTINCT o.user_id) * 100.0 / NULLIF(COUNT(o.order_id), 0), 2) AS conversion_rate,
ROUND(SUM(o.quantity) * 30.0 / NULLIF(SUM(i.inventory), 0), 2) AS inventory_turnover,
AVG(o.amount) AS avg_order_value,
NOW() AS update_time
FROM dwd_order_detail o
JOIN dwd_region_info r ON o.region_id = r.region_id
JOIN dwd_store_info s ON r.region_id = s.region_id
JOIN dwd_inventory i ON o.goods_id = i.goods_id
WHERE o.order_date = CURDATE()
GROUPBY r.region_id, r.province, r.city, r.district, s.store_id;
5. 渠道主题宽表:dws_channel_detail
场景:营销渠道效果分析、ROI计算、投放策略优化。
1)核心字段设计
|
字段名 |
类型 |
描述 |
示例值 |
|
channel_id |
VARCHAR(16) |
渠道ID |
“C1001” |
|
channel_name |
VARCHAR(32) |
渠道名称 |
“抖音信息流” |
|
campaign_id |
VARCHAR(16) |
活动ID |
“CAM20230901” |
|
campaign_name |
VARCHAR(64) |
活动名称 |
“9月新品促销” |
|
impressions |
BIGINT |
曝光量 |
50000 |
|
clicks |
BIGINT |
点击量 |
2500 |
|
click_rate |
DECIMAL(5,2) |
点击率(%) |
5.00 |
|
conversions |
INT |
转化量 |
120 |
|
conversion_rate |
DECIMAL(5,2) |
转化率(%) |
4.80 |
|
cac |
DECIMAL(10,2) |
获客成本 |
83.33 |
|
update_time |
DATETIME |
数据更新时间 |
“2023-09-07 15:30:00” |
2)建表SQL
CREATE TABLE dws_channel_detail (
channel_id VARCHAR(16) COMMENT'渠道ID',
channel_name VARCHAR(32) COMMENT'渠道名称',
campaign_id VARCHAR(16) COMMENT'活动ID',
campaign_name VARCHAR(64) COMMENT'活动名称',
impressions BIGINTCOMMENT'曝光量',
clicks BIGINTCOMMENT'点击量',
click_rate DECIMAL(5,2) COMMENT'点击率(%)',
conversions INTCOMMENT'转化量',
conversion_rate DECIMAL(5,2) COMMENT'转化率(%)',
cac DECIMAL(10,2) COMMENT'获客成本',
update_time DATETIME COMMENT'数据更新时间',
PRIMARY KEY (channel_id, campaign_id)
) COMMENT'渠道主题宽表'
DISTRIBUTEDBYHASH(channel_id) BUCKETS 32
PROPERTIES (
"replication_num" = "1"
);
3)开发SQL代码
INSERT INTO dws_channel_detail
SELECT
c.channel_id,
c.channel_name,
cam.campaign_id,
cam.campaign_name,
SUM(a.impressions) AS impressions,
SUM(a.clicks) AS clicks,
ROUND(SUM(a.clicks) * 100.0 / NULLIF(SUM(a.impressions), 0), 2) AS click_rate,
COUNT(DISTINCT o.order_id) AS conversions,
ROUND(COUNT(DISTINCT o.order_id) * 100.0 / NULLIF(SUM(a.clicks), 0), 2) AS conversion_rate,
ROUND(SUM(a.cost) / NULLIF(COUNT(DISTINCT o.user_id), 0), 2) AS cac,
NOW() AS update_time
FROM dwd_ad_activity a
JOIN dwd_channel_info c ON a.channel_id = c.channel_id
JOIN dwd_campaign_info cam ON a.campaign_id = cam.campaign_id
LEFTJOIN dwd_order_detail o ON a.user_id = o.user_id
WHERE a.date = CURDATE()
GROUPBY c.channel_id, c.channel_name, cam.campaign_id, cam.campaign_name;
6. 供应链主题宽表:dws_supplier_detail
场景:供应商绩效评估、采购成本分析、库存周转监控。
1)核心字段设计
|
字段名 |
类型 |
描述 |
示例值 |
|
supplier_id |
VARCHAR(16) |
供应商ID |
“S2001” |
|
supplier_name |
VARCHAR(64) |
供应商名称 |
“华强电子” |
|
category_id |
VARCHAR(16) |
采购品类ID |
“C5001” |
|
category_name |
VARCHAR(64) |
采购品类名称 |
“手机配件” |
|
purchase_amount |
DECIMAL(12,2) |
采购金额(月) |
125000.00 |
|
delivery_on_time_rate |
DECIMAL(5,2) |
交货准时率(%) |
95.50 |
|
inventory_turnover |
DECIMAL(5,2) |
库存周转率 |
6.20 |
|
purchase_cost |
DECIMAL(10,2) |
采购成本 |
85.00 |
|
lead_time |
INT |
采购提前期(天) |
15 |
|
update_time |
DATETIME |
数据更新时间 |
“2023-09-07 15:30:00” |
2)建表SQL
CREATE TABLE dws_supplier_detail (
supplier_id VARCHAR(16) COMMENT'供应商ID',
supplier_name VARCHAR(64) COMMENT'供应商名称',
category_id VARCHAR(16) COMMENT'采购品类ID',
category_name VARCHAR(64) COMMENT'采购品类名称',
purchase_amount DECIMAL(12,2) COMMENT'采购金额(月)',
delivery_on_time_rate DECIMAL(5,2) COMMENT'交货准时率(%)',
inventory_turnover DECIMAL(5,2) COMMENT'库存周转率',
purchase_cost DECIMAL(10,2) COMMENT'采购成本',
lead_time INTCOMMENT'采购提前期(天)',
update_time DATETIME COMMENT'数据更新时间',
PRIMARY KEY (supplier_id, category_id)
) COMMENT'供应链主题宽表'
DISTRIBUTEDBYHASH(supplier_id) BUCKETS 32
PROPERTIES (
"replication_num" = "1"
);
3)开发SQL示例
INSERT INTO dws_supplier_detail
SELECT
s.supplier_id,
s.supplier_name,
p.category_id,
c.category_name,
SUM(p.amount) AS purchase_amount,
ROUND(SUM(CASEWHEN p.delivery_date <= p.expected_delivery_date THEN1ELSE0END) * 100.0 / NULLIF(COUNT(p.purchase_id), 0), 2) AS delivery_on_time_rate,
ROUND(SUM(p.quantity) * 30.0 / NULLIF(SUM(i.inventory), 0), 2) AS inventory_turnover,
AVG(p.unit_price) AS purchase_cost,
AVG(DATEDIFF(p.delivery_date, p.purchase_date)) AS lead_time,
NOW() AS update_time
FROM dwd_purchase_order p
JOIN dwd_supplier_info s ON p.supplier_id = s.supplier_id
JOIN dwd_category_info c ON p.category_id = c.category_id
JOIN dwd_inventory i ON p.goods_id = i.goods_id
WHERE p.purchase_date >= DATE_SUB(CURRENT_DATE, 30)
GROUPBY s.supplier_id, s.supplier_name, p.category_id, c.category_name;
7. 客户服务主题宽表:dws_service_detail
场景:客服响应率分析、投诉率监控、客户满意度评估。
1)核心字段设计
|
字段名 |
类型 |
描述 |
示例值 |
|
service_id |
VARCHAR(32) |
服务ID |
“SVC202309070001” |
|
service_channel |
VARCHAR(16) |
服务渠道 |
“在线客服” |
|
service_type |
VARCHAR(32) |
问题类型 |
“订单退款” |
|
agent_id |
VARCHAR(16) |
客服ID |
“A1001” |
|
response_time |
INT |
响应时长(分钟) |
5 |
|
resolution_rate |
DECIMAL(5,2) |
解决率(%) |
92.30 |
|
complaint_count |
INT |
投诉次数 |
2 |
|
satisfaction_score |
DECIMAL(5,2) |
满意度评分 |
4.7 |
|
update_time |
DATETIME |
数据更新时间 |
“2023-09-07 15:30:00” |
2)建表SQL
CREATE TABLE dws_service_detail (
service_id VARCHAR(32) COMMENT'服务ID',
service_channel VARCHAR(16) COMMENT'服务渠道',
service_type VARCHAR(32) COMMENT'问题类型',
agent_id VARCHAR(16) COMMENT'客服ID',
response_time INTCOMMENT'响应时长(分钟)',
resolution_rate DECIMAL(5,2) COMMENT'解决率(%)',
complaint_count INTCOMMENT'投诉次数',
satisfaction_score DECIMAL(5,2) COMMENT'满意度评分',
update_time DATETIME COMMENT'数据更新时间',
PRIMARY KEY (service_id)
) COMMENT'客户服务主题宽表'
DISTRIBUTEDBYHASH(service_id) BUCKETS 32
PROPERTIES (
"replication_num" = "1"
);
3)SQL代码
INSERT INTO dws_service_detail
SELECT
s.service_id,
s.service_channel,
s.service_type,
s.agent_id,
AVG(TIMESTAMPDIFF(MINUTE, s.create_time, s.response_time)) AS response_time,
ROUND(SUM(CASEWHEN s.status = '已解决'THEN1ELSE0END) * 100.0 / NULLIF(COUNT(*), 0), 2) AS resolution_rate,
COUNT(CASEWHEN s.complaint = 1THEN1END) AS complaint_count,
AVG(s.satisfaction_score) AS satisfaction_score,
NOW() AS update_time
FROM dwd_service_record s
WHERE s.create_time >= DATE_SUB(CURRENT_DATE, 7)
GROUPBY s.service_id, s.service_channel, s.service_type, s.agent_id;
8. 财务主题宽表:dws_finance_detail
场景:成本核算、利润分析、现金流监控。
1)核心字段设计
|
字段名 |
类型 |
描述 |
示例值 |
|
finance_id |
VARCHAR(16) |
财务ID |
“F20230907” |
|
department |
VARCHAR(32) |
部门 |
“市场部” |
|
project_id |
VARCHAR(16) |
项目ID |
“P20230901” |
|
project_name |
VARCHAR(64) |
项目名称 |
“9月新品推广” |
|
revenue |
DECIMAL(12,2) |
收入 |
250000.00 |
|
cost |
DECIMAL(12,2) |
成本 |
180000.00 |
|
gross_profit |
DECIMAL(12,2) |
毛利 |
70000.00 |
|
gross_profit_rate |
DECIMAL(5,2) |
毛利率(%) |
28.00 |
|
cash_flow |
DECIMAL(12,2) |
现金流 |
65000.00 |
|
update_time |
DATETIME |
数据更新时间 |
“2023-09-07 15:30:00” |
2)建表SQL
CREATE TABLE dws_finance_detail (
finance_id VARCHAR(16) COMMENT'财务ID',
department VARCHAR(32) COMMENT'部门',
project_id VARCHAR(16) COMMENT'项目ID',
project_name VARCHAR(64) COMMENT'项目名称',
revenue DECIMAL(12,2) COMMENT'收入',
costDECIMAL(12,2) COMMENT'成本',
gross_profit DECIMAL(12,2) COMMENT'毛利',
gross_profit_rate DECIMAL(5,2) COMMENT'毛利率(%)',
cash_flow DECIMAL(12,2) COMMENT'现金流',
update_time DATETIME COMMENT'数据更新时间',
PRIMARY KEY (finance_id)
) COMMENT'财务主题宽表'
DISTRIBUTEDBYHASH(finance_id) BUCKETS 32
PROPERTIES (
"replication_num" = "1"
);
3)SQL示例
INSERT INTO dws_finance_detail
SELECT
f.finance_id,
f.department,
f.project_id,
p.project_name,
SUM(o.amount) AS revenue,
SUM(c.cost) AScost,
SUM(o.amount) - SUM(c.cost) AS gross_profit,
ROUND((SUM(o.amount) - SUM(c.cost)) * 100.0 / NULLIF(SUM(o.amount), 0), 2) AS gross_profit_rate,
SUM(o.amount) - SUM(c.cost) AS cash_flow,
NOW() AS update_time
FROM dwd_order_fact o
JOIN dwd_project_info p ON o.project_id = p.project_id
JOIN dwd_cost_info c ON o.project_id = c.project_id
WHERE o.order_date >= DATE_SUB(CURRENT_DATE, 30)
GROUPBY f.finance_id, f.department, f.project_id, p.project_name;
9. 物流主题宽表:dws_logistics_detail
场景:物流时效分析、运输成本优化、异常订单监控。
1)核心字段设计
|
字段名 |
类型 |
描述 |
示例值 |
|
logistics_id |
VARCHAR(32) |
物流ID |
“L202309070001” |
|
logistics_company |
VARCHAR(32) |
物流公司 |
“顺丰速运” |
|
transport_type |
VARCHAR(16) |
运输方式 |
“快递” |
|
region |
VARCHAR(32) |
区域 |
“华东地区” |
|
delivery_time |
INT |
配送时长(小时) |
24 |
|
delivery_cost |
DECIMAL(8,2) |
运输成本 |
15.00 |
|
on_time_rate |
DECIMAL(5,2) |
准时率(%) |
96.50 |
|
exception_order_count |
INT |
异常订单数 |
2 |
|
update_time |
DATETIME |
数据更新时间 |
“2023-09-07 15:30:00” |
2)建表SQL
CREATE TABLE dws_logistics_detail (
logistics_id VARCHAR(32) COMMENT'物流ID',
logistics_company VARCHAR(32) COMMENT'物流公司',
transport_type VARCHAR(16) COMMENT'运输方式',
region VARCHAR(32) COMMENT'区域',
delivery_time INTCOMMENT'配送时长(小时)',
delivery_cost DECIMAL(8,2) COMMENT'运输成本',
on_time_rate DECIMAL(5,2) COMMENT'准时率(%)',
exception_order_count INTCOMMENT'异常订单数',
update_time DATETIME COMMENT'数据更新时间',
PRIMARY KEY (logistics_id)
) COMMENT'物流主题宽表'
DISTRIBUTEDBYHASH(logistics_id) BUCKETS 32
PROPERTIES (
"replication_num" = "1"
);
3)SQL代码
INSERT INTO dws_logistics_detail
SELECT
l.logistics_id,
l.logistics_company,
l.transport_type,
l.region,
AVG(TIMESTAMPDIFF(HOUR, o.order_time, l.delivery_time)) AS delivery_time,
AVG(l.cost) AS delivery_cost,
ROUND(SUM(CASEWHEN l.delivery_time <= o.expected_delivery_time THEN1ELSE0END) * 100.0 / NULLIF(COUNT(*), 0), 2) AS on_time_rate,
COUNT(CASEWHEN l.delivery_time > o.expected_delivery_time THEN1END) AS exception_order_count,
NOW() AS update_time
FROM dwd_order_detail o
JOIN dwd_logistics_info l ON o.order_id = l.order_id
WHERE o.order_date = CURDATE()
GROUPBY l.logistics_id, l.logistics_company, l.transport_type, l.region;
10. 数据质量监控宽表:dws_data_quality
场景:数据异常检测、ETL流程监控、数据一致性校验。
1)核心字段设计
|
字段名 |
类型 |
描述 |
示例值 |
|
table_name |
VARCHAR(64) |
表名 |
“dwd_order_detail” |
|
column_name |
VARCHAR(64) |
字段名 |
“order_id” |
|
data_type |
VARCHAR(16) |
数据类型 |
“VARCHAR” |
|
null_rate |
DECIMAL(5,2) |
空值率(%) |
0.50 |
|
duplicate_rate |
DECIMAL(5,2) |
重复率(%) |
0.20 |
|
data_lag |
INT |
数据延迟(小时) |
2 |
|
consistency_rate |
DECIMAL(5,2) |
一致性率(%) |
99.80 |
|
check_time |
DATETIME |
检查时间 |
“2023-09-07 15:30:00” |
|
status |
VARCHAR(16) |
状态 |
“正常” |
2)建表SQL
CREATE TABLE dws_data_quality (
table_name VARCHAR(64) COMMENT'表名',
column_name VARCHAR(64) COMMENT'字段名',
data_type VARCHAR(16) COMMENT'数据类型',
null_rate DECIMAL(5,2) COMMENT'空值率(%)',
duplicate_rate DECIMAL(5,2) COMMENT'重复率(%)',
data_lag INTCOMMENT'数据延迟(小时)',
consistency_rate DECIMAL(5,2) COMMENT'一致性率(%)',
check_time DATETIME COMMENT'检查时间',
statusVARCHAR(16) COMMENT'状态',
PRIMARY KEY (table_name, column_name, check_time)
) COMMENT'数据质量监控宽表'
DISTRIBUTEDBYHASH(table_name) BUCKETS 32
PROPERTIES (
"replication_num" = "1"
);
3)SQL示例
INSERT INTO dws_data_quality
SELECT
'dwd_order_detail'AS table_name,
'order_id'AS column_name,
'VARCHAR'AS data_type,
ROUND(COUNT(*) * 100.0 / NULLIF((SELECTCOUNT(*) FROM dwd_order_detail), 0), 2) AS null_rate,
ROUND(COUNT(*) * 100.0 / NULLIF((SELECTCOUNT(*) FROM dwd_order_detail), 0), 2) AS duplicate_rate,
TIMESTAMPDIFF(HOUR, MAX(order_time), NOW()) AS data_lag,
ROUND(COUNT(*) * 100.0 / NULLIF((SELECTCOUNT(*) FROM dwd_order_detail), 0), 2) AS consistency_rate,
NOW() AS check_time,
CASE
WHENROUND(COUNT(*) * 100.0 / NULLIF((SELECTCOUNT(*) FROM dwd_order_detail), 0), 2) < 5AND
TIMESTAMPDIFF(HOUR, MAX(order_time), NOW()) < 4THEN'正常'
ELSE'异常'
ENDASstatus
FROM dwd_order_detail
WHERE order_time < DATE_SUB(NOW(), INTERVAL1HOUR);
三、DWS宽表建模的黄金法则
主题驱动:宽表必须围绕明确的业务主题(如商品、用户)设计,避免”万能宽表”。
轻度聚合:聚合粒度不宜过细(如按天),也不宜过粗(如按月),需根据业务需求调整。
维度统一:确保宽表中的维度与DWD层一致,避免口径不一致导致的分析混乱。
性能优先:通过预计算和索引设计(如ClickHouse的ReplacingMergeTree引擎)提升查询效率。
公共性原则:宽表需服务于多个上层应用(如报表、BI工具),避免重复开发。
作者介绍
范老师,资深数仓专家,10年数据老兵。
来源丨公众号:数据仓库与Python大数据(ID:edw_bigdata)
dbaplus社群欢迎广大技术人员投稿,投稿邮箱:editor@dbaplus.cn

收藏了,感谢分享