Neural Decision Forest (ICCV 2015)

6 min readMay 27, 2021

最近因為疫情關係，都在家工作，有些設備在公司不方便用，結果慢慢發現能做的事都做得差不多了。既然如此，就來看 paper 寫 code 消遣一下。

最近看比較多的都是 deep learning 結合 decision tree 的模型居多，今天看的這篇 Deep Neural Decision Forests 網路上實作也很多，我自己是參考了 jingxil 的實作，但我自己不喜歡他/她 Tree 的實作方式，所以我自己寫了一個。放進去 training 之後也可以重現實驗結果，所以我想我應該沒寫錯，只是我的實作需要的記憶體少了一些。想說就來寫一篇紀錄一下實作過程的一些細節。

Binary Decision Tree 和 Stochastic Routing

以 Decision Forest 來說，核心的物件就是 decision tree，其中這篇 paper 比較特別的部分就是所謂的 stochastic routing。白話來說，就是 tree 中的每個decision node 都是一個 Bernoulli distribution 去控制會往左或往右，所以說在 leaf node 上會呈現一個隨機分佈，每筆資料或落在哪個 leaf node 上是隨機的。

除此之外，利用 Deep Learning 的 feature extraction 的能力，每個 decision node i 上向左走的機率 p_i 也是 input x 的函數，也就是 p_i(x)。

概念很簡單，但具體上怎麼實作呢？

我想對於 Binary Tree 的實作熟悉的讀者，應該很熟悉如何把 Binary Tree 寫成一個線性的結構，也就是用一個 array/list 去實作 Binary Tree，不熟悉的朋友可以參考這邊的說明。

而我們的目標，是算出 stochastic routing 下，leaf nodes 的 distribution 是多少。我的實作在這裡：

dboyliao/Neural-Decision-Forests

An implementation of the Deep Neural Decision Forests in PyTorch - dboyliao/Neural-Decision-Forests

github.com

我實作的概念大致上是這樣，以一個深度是 3 的 decision tree 來說，有 7 個 decision nodes 跟 8 個 leaf nodes：

透過 self.decision 輸出每個 decision node 上 Bernoulli distribution 的參數。以深度為 3 的 tree 來說，self.decision 會輸出一個 N x 8的 tensor，令其為 D。
給定目前在第 l 層的節點機率的分佈是 P_l，這些節點對應的 D 的 tensor slice 為 D_l (譬如第二層有 2 個節點，D_l = D[:, 1:3])，則下一層節點分佈 P_{l+1} 可以表示成：

因此實作上來說，透過 torch.repeat_interleave 把 P_l repeat 2 次，再把 D_l 整理出向左跟向右走的機率，進行 element-wise 相乘就可以得到下一層的機率分佈，重複這個動作就可以得到 leaf nodes 的分佈。另外因為 D 是 self.decision 的輸出，所以 D 其實會隨著 input 的 x 去做改變，也就是不同資料會有不同的 decision nodes，所以也會有不同的 leaf nodes 分佈，這就是 stochastic routing 跟一般的 deterministic 的 tree routing 的差別。具體來說就是下面這段程式在做的事：

Two Stage Training

在 tree routing 搞定之後，另一個 decision tree 必須處理的問題是在 leaf nodes 上的 value 是多少。以原始 paper 來說，因為是考慮分類器，所以 leaf nodes 上的 value 就是每個分類的機率分佈是多少。至於這些分佈要怎麼透過資料訓練出來呢？主要是透過兩階段訓練：

首先固定每個 leaf nodes 的 value (最初起始值為 uniform distribution)，訓練所有 tree 的 decision module ( self.decision )
再來固定每個 tree 的 decision module，更新 leaf nodes 的 value 到給定目前 routing 的機率下的理論最佳值，也就是 paper 裡的第 11 式

重複以上流程完整訓練整個 random forest。

基本上 leaf nodes 的 value 理論最佳值就是被 route 到該 node 上所有的 label 的加權平均，相關證明篇幅頗長，就麻煩自己翻翻 paper 啦，基本上就是下面這個式子：

Stochastic Routing As Feature Extractor

如果細看 jingxil 的實作，會發現有個 jointly_training 的選項可以使用。

這裡是跟原始 paper 不太一樣的地方。使用下去後，neural decision forest 彷彿就是一個 feature extractor 。

首先，我們的 forest 其實可以把每個 x 轉換成一個 N x num_leafs 的 tensor 。如果我們把這個 tensor 當作 feature 餵給另一個 module (譬如說 fully-connected layer)，把它轉換成 N x num_class 的機率分佈，則 forest 跟這個 module 其實可以一起做 end-to-end 的訓練。也就是說，可以把 forest 當作一種 feature extractor 來使用。

實作完之後跑起來，在 MNIST 上可以達到 99% 以上的準確率：

後記

整個實作基本上就是這樣，多學了一個 stochastic routing 其實是蠻有趣的。而且從 stochastic routing 的輸出來說，其實也可以當作一個 attention 來使用，如果之後有機會的話，或許我會試試看，反正我 code 都寫了，可以直接小改就變 attention 拿來用沒問題 XD

有機會再說囉，Happy Python Programming !

Neural Decision Forest (ICCV 2015)

Binary Decision Tree 和 Stochastic Routing

dboyliao/Neural-Decision-Forests

An implementation of the Deep Neural Decision Forests in PyTorch - dboyliao/Neural-Decision-Forests

Two Stage Training

Stochastic Routing As Feature Extractor

後記

References

Written by Dboy Liao

No responses yet