Archive for the ‘Algorithm’ category

The computation steps of Principal Components Analysis (PCA)

April 28th, 2017

Today I practiced computing PCA in Octave and realized I didn't fully understand the computation, so I'm writing it down here.

First, load the data from the txt file and read the matrix into X.

octave:7> data = load('ex2data.txt')
data =

   2.50000   2.40000
   0.50000   0.70000
   2.20000   2.90000
   1.90000   2.20000
   3.10000   3.00000
   2.30000   2.70000
   2.00000   1.60000
   1.00000   1.10000
   1.50000   1.60000
   1.10000   0.90000
octave:8> X = data(:, [1, 2]);
octave:15> mu=mean(X)
mu =

   1.8100   1.9100

Compute the mean of x, then subtract the corresponding mean from every sample. Here the means are 1.81 and 1.91, so the first sample becomes (0.69, 0.49) after subtraction:

octave:16> X_norm = bsxfun(@minus, X, mu);
octave:17> X_norm 
X_norm =

   0.690000   0.490000
  -1.310000  -1.210000
   0.390000   0.990000
   0.090000   0.290000
   1.290000   1.090000
   0.490000   0.790000
   0.190000  -0.310000
  -0.810000  -0.810000
  -0.310000  -0.310000
  -0.710000  -1.010000

We use this matrix to construct the covariance matrix sigma = (X_norm' * X_norm) / size(X_norm, 1), dividing by the number of examples m = size(X_norm, 1) (here 10):

octave:34> sigma = X_norm' * X_norm
sigma =

   5.5490   5.5390
   5.5390   6.4490

octave:35> sigma = sigma/10
sigma =

   0.55490   0.55390
   0.55390   0.64490

Use Octave's svd function to compute the eigenvalues and eigenvectors of the covariance matrix directly:

octave:36> [U,S,V] = svd(sigma)
U =

  -0.67787  -0.73518
  -0.73518   0.67787

S =

Diagonal Matrix

   1.155625          0
          0   0.044175

V =

  -0.67787  -0.73518
  -0.73518   0.67787

U holds the eigenvectors and S the eigenvalues of sigma; from the eigenvalues we can compute how much variance is retained (see the sketch after the projected data below).
For example, to reduce the 2-dimensional data to 1 dimension, we take the first column of U (the one corresponding to the largest eigenvalue) and obtain the reduced data with z = X_norm * U(:,1) (equivalently, z = U(:,1)' * X_norm' gives the same values as a row vector).

octave:39> z = X_norm*U(:,1)
z =

  -0.827970
   1.777580
  -0.992197
  -0.274210
  -1.675801
  -0.912949
   0.099109
   1.144572
   0.438046
   1.223821
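
From S we can compute how much variance this 1-D projection retains. A minimal sketch using the values computed above:

   retained = S(1,1) / (S(1,1) + S(2,2))  % fraction of the total variance retained
   % here retained is roughly 0.96, i.e. about 96% of the variance is kept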

Mathematically, this amounts to minimizing the projection error from the source data onto the basis vector, i.e. finding a suitable basis vector to represent the plane.
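
To see this projection-error view concretely, we can map z back into the original 2-D space and measure the squared error against X_norm (a minimal sketch; X_rec and err are names introduced here for illustration):

   X_rec = z * U(:,1)';                  % 2-D reconstruction from the 1-D projection
   err = sum(sum((X_norm - X_rec).^2));  % the total squared projection error PCA minimizes
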
http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020209.html

http://www.cnblogs.com/LeftNotEasy/archive/2011/01/19/svd-and-applications.html

A summary of fixing Neural Network parameters

April 14th, 2017

When actually building a neural network, you constantly run into choices like the following; here is a summary:

  • Getting more training examples: Fixes high variance
  • Trying smaller sets of features: Fixes high variance
  • Adding features: Fixes high bias
  • Adding polynomial features: Fixes high bias
  • Decreasing λ: Fixes high bias
  • Increasing λ: Fixes high variance

When you run into high variance, you can try adding training examples or using a smaller set of features; when you run into high bias, the training set probably has too few features, so add features. λ is the regularization (penalty) coefficient: the larger λ, the heavier the penalty, which corrects high variance, while decreasing λ corrects high bias. The value of λ is usually chosen as the one that performs best on the cross-validation set, as sketched below.
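
A minimal sketch of that λ search in Octave, assuming hypothetical trainModel and cvCost helpers (the loop structure is the point here, not the helper names):

   lambdas = [0 0.01 0.03 0.1 0.3 1 3 10];
   best_err = Inf;
   for i = 1:length(lambdas)
     theta = trainModel(X_train, y_train, lambdas(i));  % hypothetical: fit with this lambda
     err = cvCost(theta, X_cv, y_cv);                   % hypothetical: error on the CV set
     if err < best_err
       best_err = err;
       best_lambda = lambdas(i);                        % keep the best-performing lambda
     end
   end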

Diagnosing Neural Networks

  • A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
  • A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting.

Using a single hidden layer is a good starting default. You can train your neural network with different numbers of hidden layers using your cross-validation set, and then select the one that performs best. A one-hidden-layer network is the simplest, but it may cost some accuracy, so we may want to add hidden layers and units; a complex network, however, leads to overfitting and high computational cost, so this trade-off has to be balanced.

Model Complexity Effects:

  • Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.
  • Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
  • In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.

By default, split the data into three parts: a 60% training set, a 20% cross-validation set, and a 20% test set, as sketched below.
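
A minimal sketch of that split in Octave, assuming the rows of X (and the matching labels y) have already been shuffled:

   m = size(X, 1);
   X_train = X(1:floor(0.6*m), :);               % 60% training set
   X_cv    = X(floor(0.6*m)+1:floor(0.8*m), :);  % 20% cross-validation set
   X_test  = X(floor(0.8*m)+1:end, :);           % 20% test set; split y the same way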

References:

https://www.coursera.org/learn/machine-learning/supplement/llc5g/deciding-what-to-do-next-revisited

http://www.cnblogs.com/sddai/p/5696834.html

How to design and train a Neural Network

March 31st, 2017

Designing a Neural Network

First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.

  • Number of input units = dimension of features x(i)

  • Number of output units = number of classes

  • Number of hidden units per layer = usually the more the better (but this must be balanced against the cost of computation, which increases with more hidden units)

  • Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.

To build a Neural Network, first decide how many hidden layers to create and how many units each hidden layer should have.

The number of input units is simply the number of input features, and the number of output units is the number of classes; how many units should each hidden layer have?

Generally, the more units a hidden layer has, the better the classification, but that has to be weighed against the computational cost. By default, a neural network has one hidden layer; when there is more than one hidden layer, every hidden layer has the same number of units.
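
As a concrete example of such a layout, a hypothetical handwritten-digit setup might look like this in Octave (the sizes are only an illustration):

   input_layer_size  = 400;  % e.g. 20x20-pixel input images, so 400 features
   hidden_layer_size = 25;   % one hidden layer with 25 units
   num_labels        = 10;   % digits 0-9, so 10 output classes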

Training a Neural Network

  • Randomly initialize the weights
  • Implement forward propagation to get hΘ(x(i)) for any x(i)
  • Implement the cost function
  • Implement backpropagation to compute partial derivatives
  • Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
  • Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

The weights on the input features are initialized randomly (if all the initial weights were the same constant, every hidden unit would receive the same input value, so hΘ(x(i)) would be identical across units, causing symmetry; different initial weights are exactly what achieves Symmetry Breaking).
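
A minimal sketch of that random initialization in Octave, reusing the layer-size names from the sketch above (the value of epsilon_init is chosen here just for illustration):

   epsilon_init = 0.12;  % small interval around zero
   Theta1 = rand(hidden_layer_size, input_layer_size + 1) * 2 * epsilon_init - epsilon_init;
   Theta2 = rand(num_labels, hidden_layer_size + 1) * 2 * epsilon_init - epsilon_init;
   % the +1 columns hold the weights for the bias units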

Implement forward propagation to compute the activations of every layer, implement the cost function, compute the partial derivative with respect to every Θ via backpropagation, then use gradient checking to verify that backpropagation works, and finally disable gradient checking (it is far too expensive to keep running).

Finally, use gradient descent (or a built-in optimization function) to find the Θ that minimizes the cost function; that Θ is the parameter set we need.
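
A minimal sketch of that last step using Octave's built-in fminunc instead of hand-rolled gradient descent (costFunction is a hypothetical function handle that returns both the cost and its gradient):

   options = optimset('GradObj', 'on', 'MaxIter', 400);
   [theta, cost] = fminunc(@(t) costFunction(t, X, y, lambda), initial_theta, options);
   % initial_theta holds the randomly initialized, unrolled Theta1/Theta2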

References:

http://blog.csdn.net/jinlianjingd/article/details/50767743

https://www.coursera.org/learn/machine-learning/supplement/Uskwd/putting-it-together

Solving a string-partitioning problem

June 30th, 2015

While doing a LeetCode problem, I ran into one about cutting a string into palindromic substrings. The problem statement is roughly as follows:

Given a string s, partition s such that every substring of the partition is a palindrome.

Return all possible palindrome partitioning of s.

For example, given s = "aab",
Return

  [
    ["aa","b"],
    ["a","a","b"]
  ]

This calls for a DFS (depth-first search). One thing to be careful about is that each character may be used only once, and we must not build candidates by concatenating characters; the substrings have to be cut directly out of string s, so we use s.substr(start, count). Unlike the earlier combinationSum problems, there is no intermediate target to carry along. We only need to pass in the parameters below, using step to mark the current start index into s and i as the end index.

#include <string>
#include <vector>
using namespace std;

class Solution {
public:
    vector<vector<string>> partition(string s) {
        vector<vector<string>> result;
        vector<string> path;
        DFS(s, result, path, 0);
        return result;
    }

private:
    // step is the start index of the next substring to cut from s
    void DFS(string &s, vector<vector<string>> &result, vector<string> &path, int step) {
        if (step >= (int)s.size()) {  // the whole string has been partitioned
            result.push_back(path);
            return;
        }
        for (int i = step; i < (int)s.size(); i++) {
            if (is_palindrome(s, step, i)) {
                path.push_back(s.substr(step, i - step + 1));  // cut s[step..i]
                DFS(s, result, path, i + 1);                   // partition the rest
                path.pop_back();                               // backtrack
            }
        }
    }

    bool is_palindrome(string &s, int start, int end) {
        while (start < end) {
            if (s[start] != s[end])
                return false;
            start++;
            end--;
        }
        return true;
    }
};

A string of length n has n-1 positions where it can be cut, and each position can independently be cut or not, so the complexity is O(2^(n-1)).

https://leetcode.com/problems/palindrome-partitioning/

Thoughts on Combination Sum

June 1st, 2015

Recently, while working through LeetCode, I came across the Combination Sum problems, which come as I, II, and III with steadily increasing difficulty. The task is to traverse a vector and pick out the combinations whose sum equals the target number. The approach follows the generic DFS solution pattern: using that DFS template, we first construct the dfs function.

In general we need five parameters: the result, the original data set, the intermediate result (the path), the index of the currently considered element, and the remaining amount needed to reach the target number:

void dfs(vector<vector<int>> &result, vector<int> &candidates, vector<int> &path, int current, int target)

Then, following the data in candidates, we depth-first search all the possibilities of this data set and add every intermediate path that reaches the target into result, according to the generic template:

void dfs(type &input, type &path, type &result, int cur or gap) {
              if (invalid data) return;            // termination condition
              if (cur == input.size()) {           // convergence condition
                  // if (gap == 0) {
                        put path into result
              }
              if (pruning is possible) return;
              for (...) {                          // try every possible extension
                     apply the action, modify path
                     dfs(input, step + 1 or gap--, result);
                     restore path
              }
}

Step 1: convergence, which here means target == 0.
Step 2: use the for() loop, pruning inside it with if (target - candidates[current] < 0) return;
Step 3: if step 2 passes, current qualifies, so push it onto path and keep searching deeper: dfs(result, candidates, path, current, target - candidates[current]); here the values in candidates may be reused any number of times. If each value may be used only once, we need to sort() first and skip repeats of the previously accepted value, which is the nums[i] == nums[i-1] comparison in Combination Sum II:

// Combination Sum II: each candidate may be used at most once.
// Assumes candidates has been sorted in ascending order.
void dfs(vector<vector<int>> &result, vector<int> &candidates, vector<int> &path, int current, int target) {
		if (!path.empty() && target == 0) {   // found a combination that sums to the target
			result.push_back(path);
			return;
		}
		int prev = -1;                        // last value tried at this depth, to skip duplicates
		for (; current < (int)candidates.size(); current++) {
			if (candidates[current] == prev)  // same value as the previous branch: skip it
				continue;
			if (target - candidates[current] < 0)
				return;                       // input is sorted, so every later candidate overshoots too

			prev = candidates[current];
			path.push_back(candidates[current]);
			dfs(result, candidates, path, current + 1, target - candidates[current]);
			path.pop_back();                  // backtrack
		}
	}

Problems:

https://leetcode.com/problems/combination-sum/

https://leetcode.com/problems/combination-sum-ii/

https://leetcode.com/problems/combination-sum-iii/