Qi

LCA最近公共祖先算法

发表于 2022-12-27 更新于 2023-08-09 分类于 algrithom 299 1 分钟

在树（如二叉树）中，LCA（Lowest Common Ancestor）指的是两个节点的最近公共祖先节点。以下是一个二叉树中寻找LCA的Java代码示例：

class TreeNode {
    int val;
    TreeNode left;
    TreeNode right;

    TreeNode(int val) {
        this.val = val;
        this.left = null;
        this.right = null;
    }
}

public class LowestCommonAncestor {

    public static TreeNode findLCA(TreeNode root, TreeNode node1, TreeNode node2) {
        if (root == null || root == node1 || root == node2) {
            return root;
        }

        TreeNode leftLCA = findLCA(root.left, node1, node2);
        TreeNode rightLCA = findLCA(root.right, node1, node2);

        if (leftLCA != null && rightLCA != null) {
            return root;  // root is the LCA
        }

        return (leftLCA != null) ? leftLCA : rightLCA;
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode(3);
        root.left = new TreeNode(5);
        root.right = new TreeNode(1);
        root.left.left = new TreeNode(6);
        root.left.right = new TreeNode(2);
        root.right.left = new TreeNode(0);
        root.right.right = new TreeNode(8);
        root.left.right.left = new TreeNode(7);
        root.left.right.right = new TreeNode(4);

        TreeNode node1 = root.left;
        TreeNode node2 = root.right;
        TreeNode lca = findLCA(root, node1, node2);

        System.out.println("Lowest Common Ancestor of " + node1.val + " and " + node2.val + " is: " + lca.val);
    }
}

在这个示例中，findLCA 方法使用递归的方式在二叉树中寻找两个给定节点的最近公共祖先。它首先检查根节点是否为其中之一，然后递归搜索左子树和右子树，最终找到最近公共祖先。

请注意，这个示例是基于二叉树的LCA算法，实际应用中可能需要根据具体的数据结构和问题进行适当的修改。

KMP算法

发表于 2022-12-27 更新于 2022-12-28 分类于 algrithom 698 3 分钟

KMP算法解决的是字符串模式匹配定位问题

主串： ABACBFG —i
模式串：ABAD —j

简单算法：

从左到右一个一个匹配，遇到不匹配，i回到i-j+1,j回到0，重新匹配（不考虑模式串本身特性–最大前后缀数）

/**

 * @param ts 主串

 * @param ps 模式串

 * @return 如果找到，返回在主串中第一个字符出现的下标，否则为-1

 */

public static int bf(String ts, String ps) {

    char[] t = ts.toCharArray();

    char[] p = ps.toCharArray();

    int i = 0; // 主串的位置

    int j = 0; // 模式串的位置

    while (i < t.length && j < p.length) {

       if (t[i] == p[j]) { // 当两个字符相同，就比较下一个

           i++;

           j++;

       } else {

           i = i - j + 1; // 一旦不匹配，i后退

           j = 0; // j归0

       }

    }

    if (j == p.length) {

       return i - j;

    } else {

       return -1;

    }

}

优化：考虑先前匹配的结果和模式串的特点（最长前缀），i可以不动，只移动j，到位置k

寻找k

规律：最前面的k个字符和j之前的最后k个字符是一样的

推导：

T[i] != P[j] 
T[i-j ~ i-1] == P[0 ~ j-1] 
P[0 ~ k-1] == P[j-k ~ j-1]
==> T[i-k ~ i-1] == P[0 ~ k-1]

由于每个j对应的k都可能不同，可以找到一个j~k对应关系 `next[j] = k`

四个规则

j在最左边了，匹配失败，i指针后移
P[0 ~ k-1] == p[j-k ~ j-1] 且 P[k] == P[j]时，next[j+1] == k + 1 == next[j] + 1，找到最大公共前后缀数
P[0 ~ k-1] == p[j-k ~ j-1] 且 P[k] != P[j]时，k=next[k]（在位置k不匹配了，往下找不到更长的前缀字串了，但可以根据k找到最长前后缀数）

递归思想: 不匹配，递归找不匹配位置k前面有没有相同前缀后缀，一直找到next[0]=-1为止，此时next[j]=k+1=-1+1=0

public static int[] getNext(String ps) {

    char[] p = ps.toCharArray();

    int[] next = new int[p.length];

    next[0] = -1;

    int j = 0;

    int k = -1;

    while (j < p.length - 1) {

       if (k == -1 || p[j] == p[k]) {

           next[++j] = ++k;

       } else {

           k = next[k];

       }

    }

    return next;

}

当P[j] == P[next[j]],已经与主串不匹配`

public static int[] getNext(String ps) {

    char[] p = ps.toCharArray();

    int[] next = new int[p.length];

    next[0] = -1;

    int j = 0;

    int k = -1;

    while (j < p.length - 1) {

       if (k == -1 || p[j] == p[k]) {

           if (p[++j] == p[++k]) { // 当两个字符相等时要跳过

              next[j] = next[k];

           } else {

              next[j] = k;

           }

       } else {

           k = next[k];

       }

    }

    return next;

}

匹配过程

public static int KMP(String ts, String ps) {

    char[] t = ts.toCharArray();

    char[] p = ps.toCharArray();

    int i = 0; // 主串的位置

    int j = 0; // 模式串的位置

    int[] next = getNext(ps);

    while (i < t.length && j < p.length) {

       if (j == -1 || t[i] == p[j]) { // 当j为-1时，要移动的是i，当然j也要归0

           i++;

           j++;

       } else {

           // i不需要回溯了

           // i = i - j + 1;

           j = next[j]; // j回到指定位置

       }

    }

    if (j == p.length) {

       return i - j;

    } else {

       return -1;

    }

}

EM算法

发表于 2022-12-27 更新于 2022-12-28 分类于 algrithom 29 1 分钟

极大似然估计：应用广泛的参数估计方法

链接：EM算法详细推导和讲解

DP 树的最小支配集，最小点覆盖与最大独立集

发表于 2022-12-27 更新于 2023-08-09 分类于 algrithom 378 1 分钟

在图论中，树的最小支配集（Minimum Dominating Set）、最小点覆盖（Minimum Vertex Cover）和最大独立集（Maximum Independent Set）是经典的问题，可以使用动态规划来解决。下面是使用动态规划求解这些问题的Java代码示例：

import java.util.ArrayList;
import java.util.List;

class TreeNode {
    int val;
    List<TreeNode> children;

    TreeNode(int val) {
        this.val = val;
        this.children = new ArrayList<>();
    }
}

public class TreeProblems {

    // 最小支配集问题
    public static int minDominatingSet(TreeNode root) {
        if (root == null) {
            return 0;
        }
        
        int take = 1;
        int notTake = 0;
        
        for (TreeNode child : root.children) {
            take += minDominatingSet(child);
            for (TreeNode grandchild : child.children) {
                notTake += minDominatingSet(grandchild);
            }
        }
        
        return Math.min(take, notTake);
    }

    // 最小点覆盖问题
    public static int minVertexCover(TreeNode root) {
        if (root == null) {
            return 0;
        }

        int take = 1;
        int notTake = 0;

        for (TreeNode child : root.children) {
            take += minVertexCover(child);
            notTake += child.children.size() > 0 ? minVertexCover(child.children.get(0)) : 0;
        }

        return Math.min(take, notTake);
    }

    // 最大独立集问题
    public static int maxIndependentSet(TreeNode root) {
        if (root == null) {
            return 0;
        }

        int take = 1;
        int notTake = 0;

        for (TreeNode child : root.children) {
            take += maxIndependentSet(child);
            for (TreeNode grandchild : child.children) {
                notTake += maxIndependentSet(grandchild);
            }
        }

        return Math.max(take, notTake);
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode(1);
        TreeNode child1 = new TreeNode(2);
        TreeNode child2 = new TreeNode(3);
        TreeNode child3 = new TreeNode(4);
        
        root.children.add(child1);
        root.children.add(child2);
        root.children.add(child3);
        
        TreeNode grandchild1 = new TreeNode(5);
        TreeNode grandchild2 = new TreeNode(6);
        
        child1.children.add(grandchild1);
        child2.children.add(grandchild2);

        System.out.println("Minimum Dominating Set: " + minDominatingSet(root));
        System.out.println("Minimum Vertex Cover: " + minVertexCover(root));
        System.out.println("Maximum Independent Set: " + maxIndependentSet(root));
    }
}

在这个示例中，我们定义了一个简单的树结构（TreeNode），并使用动态规划方法解决了最小支配集、最小点覆盖和最大独立集问题。每个问题都是通过递归计算来实现的。请注意，实际应用中的树结构可能会更加复杂，代码也会相应地调整。

随机森林

发表于 2022-12-27 更新于 2023-08-09 分类于 mechine 387 1 分钟

随机森林（Random Forest）是一种集成学习方法，通过组合多个决策树来提高预测准确性和泛化能力。以下是使用Python实现随机森林的示例代码，使用了scikit-learn库：

首先，确保已安装 scikit-learn 库，可以通过以下命令安装：

1	pip install scikit-learn

接下来，使用下面的代码示例：

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 加载鸢尾花数据集
iris = load_iris()
X, y = iris.data, iris.target

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 创建随机森林分类器
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 训练模型
model.fit(X_train, y_train)

# 在测试集上进行预测
y_pred = model.predict(X_test)

# 计算准确率和分类报告
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, target_names=iris.target_names)

print("Accuracy:", accuracy)
print("Classification Report:\n", classification_rep)

在上述代码中，我们首先加载了鸢尾花数据集，然后使用train_test_split函数将数据集划分为训练集和测试集。接着，我们创建了一个随机森林分类器，并通过设置n_estimators参数来指定包含的决策树数量。使用fit方法训练模型，在测试集上进行预测，并计算了模型的准确率和分类报告。

随机森林的优点之一是可以处理分类和回归问题，并且不容易过拟合。在实际应用中，您可以根据问题的需求调整超参数，如树的数量、最大深度等，以达到更好的性能。

逻辑回归

发表于 2022-12-27 更新于 2023-08-09 分类于 mechine 429 2 分钟

线性

logistic回归又称logistic回归分析，是一种广义的线性回归分析模型

逻辑回归是一种用于分类问题的机器学习算法，它可以预测一个二元分类目标变量的概率。以下是使用Python实现逻辑回归的示例代码，使用了scikit-learn库：

首先，确保已安装 scikit-learn 库，可以通过以下命令安装：

1	pip install scikit-learn

接下来，使用下面的代码示例：

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 创建示例数据
np.random.seed(0)
X = np.random.rand(100, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)  # 根据线性关系生成二元分类标签

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 创建逻辑回归模型
model = LogisticRegression()

# 训练模型
model.fit(X_train, y_train)

# 在测试集上进行预测
y_pred = model.predict(X_test)

# 计算准确率和混淆矩阵
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))

在上述代码中，我们首先创建了一个示例数据集，其中包含两个特征，根据一个线性关系生成了二元分类标签。然后，使用train_test_split函数将数据集划分为训练集和测试集。接着，我们创建了一个逻辑回归模型，并使用fit方法训练模型。在测试集上进行预测，并计算了模型的准确率、混淆矩阵和分类报告。

请注意，这只是一个简单的逻辑回归示例。在实际应用中，您可能需要调整超参数、进行特征工程等，以达到更好的分类性能。

词袋模型

发表于 2022-12-27 更新于 2023-08-09 分类于 mechine 396 1 分钟

词袋模型（Bag-of-Words Model）是一种用于表示文本数据的方法，它将文本看作是一个包含单词的集合，忽略了单词的顺序和语法结构，只关注单词的频次或存在与否。以下是使用Python实现词袋模型的示例代码，使用了CountVectorizer类从scikit-learn库：

首先，确保已安装 scikit-learn 库，可以通过以下命令安装：

1	pip install scikit-learn

接下来，使用下面的代码示例：

from sklearn.feature_extraction.text import CountVectorizer

# 示例文本数据
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

# 创建词袋模型
vectorizer = CountVectorizer()

# 将文本数据转换为词袋表示
X = vectorizer.fit_transform(corpus)

# 获取词汇表和词频矩阵
vocabulary = vectorizer.get_feature_names_out()
word_frequency_matrix = X.toarray()

# 打印词汇表和词频矩阵
print("Vocabulary:", vocabulary)
print("Word Frequency Matrix:\n", word_frequency_matrix)

在上述代码中，我们首先定义了一个包含多个文本文档的示例文本数据。然后，使用CountVectorizer类将文本数据转换为词袋表示，得到一个词频矩阵。词汇表是所有文本数据中出现的不同单词的集合，而词频矩阵记录了每个文档中每个单词的出现频次。

请注意，词袋模型忽略了单词的顺序和语法结构，因此它可能无法捕捉到一些文本中的上下文信息。在实际应用中，您可能还需要考虑去除停用词、使用TF-IDF等进一步处理，以提高文本表示的效果。

线性回归

发表于 2022-12-27 更新于 2023-08-09 分类于 mechine 476 2 分钟

线性回归是一种用于建模和预测连续型目标变量与一个或多个自变量之间关系的统计方法。以下是使用Python实现简单线性回归的示例代码，使用了numpy和scikit-learn库：

首先，确保已安装 numpy 和 scikit-learn 库，可以通过以下命令安装：

1	pip install numpy scikit-learn

接下来，使用下面的代码示例：

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# 生成示例数据
np.random.seed(0)
X = np.random.rand(100, 1) * 10
y = 2 * X + 3 + np.random.randn(100, 1) * 2  # 使用线性关系生成目标变量

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 创建线性回归模型
model = LinearRegression()

# 训练模型
model.fit(X_train, y_train)

# 在测试集上进行预测
y_pred = model.predict(X_test)

# 计算均方误差
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# 绘制数据和拟合的直线
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression')
plt.legend()
plt.show()

# 打印模型参数（斜率和截距）
print("Slope (Coefficient):", model.coef_[0])
print("Intercept:", model.intercept_)

在上述代码中，我们首先生成了一个示例数据集，其中包含一个自变量（特征）和一个目标变量。然后，我们使用train_test_split函数将数据集划分为训练集和测试集。接着，创建了一个线性回归模型，并使用fit方法训练模型。在测试集上进行预测，并计算均方误差来衡量模型性能。

最后，我们绘制了测试集数据和模型预测的拟合直线，并打印了模型的斜率（系数）和截距。

请注意，这只是一个简单的线性回归示例。在实际应用中，您可能需要考虑特征选择、模型评估、多元线性回归等问题，以获得更好的预测结果。

留一验证

发表于 2022-12-27 更新于 2023-08-09 分类于 mechine 328 1 分钟

留一验证（Leave-One-Out Cross-Validation，LOOCV）是交叉验证的一种特殊形式，其中每个样本都被单独留出作为验证集，其余样本用于训练模型。这样可以有效地评估模型在每个样本上的性能。

以下是使用Python实现留一验证的示例代码：

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 加载鸢尾花数据集
iris = load_iris()
X, y = iris.data, iris.target

# 创建逻辑回归模型
model = LogisticRegression()

# 初始化计数器
correct_predictions = 0

# 使用留一验证进行模型评估
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    if y_pred == y_test:
        correct_predictions += 1

accuracy = correct_predictions / len(X)
print("Accuracy:", accuracy)

在上述代码中，我们使用了load_iris函数加载鸢尾花数据集，并创建了一个逻辑回归模型。然后，使用LeaveOneOut对象进行留一验证，对每个样本进行模型训练和测试，计算预测准确率。

请注意，留一验证是一种非常耗时的交叉验证方法，特别是在大型数据集上。在实际应用中，您可能会考虑使用其他交叉验证方法，如K折交叉验证，以在计算资源和性能之间找到平衡。

机器学习-特征选择

发表于 2022-12-27 更新于 2023-08-09 分类于 mechine 451 2 分钟

特征选择是从原始特征中选择出最具有代表性和影响力的特征，以降低维度、提高模型性能、减少过拟合等。以下是一些常见的特征选择方法的示例代码，使用了scikit-learn库：

首先，确保已安装 scikit-learn 库，可以通过以下命令安装：

1	pip install scikit-learn

接下来，使用下面的代码示例：

方差阈值法：

from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import make_classification

# 创建一个示例数据集
X, y = make_classification(n_samples=100, n_features=5, random_state=42)

# 使用方差阈值法进行特征选择
var_threshold = VarianceThreshold(threshold=0.1)
X_selected = var_threshold.fit_transform(X)

print("Original number of features:", X.shape[1])
print("Number of selected features:", X_selected.shape[1])

互信息法：

from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.datasets import make_classification

# 创建一个示例数据集
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# 使用互信息法进行特征选择
kbest = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = kbest.fit_transform(X, y)

print("Original number of features:", X.shape[1])
print("Number of selected features:", X_selected.shape[1])

递归特征消除法：

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# 创建一个示例数据集
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# 使用递归特征消除法进行特征选择
estimator = LogisticRegression(solver='liblinear')
rfe = RFE(estimator, n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)

print("Original number of features:", X.shape[1])
print("Number of selected features:", X_selected.shape[1])

在上述代码中，我们分别使用了方差阈值法、互信息法和递归特征消除法进行特征选择。您可以根据问题的需求选择适合的特征选择方法，调整相关参数以达到最佳效果。特征选择是数据预处理的重要一步，可以提高模型性能并降低计算成本。

奇异值分解在特征降维中的作用

奇异值分解在推荐系统中的应用

简单算法：

寻找k

推导：

由于每个j对应的k都可能不同，可以找到一个j~k对应关系 next[j] = k

四个规则

当P[j] == P[next[j]],已经与主串不匹配`

匹配过程

线性

由于每个j对应的k都可能不同，可以找到一个j~k对应关系 `next[j] = k`