《算法》(第4版)第5章读书笔记
字符串的应用:1..信息处理:使用字符串来处理应用程序。所有信息都是用一系列字符串表示的
2.基因组学:计算生物学家的一项工作就是根据密码子将DNA转换成由4个碱基组成的字符串
3.通信系统:无论你是在发送短信,电子邮件或是下载电子书,都是将字符串从一个地方传送到另一个地方。
4.编程系统:程序是由字符串构成的。编译器,解释器等其他的能狗将程序转换成机器指令的软件都是使用复杂的字符串处理技术的重要应用软件。
5.0.1 JAVA表示字符串的两种方法
字符数组 char[]a JAVA字符串 String s
5.0.2 字母表
字母表的API
Alphabet(String s) 根据s中的字符创建一张新的字母表
char toChar(int index) 获取字母表中索引位置的字符
int toIndex(char c)获得c的索引,在0~R-1之间
boolean contains(char c) 检验c是否在字母表之中
int r() 基数
int 1gR() 表示一个索引所需的比特数
int[] toIndices(String s) 将s转换成R进制的整数
String toChars(int[] indices) 将R进制的整数转换成基于该字母表的字符串
字符串排序
第一类方***从右到左检查键中的字符。这种方法一般被称为低位优先的字符串排序。这种方法最适合键的长度都相同的字符串排序应用。
第二位方***从左到右检查键中的字符,首先查看的是最高位的字符,被称为高位优先的字符串排序。
低位优先LSD
public class LSD { private static final int BITS_PER_BYTE = 8; // do not instantiate private LSD() { } /** * Rearranges the array of W-character strings in ascending order. * * @param a the array to be sorted * @param w the number of characters per string */ public static void sort(String[] a, int w) { int n = a.length; int R = 256; // extend ASCII alphabet size String[] aux = new String[n]; for (int d = w-1; d >= 0; d--) { // sort by key-indexed counting on dth character // compute frequency counts int[] count = new int[R+1]; for (int i = 0; i < n; i++) count[a[i].charAt(d) + 1]++; // compute cumulates for (int r = 0; r < R; r++) count[r+1] += count[r]; // move data for (int i = 0; i < n; i++) aux[count[a[i].charAt(d)]++] = a[i]; // copy back for (int i = 0; i < n; i++) a[i] = aux[i]; } }
public static void sort(int[] a) { final int BITS = 32; // each int is 32 bits final int R = 1 << BITS_PER_BYTE; // each bytes is between 0 and 255 final int MASK = R - 1; // 0xFF final int w = BITS / BITS_PER_BYTE; // each int is 4 bytes int n = a.length; int[] aux = new int[n]; for (int d = 0; d < w; d++) { // compute frequency counts int[] count = new int[R+1]; for (int i = 0; i < n; i++) { int c = (a[i] >> BITS_PER_BYTE*d) & MASK; count[c + 1]++; } // compute cumulates for (int r = 0; r < R; r++) count[r+1] += count[r]; // for most significant byte, 0x80-0xFF comes before 0x00-0x7F if (d == w-1) { int shift1 = count[R] - count[R/2]; int shift2 = count[R/2]; for (int r = 0; r < R/2; r++) count[r] += shift1; for (int r = R/2; r < R; r++) count[r] -= shift2; } // move data for (int i = 0; i < n; i++) { int c = (a[i] >> BITS_PER_BYTE*d) & MASK; aux[count[c]++] = a[i]; } // copy back for (int i = 0; i < n; i++) a[i] = aux[i]; } }
public static void main(String[] args) { String[] a = StdIn.readAllStrings(); int n = a.length; // check that strings have fixed length int w = a[0].length(); for (int i = 0; i < n; i++) assert a[i].length() == w : "Strings must have fixed length"; // sort the strings sort(a, w); // print results for (int i = 0; i < n; i++) StdOut.println(a[i]); } }
高位优先MSD
public class MSD { private static final int BITS_PER_BYTE = 8; private static final int BITS_PER_INT = 32; // each Java int is 32 bits private static final int R = 256; // extended ASCII alphabet size private static final int CUTOFF = 15; // cutoff to insertion sort // do not instantiate private MSD() { } /** * Rearranges the array of extended ASCII strings in ascending order. * * @param a the array to be sorted */ public static void sort(String[] a) { int n = a.length; String[] aux = new String[n]; sort(a, 0, n-1, 0, aux); } // return dth character of s, -1 if d = length of string private static int charAt(String s, int d) { assert d >= 0 && d <= s.length(); if (d == s.length()) return -1; return s.charAt(d); } // sort from a[lo] to a[hi], starting at the dth character private static void sort(String[] a, int lo, int hi, int d, String[] aux) { // cutoff to insertion sort for small subarrays if (hi <= lo + CUTOFF) { insertion(a, lo, hi, d); return; } // compute frequency counts int[] count = new int[R+2]; for (int i = lo; i <= hi; i++) { int c = charAt(a[i], d); count[c+2]++; } // transform counts to indicies for (int r = 0; r < R+1; r++) count[r+1] += count[r]; // distribute for (int i = lo; i <= hi; i++) { int c = charAt(a[i], d); aux[count[c+1]++] = a[i]; } // copy back for (int i = lo; i <= hi; i++) a[i] = aux[i - lo]; // recursively sort for each character (excludes sentinel -1) for (int r = 0; r < R; r++) sort(a, lo + count[r], lo + count[r+1] - 1, d+1, aux); } // insertion sort a[lo..hi], starting at dth character private static void insertion(String[] a, int lo, int hi, int d) { for (int i = lo; i <= hi; i++) for (int j = i; j > lo && less(a[j], a[j-1], d); j--) exch(a, j, j-1); } // exchange a[i] and a[j] private static void exch(String[] a, int i, int j) { String temp = a[i]; a[i] = a[j]; a[j] = temp; } // is v less than w, starting at character d private static boolean less(String v, String w, int d) { // assert v.substring(0, d).equals(w.substring(0, d)); for (int i = d; i < Math.min(v.length(), w.length()); i++) { if (v.charAt(i) < w.charAt(i)) return true; if (v.charAt(i) > w.charAt(i)) return false; } return v.length() < w.length(); } /** * Rearranges the array of 32-bit integers in ascending order. * Currently assumes that the integers are nonnegative. * * @param a the array to be sorted */ public static void sort(int[] a) { int n = a.length; int[] aux = new int[n]; sort(a, 0, n-1, 0, aux); } // MSD sort from a[lo] to a[hi], starting at the dth byte private static void sort(int[] a, int lo, int hi, int d, int[] aux) { // cutoff to insertion sort for small subarrays if (hi <= lo + CUTOFF) { insertion(a, lo, hi, d); return; } // compute frequency counts (need R = 256) int[] count = new int[R+1]; int mask = R - 1; // 0xFF; int shift = BITS_PER_INT - BITS_PER_BYTE*d - BITS_PER_BYTE; for (int i = lo; i <= hi; i++) { int c = (a[i] >> shift) & mask; count[c + 1]++; } // transform counts to indicies for (int r = 0; r < R; r++) count[r+1] += count[r]; /************* BUGGGY CODE. // for most significant byte, 0x80-0xFF comes before 0x00-0x7F if (d == 0) { int shift1 = count[R] - count[R/2]; int shift2 = count[R/2]; for (int r = 0; r < R/2; r++) count[r] += shift1; for (int r = R/2; r < R; r++) count[r] -= shift2; } ************************************/ // distribute for (int i = lo; i <= hi; i++) { int c = (a[i] >> shift) & mask; aux[count[c]++] = a[i]; } // copy back for (int i = lo; i <= hi; i++) a[i] = aux[i - lo]; // no more bits if (d == 4) return; // recursively sort for each character if (count[0] > 0) sort(a, lo, lo + count[0] - 1, d+1, aux); for (int r = 0; r < R; r++) if (count[r+1] > count[r]) sort(a, lo + count[r], lo + count[r+1] - 1, d+1, aux); } // TODO: insertion sort a[lo..hi], starting at dth character private static void insertion(int[] a, int lo, int hi, int d) { for (int i = lo; i <= hi; i++) for (int j = i; j > lo && a[j] < a[j-1]; j--) exch(a, j, j-1); } // exchange a[i] and a[j] private static void exch(int[] a, int i, int j) { int temp = a[i]; a[i] = a[j]; a[j] = temp; } /** * Reads in a sequence of extended ASCII strings from standard input; * MSD radix sorts them; * and prints them to standard output in ascending order. * * @param args the command-line arguments */ public static void main(String[] args) { String[] a = StdIn.readAllStrings(); int n = a.length; sort(a); for (int i = 0; i < n; i++) StdOut.println(a[i]); } }
三向字符串快速排序
public class Quick3string { private static final int CUTOFF = 15; // cutoff to insertion sort // do not instantiate private Quick3string() { } /** * Rearranges the array of strings in ascending order. * * @param a the array to be sorted */ public static void sort(String[] a) { StdRandom.shuffle(a); sort(a, 0, a.length-1, 0); assert isSorted(a); } // return the dth character of s, -1 if d = length of s private static int charAt(String s, int d) { assert d >= 0 && d <= s.length(); if (d == s.length()) return -1; return s.charAt(d); } // 3-way string quicksort a[lo..hi] starting at dth character private static void sort(String[] a, int lo, int hi, int d) { // cutoff to insertion sort for small subarrays if (hi <= lo + CUTOFF) { insertion(a, lo, hi, d); return; } int lt = lo, gt = hi; int v = charAt(a[lo], d); int i = lo + 1; while (i <= gt) { int t = charAt(a[i], d); if (t < v) exch(a, lt++, i++); else if (t > v) exch(a, i, gt--); else i++; } // a[lo..lt-1] < v = a[lt..gt] < a[gt+1..hi]. sort(a, lo, lt-1, d); if (v >= 0) sort(a, lt, gt, d+1); sort(a, gt+1, hi, d); } // sort from a[lo] to a[hi], starting at the dth character private static void insertion(String[] a, int lo, int hi, int d) { for (int i = lo; i <= hi; i++) for (int j = i; j > lo && less(a[j], a[j-1], d); j--) exch(a, j, j-1); } // exchange a[i] and a[j] private static void exch(String[] a, int i, int j) { String temp = a[i]; a[i] = a[j]; a[j] = temp; } // is v less than w, starting at character d // DEPRECATED BECAUSE OF SLOW SUBSTRING EXTRACTION IN JAVA 7 // private static boolean less(String v, String w, int d) { // assert v.substring(0, d).equals(w.substring(0, d)); // return v.substring(d).compareTo(w.substring(d)) < 0; // } // is v less than w, starting at character d private static boolean less(String v, String w, int d) { assert v.substring(0, d).equals(w.substring(0, d)); for (int i = d; i < Math.min(v.length(), w.length()); i++) { if (v.charAt(i) < w.charAt(i)) return true; if (v.charAt(i) > w.charAt(i)) return false; } return v.length() < w.length(); } // is the array sorted private static boolean isSorted(String[] a) { for (int i = 1; i < a.length; i++) if (a[i].compareTo(a[i-1]) < 0) return false; return true; } /** * Reads in a sequence of fixed-length strings from standard input; * 3-way radix quicksorts them; * and prints them to standard output in ascending order. * * @param args the command-line arguments */ public static void main(String[] args) { // read in the strings from standard input String[] a = StdIn.readAllStrings(); int n = a.length; // sort the strings sort(a); // print the results for (int i = 0; i < n; i++) StdOut.println(a[i]); } }
暴力子字符串查找
import java.math.BigInteger; import java.util.Random; /** * The {@code RabinKarp} class finds the first occurrence of a pattern string * in a text string. * <p> * This implementation uses the Rabin-Karp algorithm. * <p> * For additional documentation, * see <a href="https://algs4.cs.princeton.edu/53substring">Section 5.3</a> of * <i>Algorithms, 4th Edition</i> by Robert Sedgewick and Kevin Wayne. */ public class RabinKarp { private String pat; // the pattern // needed only for Las Vegas private long patHash; // pattern hash value private int m; // pattern length private long q; // a large prime, small enough to avoid long overflow private int R; // radix private long RM; // R^(M-1) % Q /** * Preprocesses the pattern string. * * @param pattern the pattern string * @param R the alphabet size */ public RabinKarp(char[] pattern, int R) { this.pat = String.valueOf(pattern); this.R = R; throw new UnsupportedOperationException("Operation not supported yet"); } /** * Preprocesses the pattern string. * * @param pat the pattern string */ public RabinKarp(String pat) { this.pat = pat; // save pattern (needed only for Las Vegas) R = 256; m = pat.length(); q = longRandomPrime(); // precompute R^(m-1) % q for use in removing leading digit RM = 1; for (int i = 1; i <= m-1; i++) RM = (R * RM) % q; patHash = hash(pat, m); } // Compute hash for key[0..m-1]. private long hash(String key, int m) { long h = 0; for (int j = 0; j < m; j++) h = (R * h + key.charAt(j)) % q; return h; } // Las Vegas version: does pat[] match txt[i..i-m+1] ? private boolean check(String txt, int i) { for (int j = 0; j < m; j++) if (pat.charAt(j) != txt.charAt(i + j)) return false; return true; } // Monte Carlo version: always return true // private boolean check(int i) { // return true; //} /** * Returns the index of the first occurrrence of the pattern string * in the text string. * * @param txt the text string * @return the index of the first occurrence of the pattern string * in the text string; n if no such match */ public int search(String txt) { int n = txt.length(); if (n < m) return n; long txtHash = hash(txt, m); // check for match at offset 0 if ((patHash == txtHash) && check(txt, 0)) return 0; // check for hash match; if hash match, check for exact match for (int i = m; i < n; i++) { // Remove leading digit, add trailing digit, check for match. txtHash = (txtHash + q - RM*txt.charAt(i-m) % q) % q; txtHash = (txtHash*R + txt.charAt(i)) % q; // match int offset = i - m + 1; if ((patHash == txtHash) && check(txt, offset)) return offset; } // no match return n; } // a random 31-bit prime private static long longRandomPrime() { BigInteger prime = BigInteger.probablePrime(31, new Random()); return prime.longValue(); } /** * Takes a pattern string and an input string as command-line arguments; * searches for the pattern string in the text string; and prints * the first occurrence of the pattern string in the text string. * * @param args the command-line arguments */ public static void main(String[] args) { String pat = args[0]; String txt = args[1]; RabinKarp searcher = new RabinKarp(pat); int offset = searcher.search(txt); // print results StdOut.println("text: " + txt); // from brute force search method 1 StdOut.print("pattern: "); for (int i = 0; i < offset; i++) StdOut.print(" "); StdOut.println(pat); } }
一般匹配字符串时,我们从目标字符串str(假设长度为n)的第一个下标选取和ptr长度(长度为m)一样的子字符串进行比较,如果一样,就返回开始处的下标值,不一样,选取str下一个下标,同样选取长度为n的字符串进行比较,直到str的末尾(实际比较时,下标移动到n-m)。这样的时间复杂度是O(n*m)。
KMP算法:可以实现复杂度为O(m+n)
为何简化了时间复杂度:
充分利用了目标字符串ptr的性质(比如里面部分字符串的重复性,即使不存在重复字段,在比较时,实现最大的移动量)。
上面理不理解无所谓,我说的其实也没有深刻剖析里面的内部原因。
考察目标字符串ptr:
ababaca
这里我们要计算一个长度为m的转移函数next。
next数组的含义就是一个固定字符串的最长前缀和最长后缀相同的长度。
比如:abcjkdabc,那么这个数组的最长前缀和最长后缀相同必然是abc。
cbcbc,最长前缀和最长后缀相同是cbc。
abcbc,最长前缀和最长后缀相同是不存在的。
**注意最长前缀:是说以第一个字符开始,但是不包含最后一个字符。
比如aaaa相同的最长前缀和最长后缀是aaa。**
对于目标字符串ptr,ababaca,长度是7,所以next[0],next[1],next[2],next[3],next[4],next[5],next[6]分别计算的是
a,ab,aba,abab,ababa,ababac,ababaca的相同的最长前缀和最长后缀的长度。由于a,ab,aba,abab,ababa,ababac,ababaca的相同的最长前缀和最长后缀是“”,“”,“a”,“ab”,“aba”,“”,“a”,所以next数组的值是[-1,-1,0,1,2,-1,0],这里-1表示不存在,0表示存在长度为1,2表示存在长度为3。这是为了和代码相对应。
下图中的1,2,3,4是一样的。1-2之间的和3-4之间的也是一样的,我们发现A和B不一样;之前的算法是我把下面的字符串往前移动一个距离,重新从头开始比较,那必然存在很多重复的比较。现在的做法是,我把下面的字符串往前移动,使3和2对其,直接比较C和A是否一样。
#笔记##读书笔记#KMP算法:可以实现复杂度为O(m+n)
为何简化了时间复杂度:
充分利用了目标字符串ptr的性质(比如里面部分字符串的重复性,即使不存在重复字段,在比较时,实现最大的移动量)。
上面理不理解无所谓,我说的其实也没有深刻剖析里面的内部原因。
考察目标字符串ptr:
ababaca
这里我们要计算一个长度为m的转移函数next。
next数组的含义就是一个固定字符串的最长前缀和最长后缀相同的长度。
比如:abcjkdabc,那么这个数组的最长前缀和最长后缀相同必然是abc。
cbcbc,最长前缀和最长后缀相同是cbc。
abcbc,最长前缀和最长后缀相同是不存在的。
**注意最长前缀:是说以第一个字符开始,但是不包含最后一个字符。
比如aaaa相同的最长前缀和最长后缀是aaa。**
对于目标字符串ptr,ababaca,长度是7,所以next[0],next[1],next[2],next[3],next[4],next[5],next[6]分别计算的是
a,ab,aba,abab,ababa,ababac,ababaca的相同的最长前缀和最长后缀的长度。由于a,ab,aba,abab,ababa,ababac,ababaca的相同的最长前缀和最长后缀是“”,“”,“a”,“ab”,“aba”,“”,“a”,所以next数组的值是[-1,-1,0,1,2,-1,0],这里-1表示不存在,0表示存在长度为1,2表示存在长度为3。这是为了和代码相对应。
下图中的1,2,3,4是一样的。1-2之间的和3-4之间的也是一样的,我们发现A和B不一样;之前的算法是我把下面的字符串往前移动一个距离,重新从头开始比较,那必然存在很多重复的比较。现在的做法是,我把下面的字符串往前移动,使3和2对其,直接比较C和A是否一样。
字符串匹配是字符串的一种基本操作:给定一个长度为 M 的文本和一个长度为 N 的模式串,在文本中找到一个和该模式相符的子字符串,并返回该字字符串在文本中的位置。
KMP 算法,全称是 Knuth-Morris-Pratt 算法,以三个发明者命名,开头的那个K就是著名科学家 Donald Knuth 。KMP 算法的关键是求 next 数组。next 数组的长度为模式串的长度。next 数组中每个值代表模式串中当前字符前面的字符串中,有多大长度的相同前缀后缀。
Boyer-Moore 算法在实际应用中比 KMP 算法效率高,据说各种文本编辑器的"查找"功能(Ctrl+F),包括 linux 里的 grep 命令,都是采用 Boyer-Moore 算法。该算法有“坏字符”和“好后缀”两个概念。主要特点是字符串从后往前匹配。
Sunday 算法跟 KMP 算法一样,是从前往后匹配。在匹配失败时,关注文本串中参加匹配的最末位字符的下一位字符,如果该字符不在模式串中,则整个模式串移动到该字符之后。如果该字符在模式串中,将模式串右移使对应的字符对齐。