  * replaced by a single space, and a k-gram is a sequence of k characters.
  *
  * Default value of k is 3. A good rule of thumb is to imagine that there are
- * only 20 characters and estimate the number of k-shingles as 20^k. For
- * small documents like e-mails, k = 5 is a recommended value. For large
- * documents, such as research articles, k = 9 is considered a safe choice.
+ * only 20 characters and estimate the number of k-shingles as 20^k. For small
+ * documents like e-mails, k = 5 is a recommended value. For large documents,
+ * such as research articles, k = 9 is considered a safe choice.
  *
  * @author Thibault Debatty
  */
@@ -93,11 +93,10 @@ public int getK() {
     /**
      * Compute and return the profile of s, as defined by Ukkonen "Approximate
      * string-matching with q-grams and maximal matches".
-     * https://www.cs.helsinki.fi/u/ukkonen/TCS92.pdf
-     * The profile is the number of occurrences of k-shingles, and is used to
-     * compute q-gram similarity, Jaccard index, etc.
-     * Pay attention: the memory requirement of the profile can be up to
-     * k * size of the string
+     * https://www.cs.helsinki.fi/u/ukkonen/TCS92.pdf The profile is the number
+     * of occurrences of k-shingles, and is used to compute q-gram similarity,
+     * Jaccard index, etc. Pay attention: the memory requirement of the profile
+     * can be up to k * size of the string
      *
      * @param string
      * @return the profile of this string, as an unmodifiable Map
@@ -109,7 +108,7 @@ public final Map<String, Integer> getProfile(final String string) {
         for (int i = 0; i < (string_no_space.length() - k + 1); i++) {
             String shingle = string_no_space.substring(i, i + k);
             Integer old = shingles.get(shingle);
-            if (old != null) {
+            if (old != null) {
                 shingles.put(shingle, old + 1);
             } else {
                 shingles.put(shingle, 1);
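
For readers who want to try the counting loop from the last hunk outside the library, here is a minimal, self-contained sketch. It is not the project's code: the class name `ShingleProfileSketch`, the static `profile` method, and the `\s+` whitespace normalization are assumptions made for illustration, and `Map.merge` stands in for the explicit get/put branch shown in the diff.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-alone class; the names here are illustrative only.
final class ShingleProfileSketch {

    // Assumption: runs of whitespace collapse to a single space,
    // as the class comment in the first hunk describes.
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Counts how often each k-character shingle occurs in s.
    static Map<String, Integer> profile(String s, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        String normalized = SPACES.matcher(s).replaceAll(" ");
        Map<String, Integer> shingles = new HashMap<>();
        // A string of length n has n - k + 1 shingles (none if n < k).
        for (int i = 0; i <= normalized.length() - k; i++) {
            String shingle = normalized.substring(i, i + k);
            // Same effect as the get / null check / put sequence in the diff.
            shingles.merge(shingle, 1, Integer::sum);
        }
        return Collections.unmodifiableMap(shingles);
    }

    public static void main(String[] args) {
        // "abcab" with k = 2 yields the shingles ab, bc, ca, ab.
        System.out.println(profile("abcab", 2));
    }
}
```

With k = 2, the input "abcab" produces three distinct shingles, with "ab" counted twice. The returned map is unmodifiable, matching the `@return` description in the Javadoc above.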