LSH with MapReduce
For our Big Data final project, we use LSH (locality-sensitive hashing) to find similar articles, which can be applied to plagiarism detection or to recommending articles with similar content.
STEPS
Preprocessing the raw text
: Since we generate shingles character by character, we need to strip the line breaks and collapse every run of whitespace into a single space (a sketch follows the sample text below).
# Raw data
(CNN)Simona Halep declared on the eve of the French Open that about "15 players" were the favorite but the way she has played at Roland Garros in the last week, the number has dwindled to one -- and it is the Romanian herself.
The 2014 finalist would have been the heavy favorite had she not sustained an ankle injury in the final of the Italian Open last month against Elina Svitolina -- a week after defending her crown at the Madrid Open. However, the third seed has brushed aside any fitness concerns by reaching the last eight without conceding a set.
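A minimal sketch of this normalization, assuming the raw article has been read into a string named rawText (a name we introduce for illustration):
// Collapse newlines and runs of whitespace into single spaces.
String cleaned = rawText.replaceAll("\\s+", " ").trim();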
Shingling
: We use character-level k-shingles. Hence, we implement the first MapReduce job to generate all shingles. In the mapper, we generate every shingle and emit it to the reducers, using the first character of the shingle as the key and the shingle together with the document id as the value. In this way, the number of distinct keys, and hence the reducer load, stays small.
// Key: the shingle's first character (groups shingles to limit reducer load).
// Value: the shingle concatenated with the source document id.
String keyName = shingle.toString();
context.write(new Text(String.valueOf(keyName.charAt(0))),
              new Text(keyName + key.toString()));
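For context, a minimal sketch of the loop that produces the shingles fed into this emit, assuming cleaned is the normalized text and k is the shingle length (names we introduce for illustration):
// Slide a window of k characters across the document text.
for (int i = 0; i + k <= cleaned.length(); i++) {
    Text shingle = new Text(cleaned.substring(i, i + k));
    // ... emit as shown above ...
}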
In the reducer, each group receives the shingles, from every document, that share a first character. We then build a 0/1 matrix in which each row is a shingle and each column indicates whether that shingle appears in the corresponding document (see the sketch after the sample rows below).
M, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0
M, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0
M, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
M, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0
M, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
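A minimal sketch of the reducer's row-building logic, assuming numOfDocuments is available and extractShingle/extractDocId are hypothetical helpers that split the value emitted by the mapper:
// Each key groups all shingles sharing a first character; record, per shingle,
// which documents contain it, then emit one 0/1 row per shingle.
Map<String, int[]> rows = new HashMap<>();
for (Text value : values) {
    String shingle = extractShingle(value);
    int docId = extractDocId(value);
    rows.computeIfAbsent(shingle, s -> new int[numOfDocuments])[docId] = 1;
}
for (int[] row : rows.values()) {
    StringBuilder sb = new StringBuilder("M");
    for (int v : row) sb.append(", ").append(v);
    // Emit the full row as the value (no meaningful key at this stage).
    context.write(NullWritable.get(), new Text(sb.toString()));
}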
Pre-work of Transform
: We need to transform the result of the previous stage into <row, column, value> triples. Otherwise, we do not know the ids of the shingles and cannot transpose the matrix.
Transform
: The output of the previous stage is stored as triples like the following (shingle as row, document as column):
0, 2, 1
1, 6, 1
1, 8, 1
2, 9, 1
3, 5, 1
However, we need to transpose it into the following form (document as row, shingle as column) so that it can be used in the next stage.
M, 0, 0000000100010000000....
Hence, we design a MapReduce job for this transpose. First, the mapper simply emits each triple as <K, V> = <columnID, (rowID, value)>. Next, each reducer gathers all the entries of one column (document). Finally, we output the matrix with documents as the row index, as sketched below.
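A minimal sketch of the transpose mapper, assuming each input line is a "rowId,columnId,value" triple (a field layout we assume from the sample above):
// Key by the document (column) id so one reducer can assemble one document row.
public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
    String[] parts = line.toString().split(",");
    context.write(new Text(parts[1]), new Text(parts[0] + "," + parts[2]));
}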
Minhashing
: In this stage, we first generate the permutations, which compresses the shingle dimension down to the number of permutations (the signature length). To generate each permutation, we call the shuffle function and store all of them in a text file. We then use this file, together with the transposed matrix, as the input of the job.
// Write one permutation per line in the form "P,<permutationId>,<v0>-<v1>-...".
for (int i = 0; i < sizeOfSignature; i++) {
    Collections.shuffle(solution);
    IOUtils.write("P," + Integer.toString(i) + ",", os);
    for (int j = 0; j < solution.size(); j++) {
        IOUtils.write(Integer.toString(solution.get(j)), os);
        if (j != solution.size() - 1) {
            IOUtils.write("-", os);  // separator between permutation entries
        }
    }
    IOUtils.write("\n", os);
}
os.close();
In the mapper, we need to determine whether the input record comes from the permutation file or from the matrix file. If it comes from the permutation file, we emit it once for every document.
if (matrix.contains("P")){
for(int i=0; i<numOfDocuments;i++){
context.write(new Text(data[1]+","+Integer.toString(i)), new Text("P," + data[2]));
}
}
If it comes from the matrix file, we emit it once for every permutation.
if (matrix.contains("M")){
for(int i=0; i<numOfSignatures; i++){
context.write(new Text(Integer.toString(i)+","+data[1]), new Text("M," + data[2]));
}
}
In the reducer, each group corresponds to one entry of the signature matrix (the result of minhashing): it receives one permutation and one document row. We then hash according to the permutation and obtain the signature value for this entry, as sketched below.
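A minimal sketch of that hashing step, assuming perm[j] is the permuted position of shingle j and row is the document's 0/1 string (names we introduce; the actual reducer first parses the "P,..." and "M,..." values emitted above):
// The signature entry is the smallest permuted position among the shingles
// that the document actually contains.
int signature = Integer.MAX_VALUE;
for (int j = 0; j < row.length(); j++) {
    if (row.charAt(j) == '1' && perm[j] < signature) {
        signature = perm[j];
    }
}
context.write(key, new Text(Integer.toString(signature)));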
LSH
: In this stage, we need to divide the signature matrix into bands, so we design a MapReduce job for it. In the mapper, we divide the signature row id by the number of rows per band and use the quotient as the key, so that rows belonging to the same band are grouped together.
// Band id = rowId / rowsPerBand; row index within the band = rowId % rowsPerBand.
int rowsPerBand = Integer.parseInt(conf.get("rowPerBand"));
String[] data = value.toString().split(",");
int rowId = Integer.parseInt(data[0]);
int bandNum = rowId / rowsPerBand;
int newRowId = rowId % rowsPerBand;
context.write(new Text(Integer.toString(bandNum)),
              new Text(Integer.toString(newRowId) + "," + data[1] + "," + data[2]));
In the reducer, we compare all the signatures within a band: if two documents agree on every signature row of the band, we regard them as candidate similar items. In detail, we use a HashMap and concatenate all of a document's signature values in the band into the hash key, so if more than one document lands in the same entry, those documents are identical in this band (see the sketch after the sample rows below).
0, 0, 7
0, 1, 1
0, 2, 7
0, 3, 7
0, 4, 7
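A minimal sketch of the bucketing idea, assuming bandSignatures maps each document id to its concatenated signature values within this band (names we introduce, not our exact reducer):
// Documents whose concatenated band signatures collide are candidate pairs.
Map<String, List<Integer>> buckets = new HashMap<>();
for (Map.Entry<Integer, String> e : bandSignatures.entrySet()) {
    buckets.computeIfAbsent(e.getValue(), s -> new ArrayList<>()).add(e.getKey());
}
for (List<Integer> docs : buckets.values()) {
    if (docs.size() > 1) {
        context.write(new Text("candidates"), new Text(docs.toString()));
    }
}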
Mathematical Analysis
Choosing the number of bands (number of permutations)
According to the mathematical analysis in the lecture slides, the probability that all the signatures in a band are identical in at least one of n bands, where each band contains r signatures, is 1 − (1 − P^r)^n, where P is the similarity ratio of the signatures between two documents. We then analyze the filtering threshold under different total numbers of rows (permutations), keeping r = 5 rows per band.
Rows=1000, r=5
similarity | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
probability | 0.001 | 0.06 | 0.38 | 0.87 | 0.99 | 0.99 | 1 | 1 | 1 |
Rows=500, r=5
similarity | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
probability | 0.0009 | 0.03 | 0.215 | 0.64 | 0.95 | 0.99 | 0.99 | 1 | 1 |
Rows=100, r=5
similarity | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
probability | 0.001 | 0.006 | 0.047 | 0.186 | 0.47 | 0.802 | 0.975 | 0.99 | 1 |
Rows=50, r=5
similarity | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
probability | 0.001 | 0.0032 | 0.024 | 0.0978 | 0.27 | 0.55 | 0.84 | 0.98 | 0.99 |
Rows=30, r=5
similarity | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
probability | 0.001 | 0.0019 | 0.014 | 0.059 | 0.173 | 0.38 | 0.66 | 0.907 | 0.99 |
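For reference, a small helper that reproduces the tables above (a sketch; the function name is ours):
// Probability that two documents with signature similarity p become candidates,
// i.e. agree on all r rows of at least one of n bands.
public static double candidateProbability(double p, int r, int n) {
    return 1.0 - Math.pow(1.0 - Math.pow(p, r), n);
}
// e.g. candidateProbability(0.5, 5, 10) ≈ 0.27, matching the Rows=50 table.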
Thus, we can change the filtering threshold by changing the number of bands (permutations) to obtain the behavior we want. If we are working on duplicate detection or plagiarism, we can set the number of permutations to 30 or lower, while if the use case is recommending similar items, we can set it to 500 or even 1000.
In the experiment, we construct 10 documents with different degrees of content similarity to show how the results differ.
Choosing shingle length
We observe that the shingle length we choose affects the similarity between the original documents and their corresponding signatures, and the appropriate length depends on how long the documents are.
We test two different shingle lengths (5 and 9) in our experiments to show the difference.
Things we observe about document length
We observe that documents of different lengths generate different total numbers of shingles, which in turn affects the permutation result, so we construct our test cases to be roughly the same length.
Conclusion
The results with 9-shingles are better than those with 5-shingles; our documents are probably better suited to 9-shingles.