Python实现的Google simhash算法


VIA

传统的hash函数能够将一样的文本生成一样的hash函数,但是,通过simhash方法,能够差不多相同的文档得到的hash函数也比较相近。

Charikar's hash 通过Charikar‘s hash,能够将比较相似度的文档得到比较相近的fingerprint。

该算法的流程如下:

or super-tokens (word tuples) * Each token is represented by its hash value; a traditional hash function is used * Weights are associated with tokens * A vector V of integers is initialized to 0, length of the vector corresponds to the desired hash size in bits * In a cycle for all token's hash values (h), vector V is updated: o ith element is decreased by token's weight if the ith bit of the hash h is 0, otherwise o ith element is increased by token's weight if the ith bit of the hash h is 1 * Finally, signs of elements of V corresponds to the bits of the final fingerprint 该hash不是将文档总体计算hash值,而是将文档中的每个token计算哈希值,对文档中每个token的hash值,按照位 对hash值进行求和,如果当前token的hash值在该位上是0,则减去1,如果在该位上是1,则加上1.将所有的token按照这种方式累加,求的最终的值作为fingerprint。

simhash简介:http://www.cnblogs.com/linecong/archive/2010/08/28/simhash.html

Googlecode simhash算法简介:http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf

Author Info :
  • From:Python实现的Google simhash算法
  • URL:http://blog.ihipop.info/2010/10/1754.html
  • Please Reserve This Link,Thanks!
  • 《Python实现的Google simhash算法》有2个想法

    1. 我自己也实现了一下simhash,发现基本不能用,来测下,你这个感觉也不能用啊,你找到原因了吗?而且你怎么不分词呢?

    2. 刚对上面这个算法做了一些调试测试,发现算法抗随机字符干扰差,而且两条很不相像的字符串相似度有时也会很高(如下面的测试结果),不太适合我所研究的领域。。

      :新年快乐好消息,见副件 [simhash = 0x7b452ce82fc2468927e08904d80c18995d4c4131859a6bb1be0dc66c3a1acb1f]
      Re:232adt股权激励的主要形式3e3 [simhash = 0xa1414dbe70f726d26a68822f98fb5e2ae11bd2adb92fdea6bcb5bb3fe23491cd]
      ========================================
      76.444232% percent similarity on hash
      126 bits differ out of 256

    发表评论

    电子邮件地址不会被公开。 必填项已用*标注