Resource overview
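Algorithm idea: extract TF-IDF weights for each document, then measure the similarity of two documents as the cosine of the angle between their multidimensional term vectors. With that similarity measure, the standard k-means algorithm is enough to cluster the texts. The source code is implemented in Java.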
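The three steps described above (TF-IDF weighting, cosine similarity, k-means) can be sketched as follows. This is a minimal illustration and not the code shipped in the archive: the class name `TfIdfKMeansSketch`, the smoothed IDF variant `log(N/df) + 1`, and seeding the centroids with the first k documents are all assumptions made for the example. Documents are assumed to be already tokenised.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch class; not part of the textcluster package.
class TfIdfKMeansSketch {

    /** Builds one TF-IDF vector per document over the shared vocabulary. */
    static double[][] tfidf(List<List<String>> docs) {
        LinkedHashSet<String> seen = new LinkedHashSet<>();
        for (List<String> doc : docs) seen.addAll(doc);
        List<String> vocab = new ArrayList<>(seen);
        int n = docs.size();
        double[][] weights = new double[n][vocab.size()];
        for (int t = 0; t < vocab.size(); t++) {
            String term = vocab.get(t);
            int df = 0; // number of documents containing the term
            for (List<String> doc : docs) if (doc.contains(term)) df++;
            // Assumption: smoothed IDF, so terms found in every document keep some weight.
            double idf = Math.log((double) n / df) + 1.0;
            for (int i = 0; i < n; i++) {
                List<String> doc = docs.get(i);
                double tf = (double) Collections.frequency(doc, term) / Math.max(1, doc.size());
                weights[i][t] = tf * idf;
            }
        }
        return weights;
    }

    /** Cosine of the angle between two vectors; 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Plain k-means that assigns each vector to the centroid with the highest cosine similarity. */
    static int[] kMeans(double[][] x, int k, int maxIter) {
        int n = x.length, dim = x[0].length;
        double[][] centroids = new double[k][];
        // Assumption: deterministic seeding with the first k documents.
        for (int c = 0; c < k; c++) centroids[c] = x[c].clone();
        int[] label = new int[n];
        Arrays.fill(label, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (cosine(x[i], centroids[c]) > cosine(x[i], centroids[best])) best = c;
                }
                if (best != label[i]) { label[i] = best; changed = true; }
            }
            if (!changed) break; // assignments are stable, so the clustering has converged
            for (int c = 0; c < k; c++) {
                double[] mean = new double[dim];
                int count = 0;
                for (int i = 0; i < n; i++) {
                    if (label[i] != c) continue;
                    count++;
                    for (int j = 0; j < dim; j++) mean[j] += x[i][j];
                }
                if (count == 0) continue; // empty cluster: keep the previous centroid
                for (int j = 0; j < dim; j++) mean[j] /= count;
                centroids[c] = mean;
            }
        }
        return label;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("apple", "banana", "fruit"),
                Arrays.asList("car", "engine", "wheel"),
                Arrays.asList("apple", "fruit", "juice"),
                Arrays.asList("car", "wheel", "road"));
        System.out.println(Arrays.toString(kMeans(tfidf(docs), 2, 10)));
    }
}
```

Seeding with the first k documents keeps the example deterministic; a real implementation would more likely pick random or otherwise spread-out seeds, since k-means is sensitive to its starting centroids.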

Code snippet and file information
package textcluster;

import java.util.List;

/**
 * Tokeniser interface.
 */
public interface ITokeniser
{
    /** Splits the input text into a list of tokens. */
    List partition(String input);
}
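As an illustration of how the interface above might be implemented, here is a minimal sketch. `SimpleTokeniser` is a hypothetical class, not the `Tokeniser.java` from the archive, and the interface is repeated (without the package line) only so the example compiles on its own. It lowercases the input and splits on anything that is not a letter or digit; Chinese text would need a real word segmenter, since a run of CJK characters stays a single token here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Repeated from the snippet above so this example is self-contained.
interface ITokeniser {
    List partition(String input);
}

// Hypothetical implementation for illustration only.
class SimpleTokeniser implements ITokeniser {
    @Override
    public List<String> partition(String input) {
        List<String> tokens = new ArrayList<>();
        if (input == null) {
            return tokens; // treat missing input as an empty document
        }
        // Split on runs of characters that are neither letters nor decimal digits.
        for (String piece : input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece); // skip the empty string produced by a leading separator
            }
        }
        return tokens;
    }
}
```

Calling `new SimpleTokeniser().partition("Hello, World! 42")` yields the tokens `hello`, `world`, and `42`.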
Attribute        Size  Date       Time   Name
----------- ---------  ---------- -----  ----
File             1510  2009-05-08 07:30  textcluster\WawaCluster.java
File             5669  2009-05-08 07:57  textcluster\WawaKMeans.java
File              204  2009-05-07 11:02  textcluster\ITokeniser.java
File             1487  2009-05-07 21:58  textcluster\Tokeniser.java
File             3474  2009-05-08 07:55  textcluster\Program.java
File             1152  2009-05-07 22:02  textcluster\StopWordsHandler.java
File             1392  2009-05-07 11:04  textcluster\TermVector.java
File             6930  2009-05-08 10:27  textcluster\TFIDFMeasure.java
File              606  2009-05-07 10:45  textcluster\input.txt
Directory           0  2009-05-08 16:55  textcluster
----------- ---------  ---------- -----  ----
Total           22424                    10 entries