Many big-data problems can be expressed as finding "similar" items. In this project we will investigate similarities among 21578 documents from a cleaned-up collection of documents made available by ...
collection consists of 22 data files, an SGML DTD file describing the data file format, and six files describing the categories used to index the data. Each of the first 21 files as "Sets" of "Strings ...