Document Distance Program Version 1 - 6.006 Wiki

The initial version of our program for computing the distance between two documents is here: http://courses.csail.mit.edu/6.006/fall08/source/docdist1.py

This program seems to give correct results. Here is some output:

>docdist1.py t1.verne.txt t2.bobsey.txt 
File t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)

>docdist1.py t2.bobsey.txt t2.bobsey.txt 
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.000000 (radians)

>docdist1.py t2.bobsey.txt t3.lewis.txt 
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t3.lewis.txt : 15996 lines, 182355 words, 8530 distinct words
The distance between the documents is: 0.574160 (radians)
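The "(radians)" unit suggests that the distance is the angle between the two documents' word-frequency vectors. The following is a minimal sketch of that computation, not the code in docdist1.py. The function names and the rule for splitting text into words (runs of letters and digits, lowercased) are assumptions made for illustration, and both inputs are assumed to contain at least one word.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs in text.

    Assumption: a word is a maximal run of letters or digits, lowercased.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def inner_product(freq_a, freq_b):
    """Sum of count_a(word) * count_b(word) over the words in freq_a.

    A Counter returns 0 for a missing word, so words that appear in only
    one document contribute nothing.
    """
    return sum(count * freq_b[word] for word, count in freq_a.items())


def vector_angle(freq_a, freq_b):
    """Angle in radians between two word-frequency vectors."""
    numerator = inner_product(freq_a, freq_b)
    denominator = math.sqrt(
        inner_product(freq_a, freq_a) * inner_product(freq_b, freq_b)
    )
    # Clamp to 1.0 so floating-point rounding cannot push acos out of range.
    return math.acos(min(1.0, numerator / denominator))
```

Under this definition a document compared with itself gives an angle of 0, which matches the second run above, and two documents with no words in common give pi/2.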

However, this program becomes very SLOW as the inputs get large.

The last example above took approximately THREE MINUTES!

There seems to be no hope of comparing all of Shakespeare's works to all of Churchill's in a reasonable amount of time...

What is wrong with the efficiency of this program?

Can you figure it out?
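One way to look for the answer is to profile the program and see where the time goes. The sketch below uses Python's standard cProfile and pstats modules. The helper name profile_call is hypothetical, and it assumes the computation you want to measure can be invoked as an ordinary function call.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run func(*args) under the profiler.

    Returns the function's result and a text report listing the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

For example, `result, report = profile_call(some_function, some_argument)` followed by `print(report)` shows which functions account for most of the running time. Comparing reports for a small and a large input hints at which step grows faster than the input does.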

