02.17.07
Computer Science At Last, ISR Homework
At long last I have an update! A real update. I have been delving ever so slightly into the realm of Information Storage and Retrieval (ISR). This wasn’t my first time writing code to build a dictionary of terms, but the other lessons have been interesting. Generating effective stop lists and weighting terms are both very practical techniques. And our textbook is free!
–
Assignment 1:
Read the computing abstracts file and generate a dictionary of terms. Convert terms to lower case, and remove extraneous endings and special characters. Sort the dictionary by frequency and turn in, along with the program, the first page of the sorted results.
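To make the steps concrete, here is a minimal sketch of the dictionary-building part. It is not the library code itself: the names normalize, build_dictionary, build_dictionary_from_text, and sort_by_frequency are made up for illustration, and it assumes that "special characters" means anything that is not a letter or digit. It does not strip extraneous endings (no stemming), and a hyphenated word such as "well-known" collapses into "wellknown".

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: lower-case a raw token and drop everything
// that is not a letter or digit.
std::string normalize(const std::string& raw) {
    std::string term;
    for (unsigned char c : raw) {
        if (std::isalnum(c)) term += static_cast<char>(std::tolower(c));
    }
    return term;
}

// Count how often each normalized term occurs in the stream.
std::map<std::string, int> build_dictionary(std::istream& in) {
    std::map<std::string, int> counts;
    std::string raw;
    while (in >> raw) {                      // whitespace-delimited tokens
        std::string term = normalize(raw);
        if (!term.empty()) ++counts[term];
    }
    return counts;
}

// Convenience wrapper for text that is already in memory.
std::map<std::string, int> build_dictionary_from_text(const std::string& text) {
    std::istringstream in(text);
    return build_dictionary(in);
}

// Return (term, count) pairs, most frequent first. std::map iterates
// alphabetically and stable_sort keeps that order for equal counts.
std::vector<std::pair<std::string, int>> sort_by_frequency(
        const std::map<std::string, int>& counts) {
    typedef std::pair<std::string, int> Entry;
    std::vector<Entry> sorted(counts.begin(), counts.end());
    auto more_frequent = [](const Entry& a, const Entry& b) {
        return a.second > b.second;
    };
    std::stable_sort(sorted.begin(), sorted.end(), more_frequent);
    assert(std::is_sorted(sorted.begin(), sorted.end(), more_frequent));
    return sorted;
}
```

Passing std::cin to build_dictionary and printing the first screenful of sort_by_frequency’s result would give roughly the "first page of the sorted results" the assignment asks for.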
The code, library, and related files:
http://www.nick-cash.com/download/isr/1/
Notes:
I spent more time on this than it required, simply to get a workable namespace going. All classes and related functions will go into the ISR library for use in later homework, and, at the end, I’ll have a workable Info Storage and Retrieval library with tested code.
—
Assignment 2:
Using the dictionary from above, select high frequency and low frequency words and from these build a file that will be a stop list. Write a program that will process the abstracts (title and abstract). Your program will read each word in each title/abstract text and ignore words that are in the stop list. It will build a new dictionary vector giving the total number of times each word occurs and a document frequency vector giving the number of documents in which each term occurs. It will then calculate, for each word, the Inverse Document Frequency weight of the word. Hint: for each document, build the dictionary directly and generate a document-term matrix. Then, after all documents have been processed, create the document-frequency vector by going through the document vectors and incrementing the D-F vector for each word found.
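The counting and weighting described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library code: CorpusStats, collect_stats, idf, and idf_weights are hypothetical names, each document is assumed to be an already-normalized string of space-separated terms, and the weight is assumed to be log2(N / df), where N is the number of documents and df is the term’s document frequency. The textbook may define the weight slightly differently (for example, with a different log base or an added constant).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the two vectors the assignment asks for.
struct CorpusStats {
    std::map<std::string, int> term_freq;  // total occurrences across all documents
    std::map<std::string, int> doc_freq;   // number of documents containing the term
    std::size_t num_docs = 0;
};

// Each element of docs is one title/abstract, already lower-cased and cleaned.
CorpusStats collect_stats(const std::vector<std::string>& docs,
                          const std::set<std::string>& stop_list) {
    CorpusStats stats;
    stats.num_docs = docs.size();
    for (const std::string& doc : docs) {
        // One row of the document-term matrix: term -> count in this document.
        std::map<std::string, int> doc_vector;
        std::istringstream words(doc);
        std::string word;
        while (words >> word) {
            if (stop_list.count(word) == 0) ++doc_vector[word];
        }
        // Fold the row into the totals; each term adds 1 to doc_freq
        // no matter how many times it appeared in this document.
        for (const auto& entry : doc_vector) {
            stats.term_freq[entry.first] += entry.second;
            ++stats.doc_freq[entry.first];
        }
    }
    return stats;
}

// Assumed weighting: log2(N / df). Terms found in every document get 0.
double idf(std::size_t num_docs, int doc_freq) {
    assert(num_docs > 0 && doc_freq > 0);
    return std::log2(static_cast<double>(num_docs) / doc_freq);
}

// Compute the assumed IDF weight for every term that survived the stop list.
std::map<std::string, double> idf_weights(const CorpusStats& stats) {
    std::map<std::string, double> weights;
    for (const auto& entry : stats.doc_freq) {
        weights[entry.first] = idf(stats.num_docs, entry.second);
    }
    return weights;
}
```

Unlike the hint, this folds each document vector into the totals as soon as the document is read, so the full document-term matrix is never kept in memory. Storing every doc_vector and making a second pass would match the hint more literally and give the same counts.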
The code, library, and related files:
http://www.nick-cash.com/download/isr/2/
Notes:
This assignment looked daunting, but once I got started it was actually relatively easy. It was just a lot of work. I hit a few troublesome spots, mainly in reading the file (I should have written my own reading routine that processed each character instead of extracting from std::cin) and from being tired. Overall I’m fairly proud of this code, as I didn’t know how to do any of it when I started. However, as with most code I look back on, I see things I definitely could improve. This code isn’t particularly efficient in my opinion, but overall it still runs fairly fast.
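For what it’s worth, a character-by-character reading routine of the kind mentioned above might look something like this. It is only a sketch: read_terms and read_terms_from_text are hypothetical names, and it assumes that any character other than a letter or digit ends the current term, so "state-of-the-art" comes out as four separate terms.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read the stream one character at a time, lower-casing as we go and
// treating every non-alphanumeric character as a term separator.
std::vector<std::string> read_terms(std::istream& in) {
    std::vector<std::string> terms;
    std::string current;
    char ch;
    while (in.get(ch)) {
        unsigned char uc = static_cast<unsigned char>(ch);
        if (std::isalnum(uc)) {
            current += static_cast<char>(std::tolower(uc));
        } else if (!current.empty()) {
            terms.push_back(current);   // separator reached: finish the term
            current.clear();
        }
    }
    // The loop should have stopped because the input ran out.
    assert(in.eof());
    if (!current.empty()) terms.push_back(current);  // term at end of input
    return terms;
}

// Convenience wrapper for text that is already in memory.
std::vector<std::string> read_terms_from_text(const std::string& text) {
    std::istringstream in(text);
    return read_terms(in);
}
```

Calling read_terms(std::cin) would replace the word-at-a-time extraction, and punctuation never reaches the dictionary because it is dropped while reading.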