Artificial Intelligence (Python lab)

Assignment 1 (15% of total course weight) – Machine Learning

California State University San Bernardino, School of Computer Science and Engineering (CSE)

Date of Issue: April 19, 2020; Date of submission: April 28, 2021 – 11:59 pm (PST)

Module: CSE 5120 Introduction to Artificial Intelligence

Assessment brief: Text mining is becoming increasingly popular with the volume of text data generated every day (such as tweets and emails). A number of applications, such as sentiment analysis, document classification, topic classification, text summarization, and machine translation, have already been automated successfully with various machine learning approaches. This assignment is based on one such text mining task: building a spam email filter as a document classification problem. You will develop an email classification model that classifies an email as spam or non-spam (i.e., ham); the Spam folder in a Gmail account is a familiar example of this. Two email datasets are provided as the mail corpora for this assignment. For the first part, we will use the Ling-spam email dataset for email classification[footnoteRef:1]. The emails in the Ling-spam subset (ling-spam.zip) are already split into train and test sets and will be used for the guided part (with code segments given in the Assignment details section). A preprocessed form of the Enron-spam dataset, for which you will write/modify the code in your classifier_enron.py file, is provided in enron-spam.zip. [1: I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, G. Paliouras, and C.D. Spyropoulos, "An Evaluation of Naive Bayesian Anti-Spam Filtering". In Potamias, G., Moustakis, V. and van Someren, M. (Eds.), Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), Barcelona, Spain, pp. 9-17, 2000.]

The code for this assignment consists of three Python files (classifier_lingSpam.py, classifier_enron.py, and evaluation.py) which you will need to read and understand in order to complete the assignment. You can download all the supporting code and files as a zip folder (emailClassification.zip) from the Assignment 1 link on Blackboard.
Your assignment consists of two parts, as given below:
1. Code implementing email spam filtering via support vector machines in the given classifier_lingSpam.py, classifier_enron.py, and evaluation.py files (in the specific sections indicated in detail below)
2. A brief report on what you did for your classifiers (i.e., how you implemented them, with screenshots from the evaluation.py script given in the folder)

File Name – Description

classifier_lingSpam.py – Where all of your email spam filter classifier code for the Ling-spam dataset (ling-spam.zip) will reside. In this file, you only need to implement the initialization and training of your email spam filter (classifier) as given in the Assignment details section. You should print the confusion matrix and accuracy score for this dataset and copy them into your assignment report.

classifier_enron.py – Where all of your email spam filter classifier code for the Enron-spam dataset (enron-spam.zip) will reside. In this file, you will need to copy and paste most of the code you wrote in classifier_lingSpam.py, such as create_Dictionary and extract_features, with the minor modifications required to handle the different format of the dataset. At the end of this file, you should save your classifier and upload the saved classifier. It will be used to load your trained classifier, predict the classes of the test cases (given in the evaluation.py file), and print your confusion matrix in the output window. You should also print the confusion matrix and accuracy through your code in this file and copy them into your assignment report.

evaluation.py – The evaluation file where you will implement your code to load your classifier, run it on the test cases for the selected dataset, and print your confusion matrix and accuracy score.

emailClassifier_enron.sav – The trained model (SVM) on the Enron-spam dataset that you will produce in your classifier_enron.py file using the pickle library (see the description in the following section).

Assignment Report.doc – Report explaining your work for all of the above, with descriptions and screenshots. See below for more details.

You can use Spyder (installed through Anaconda in Week 1 Thursday's lecture) or any other IDE.

Files to Edit and Submit: You will need to edit and submit the classifier_lingSpam.py and classifier_enron.py files to implement your email spam filters for the Ling-spam and Enron-spam datasets (i.e., ling-spam.zip and enron-spam.zip). You can copy and paste all the necessary pieces of code that we discuss in the Assignment details section. Once you have completed the implementation of your classifiers in these files, you should save your classifier for only the Enron-spam dataset with the name "emailClassifier_enron.sav" and upload it in the zipped folder of your assignment. To save and load models, you can use the pickle library (part of the Python standard library), as pickle is the standard way of serializing objects in Python. You are welcome to search for "save and load machine learning models using pickle" via Google or your favorite search engine for help in implementing this part. Once you have completed your classifier_lingSpam.py and classifier_enron.py files, you will need to implement the code in the evaluation.py file to load your classifier (i.e., emailClassifier_enron.sav) via pickle, predict the classes for the test cases using the 60%/40% split described below, and print the confusion matrix and accuracy score. For the confusion matrix and accuracy score, please refer to the code we covered in Week 12 for implementing Support Vector Machines and Decision Tree algorithms. You will need to test your classifier with evaluation.py and copy the screenshots into your report.

Academic Dishonesty: Your code will be checked against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, it will be detected, so please do not try it; submit your own work only. In case of cheating, the University's academic policies on cheating and dishonesty will strictly apply, which may result in penalties ranging from a grade deduction to expulsion.

Getting Help: If you are having difficulty implementing the classifier, contact the course staff for help and refer to the content recorded and uploaded in Week 12 part 2 (Thursday) on Machine Learning. Office hours and Slack are there for your support. If you are not able to attend office hours, please inform your instructor to arrange additional time. The intent is to make this assignment rewarding and instructional, not frustrating and demoralizing. You can either complete this assignment on your own or discuss the problem and collaborate with another member of the class (or a different section). Please clearly acknowledge in your assignment report the group member you collaborated with on this assignment. Your report and program (classifier_lingSpam.py, classifier_enron.py, evaluation.py files and the emailClassifier_enron.sav model) must be submitted separately by yourself on Blackboard, irrespective of your collaboration with your group member. Group discussions are encouraged, but copying of programs is not; programming based on your own skills is expected.

Assignment Details

1. Classifier classifier_lingSpam.py for the Ling-spam dataset (ling-spam.zip) – 3%

This dataset is used to implement and test your first classifier. The code written in the various parts of this section can be reused for the second part (enron-spam.zip); because the directory structure of enron-spam.zip is different, you will need to change the data preprocessing functions to split that data into train and test sets.
The following steps will be followed to build a support vector machine classifier (classifier_lingSpam.py) for Ling-spam dataset:
a) Data preprocessing: Text data preparation.
b) Word dictionary creation.
c) Feature extraction for support vector machine classifier training.
d) Training and testing the support vector machine classifier.
Once trained on the training dataset, the classifier will be tested on the test dataset (given within the ling-spam.zip folder).
a. Data preprocessing: Text data preparation.

Ling-spam is already split into training and test sets with 702 and 260 emails respectively, divided equally between spam and ham mails. Spam emails can be recognized within the dataset as they contain *spmsg* in the filenames.
Data preprocessing (i.e., text cleaning) is the first step of your work, in which words that are unlikely to contribute to the information we want to extract are removed from the documents. Emails may contain a vast number of undesirable tokens (e.g., punctuation marks, stop words, and digits) which are not helpful in detecting spam. The emails in the Ling-spam corpus have already been preprocessed in the following ways:
· Removal of stop words – Stop words such as "and", "the", and "of" are common in English sentences and do not provide meaningful information for deciding spam or legitimate status, so they are removed from the emails.
· Lemmatization – the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. For example, "include", "includes," and "included" would all be represented as "include". Unlike stemming (another common technique in text mining, which ignores the context of the sentence), lemmatization preserves the context of the sentence.
Non-words, including punctuation marks and special characters, should also be removed from the email documents. It is convenient to remove them after creating the word dictionary, since they then only need to be removed once (from the dictionary) rather than from every email. The dictionary code below already does this, so you do not need to do anything extra for this step.

b. Creating word dictionary.

A sample email in the dataset will look like the following:
Subject: posting
hi , ‘ m work phonetics project modern irish ‘ m hard source . anyone recommend book article english ? ‘ , specifically interest palatal ( slender ) consonant , work helpful too . thank ! laurel sutton ( sutton @ garnet . berkeley . edu
It can be seen that the first line of the email is the subject and the third line contains the body of the email. You will only carry out text analysis on the body content to detect spam. First, you will create a dictionary of words and their frequencies. For this task, the 702 emails of the training set are utilized. The Python function below will create the dictionary for you.

def create_Dictionary(train_dir):
    # requires: import os; from collections import Counter
    emails = [os.path.join(train_dir, f) for f in os.listdir(train_dir)]
    all_words = []
    for mail in emails:
        with open(mail) as m:
            for i, line in enumerate(m):
                if i == 2:  # body of the email is on the 3rd line of the text file
                    words = line.split()
                    all_words += words

    dictionary = Counter(all_words)
    # Code for non-word removal: drop non-alphabetic tokens and single characters
    list_to_remove = list(dictionary.keys())  # copy the keys so we can delete while iterating
    for item in list_to_remove:
        if item.isalpha() == False:
            del dictionary[item]
        elif len(item) == 1:
            del dictionary[item]
    dictionary = dictionary.most_common(3000)  # keep the 3000 most frequent words
    return dictionary

The dictionary can be inspected with print(dictionary). You may find unusual counts for some words, which is not surprising; there is always room to improve the dictionary later. Make sure your dictionary contains entries like the ones shown below among its most frequent words. In the code above, the 3000 most frequently used words are kept in the dictionary.
[('order', 1414), ('address', 1293), ('report', 1216), ('mail', 1127), ('send', 1079), ('language', 1072), ('email', 1051), ('program', 1001), ('our', 987), ('list', 935), ('one', 917), ('name', 878), ('receive', 826), ('money', 788), ('free', 762)]
c. Feature extraction process.

After creating the dictionary, you will extract a word count vector (the features) of 3000 dimensions for each email in the training set. Each word count vector contains the frequency of the 3000 dictionary words in that training file, and most of the counts will be zero. For example, suppose the dictionary contained 500 words. Each word count vector would then contain the frequency of those 500 dictionary words in the training file. If the text in a training file was "Get the work done, work done", it would be encoded as:
[0,0,0,0,0,…….0,0,2,0,0,0,……,0,0,1,0,0,…0,0,1,0,0,……2,0,0,0,0,0]
Here, the non-zero word counts are placed at the 296th, 359th, 415th, and 495th indices of the 500-length word count vector and the remaining entries are zero.
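To make this encoding concrete, here is a small illustrative sketch (not part of the required code; the toy dictionary and its word order are made up purely for illustration) that builds a count vector for the sentence above:

import numpy as np
from collections import Counter

# Hypothetical 4-word dictionary in most_common() format: (word, corpus frequency)
toy_dictionary = [('get', 10), ('work', 8), ('done', 7), ('the', 5)]

words = "Get the work done work done".lower().split()
counts = Counter(words)

feature_vector = np.zeros(len(toy_dictionary))
for j, (word, _) in enumerate(toy_dictionary):
    feature_vector[j] = counts[word]   # frequency of each dictionary word in this text

print(feature_vector)  # [1. 2. 2. 1.]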
You will need the code below to generate your feature vector matrix, whose rows correspond to the 702 files of the training set and whose columns correspond to the 3000 dictionary words. The value at index ij is the number of occurrences of the jth dictionary word in the ith file.

def extract_features(mail_dir):
    # uses the global `dictionary` built by create_Dictionary()
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for i, line in enumerate(fi):
                if i == 2:  # body of the email is on the 3rd line
                    words = line.split()
                    for word in words:
                        wordID = 0  # stays 0 if the word is not in the dictionary
                        for j, d in enumerate(dictionary):
                            if d[0] == word:
                                wordID = j
                        features_matrix[docID, wordID] = words.count(word)
        docID = docID + 1
    return features_matrix

d. Training the classifiers.

Similar to our examples in class, you will use the scikit-learn ML library for training your SVM classifier. After copying the above functions into your classifier_lingSpam.py file, make sure you import all the required packages at the beginning of your file to avoid compile-time or run-time errors.
You will train a Support Vector Machine (SVM) classifier. SVMs are supervised binary classifiers which are very effective when the number of features is large. The goal of an SVM is to find a hyperplane that separates the two classes of training data; the training instances that lie closest to this hyperplane, and hence define its margin, are called support vectors. The decision function of the SVM model, which predicts the class of the test data, is based on these support vectors and makes use of a kernel (linear, rbf, or others).
After training your classifier, you will need to check its performance on the test set. The code below extracts the word count vector for each email in the test set and predicts its class (ham or spam) with the SVM classifier. Make sure you include the two functions you created in parts (b) and (c) above (create_Dictionary and extract_features).

# Import required libraries
import os
import numpy as np
from collections import Counter
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, accuracy_score

# (the create_Dictionary and extract_features functions from parts (b) and (c) go here)

# Create a dictionary of words with their frequencies
train_dir = 'train-mails'
dictionary = create_Dictionary(train_dir)

# Prepare feature vectors per training email and their labels
train_labels = np.zeros(702)
train_labels[351:702] = 1   # second half of the training emails are spam
train_matrix = extract_features(train_dir)

# Training the SVM classifier
model = LinearSVC()
model.fit(train_matrix, train_labels)

# Test unseen emails for spam
test_dir = 'test-mails'
test_matrix = extract_features(test_dir)
test_labels = np.zeros(260)
test_labels[130:260] = 1    # second half of the test emails are spam
result = model.predict(test_matrix)
print(confusion_matrix(test_labels, result))
print(accuracy_score(test_labels, result))

# Write the code below to save your model (you can use pickle)
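# A minimal sketch of one possible way to save the model with pickle
# (assuming `model` is the trained classifier from above; in part 2 the same
# approach is used in classifier_enron.py to produce emailClassifier_enron.sav):
import pickle
with open('emailClassifier_enron.sav', 'wb') as f:
    pickle.dump(model, f)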

e. Model performance evaluation

The test set of the Ling-spam dataset contains 260 emails (130 spam and 130 non-spam). The print(confusion_matrix(...)) line in the code above should show results similar to the table below. The diagonal elements represent correctly identified emails (i.e., true classifications) and the off-diagonal elements represent misclassified emails.

SVM (Linear)     Ham     Spam
Ham              126     4
Spam             6       124

(rows: actual class, columns: predicted class)

The above classification shows good predictive accuracy, even on test data that was used neither for creating the dictionary nor for training. You used LinearSVC in this example; you can also try SVC (with the linear kernel option) and see whether the result improves. For help with SVC and the linear kernel, please review the video and code we covered in Week 12.
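If you try that comparison, a minimal sketch of the swap (reusing the train_matrix, train_labels, and test_matrix variables from the code above) would be:

from sklearn.svm import SVC

# SVC with a linear kernel; results may differ slightly from LinearSVC because
# the two classes use different underlying solvers and default parameters
model = SVC(kernel='linear')
model.fit(train_matrix, train_labels)
result = model.predict(test_matrix)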

Note: Please print your confusion matrix and accuracy score from this code and copy them into section 1 of your assignment report. You can provide a one-page (maximum) description of your observations from this exercise in your report. Please be precise and brief in your description.

2. Classifier classifier_enron.py for the Enron-spam dataset (enron-spam.zip) – 7%

Download the preprocessed form of the Enron-spam corpus, which contains 33,716 emails in 6 directories. Each of the 6 directories contains 'ham' and 'spam' folders. The total numbers of non-spam and spam emails are 16,545 and 17,171, respectively.
Follow the same steps described in section 1 and check the performance of your Support Vector Machine on this new dataset. Please note that the directory structure of this corpus is different from that of the Ling-spam dataset, so you will have to either reorganize your Enron-spam dataset so that the create_Dictionary(dir) and extract_features(dir) functions can be applied to it, or modify those functions to support the new directory structure.
Whichever pathway you take, divide your dataset into training and testing portions using a 60:40 split (60% for training and 40% for testing). Feel free to look for online help on how to split data into training and test sets manually; you might find that easier for this section. You should print the confusion matrix and accuracy score for your model on the new dataset and copy the results into section 2 of your assignment report (for the Enron dataset).
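One possible way to make that split is scikit-learn's train_test_split; the sketch below assumes you have already built a features_matrix and a labels array for the whole Enron corpus (both names are placeholders for your own variables):

from sklearn.model_selection import train_test_split

# 60% training / 40% testing; random_state fixes the shuffle for reproducibility
train_matrix, test_matrix, train_labels, test_labels = train_test_split(
    features_matrix, labels, test_size=0.40, random_state=42)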
Part (e) of the previous section (model performance evaluation) can be followed in this part to generate the confusion matrix and accuracy score for your new model on the Enron dataset. In this file, you should also implement code at the end to save the model for evaluation by the staff. You will then write code in evaluation.py to load your saved model, make predictions on the test set that you created on your computer, and print the confusion matrix and accuracy score.

Evaluation: In your evaluation.py file, you will write code to load your saved model from section 2, use the extract_features function to build the test_matrix, make predictions with the model on this dataset, and finally print the confusion matrix and accuracy score. Copy these metrics (confusion matrix and accuracy score) into section 2 of your assignment report.
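As a rough guide, evaluation.py might look like the sketch below; the test directory name is a placeholder, and it assumes that extract_features, the word dictionary, and the true labels of your 40% test split (test_labels) are available in this file, for example copied from classifier_enron.py:

import pickle
from sklearn.metrics import confusion_matrix, accuracy_score

# Load the classifier saved by classifier_enron.py
with open('emailClassifier_enron.sav', 'rb') as f:
    model = pickle.load(f)

# Build features for the test emails ('enron-test-mails' is a placeholder path)
test_matrix = extract_features('enron-test-mails')

result = model.predict(test_matrix)
print(confusion_matrix(test_labels, result))
print(accuracy_score(test_labels, result))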

Note: For this part, please write a description of your work in the report (2-page maximum); it may include your modifications to the helper functions to support the new directory structure, your use of the SVC model instead of LinearSVC, and your observations.
3. Assignment Report – 5%

Your report will consist of two sections. Section one will explain your observations and work for the Ling-spam dataset. Section two will explain your work on the Enron-spam dataset, including your explanation of the function modifications and the use of SVC (with the linear kernel) instead of LinearSVC. Please include screenshots (or manually edited versions) of your confusion matrices and accuracy scores for both sections to support your explanation.


Assignment 1 (15%)
 
 
 

CSE 5120 (Section ##) – Introduction to Artificial Intelligence – Spring 2021 

 

 

Submitted to  

Department of Computer Science and Engineering
California State University, San Bernardino, California 

 

by

Student name (CSUSB ID)
(Your collaborator in this homework (if any))
 

 
Date: Month Day, Year

 
 
 
 
 
 

Email:

· Your email

· Your collaborator’s email (if you collaborated with any)

Assignment Report

Brief description of your work here, acknowledging your collaboration with a classmate (or a friend from another CSE 5120 section) and the capacity in which they collaborated with you, followed by the algorithms you implemented.
1. Classifier_lingSpam.py for Ling-spam dataset

Your brief explanation of the dataset, your code solution, and any documentation, with screenshots of your evaluation results (from classifier_lingSpam.py).
2. Classifier_enron.py for Enron-spam dataset

Your brief explanation of the dataset, your code solution, and any documentation, with screenshots of your evaluation results (from classifier_enron.py).
3. Evaluation (evaluation.py) for your model performance evaluation – optional

You can also provide a brief description of the code written in evaluation.py to load the saved model so that it can be readily run on the test dataset by the staff. This section is optional, and you can skip it.
