Disclaimer, for the sake of keeping these posts relatively short, I will not be conducting robust cross-validation, or using more advanced NLP techniques. This is purely to show a simple NLP multi-class classification task, and how we can interpret the results.

High Level Idea

Trying to convince your stakeholder that your model is a black box of magic that they can simply trust is, understandably, a hard sell. Shapley values are an absolute must and in my professional experience have been responsible for major buy in to a modeling effort.

Were going to do multi-class text classification, with 4 classes, show how to do summary plots, interpret an example for each of the 4 categories, and then show how an ambiguous sample is interpreted by all 4 classes.

To run this code you should use the requirements.txt as the versions of some packages are extremely important, and copy the model folder with the torch dense neural net implementation

Dataset: https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions

Code to run this yourself: https://github.com/AscherFriedman/AscherFriedman.github.io/tree/master/NLP_P1_Code

Includes modules that are in other directories and imports of packages

#Lets grab a csv containing historical information from the show jeapordy. Each Question belongs to a Category.
df = pd.read_csv('JEOPARDY_CSV.csv')
#Some of the columns come with odd names so lets strip white spaces before and after
df.rename(columns = {x:x.lstrip().rstrip() for x in df.columns}, inplace=True) 
df = df[['Question','Category']]
df.Category.value_counts().head(10)
BEFORE & AFTER             547
SCIENCE                    519
LITERATURE                 496
AMERICAN HISTORY           418
POTPOURRI                  401
WORLD HISTORY              377
WORD ORIGINS               371
COLLEGES & UNIVERSITIES    351
HISTORY                    349
SPORTS                     342
Name: Category, dtype: int64

As we can see there are many many categories and they are all relatively small.

Lets create 4 categories by combining others: History, Literature, Art, and Geography

#rename_cols = 
df.rename(columns = {x:x.lstrip().rstrip() for x in df.columns}, inplace=True) #EDA found spaces before and after column names
#Lets make some categories with 1500 observations each
df.Category.replace({'WORLD HISTORY':'HISTORY','AMERICAN HISTORY':'HISTORY','U.S. HISTORY':'HISTORY','THE CIVIL WAR':'HISTORY'},inplace=True)
df.Category.replace({'U.S. GEOGRAPHY':'GEOGRAPHY','WORLD GEOGRAPHY':'GEOGRAPHY',
                    'BODIES OF WATER':'GEOGRAPHY','ISLANDS':'GEOGRAPHY'},inplace=True)
df.Category.replace({'BOOKS & AUTHORS':'LITERATURE','SHAKESPEARE':'LITERATURE','FICTIONAL CHARACTERS':'LITERATURE','AMERICAN LITERATURE':'LITERATURE'},inplace=True)
df.Category.replace({'U.S. CITIES':'CITY_COUNTRY','STATE CAPITALS':'CITY_COUNTRY','WORLD CAPITALS':'CITY_COUNTRY'},inplace=True)
df.Category.replace({'ART & ARTISTS':'ART','BALLET':'ART','OPERA':'ART','MUSIC':'ART',
                    'CLASSICAL MUSIC':'ART'},inplace=True)
df = df[df.Category.isin(['HISTORY', 'LITERATURE', 'ART', 'GEOGRAPHY'])]
df.Category.value_counts()
HISTORY       1600
LITERATURE    1559
ART           1542
GEOGRAPHY     1507
Name: Category, dtype: int64

Lemmatizing and Speed

For a larger dataset I would suggest using the package pandarallel or using multiprocessing with an entire column at once as lemmatizing is known to be a slow process on large datasets. Mostly because it involves repeatedly using lookup tables for every word. Given this dataset is under 10,000 rows, lemmatizing is still quite fast.

stop_words = set(stopwords.words('english'))
wnl = WordNetLemmatizer()
# Two functions to do some basic cleaning, stripping odd characters and numbers, leaving alpha numeric.
def basic_clean(text):
    "Strips all non-alpha numeric text"
    # Regex pattern for only alphanumeric, hyphenated text
    return(re.sub(r'[^a-zA-Z -]', '', text))

def pre_processing(text):
    '''
    Tokenizes, Lemmatizes, Lower Cases, Removes Stopwords
    Joins back to Sentance'''
    lemma_list = ' '.join([wnl.lemmatize(x.lower()) for x in word_tokenize(text) if x.lower() not in stop_words])
    return lemma_list

df['clean'] = df['Question'].apply(basic_clean).apply(pre_processing)

Split Data

In another tutorial i’ll show how to better analyse results from a multi-class classification

#Label encoder for the the response variable
#Lets make the train and valid each 15% of the dataset
le = LabelEncoder().fit(df['Category'])
df['encoded'] = le.transform(df['Category'])
X_train, X_test, y_train, y_test = train_test_split(df['clean'], df['encoded'], test_size=.2, random_state=42,stratify=df['encoded'])

Very basic NLP algorithm

Convert to a tfidf matrix with 100 features, feed it into a 128,64,32 dense neural net with 4 classes using torch Then

#Vectorize to a tf-idf matrix max_features = 300 X_test_original = X_test.copy() #We save this tfidf matrix for the intepretation portion of this analysis tfidf = TfidfVectorizer(max_features=max_features) X_train = tfidf.fit_transform(X_train) X_test = tfidf.transform(X_test) feature_names = tfidf.get_feature_names_out ()

X_train = torch.tensor(X_train.todense()).to(device).float()
X_test = torch.tensor(X_test.todense()).to(device).float()

y_train = torch.tensor(y_train.values).to(device).float()
y_test = torch.tensor(y_test.values).to(device).float()
train_dataset = TFIDFDataset(X_train,y_train)
test_dataset = TFIDFDataset(X_test,y_test)

train_loader = DataLoader(dataset = train_dataset, batch_size = 32)
test_loader = DataLoader(dataset = test_dataset, batch_size = 16)
dataiter = iter(train_loader)
train_feat_ex, train_label_ex = dataiter.next()
model = dense_model(hidden_layers= [128,64,43], num_features = max_features, num_classes = len(le.classes_), p_drop=.15)

Again, this is a fairly small dataset and a non-robust cross validation procedure so these results are to be taken with a grain of salt, the point here is to highlight interpretation techniques, not performance

model, stats = train_nn(model, train_loader, test_loader, NUM_EPOCHS = 50, LR = .0001, LR_patience=5, early_stopping_callback= EarlyStoppingCallback())
  2%|█▋                                                                                 | 1/50 [00:00<00:43,  1.11it/s]

EPOCH 001: Loss: [Train: 1.36238 | test 1.24256 ] Accuracy: [Train: 31.964 | test 49.038] LR: 0.0001


  4%|███▎                                                                               | 2/50 [00:01<00:45,  1.05it/s]

EPOCH 002: Loss: [Train: 1.19640 | test 1.09368 ] Accuracy: [Train: 49.546 | test 64.103] LR: 0.0001


  6%|████▉                                                                              | 3/50 [00:02<00:43,  1.07it/s]

EPOCH 003: Loss: [Train: 1.04410 | test 0.94591 ] Accuracy: [Train: 61.679 | test 71.554] LR: 0.0001


  8%|██████▋                                                                            | 4/50 [00:03<00:43,  1.06it/s]

EPOCH 004: Loss: [Train: 0.92014 | test 0.84137 ] Accuracy: [Train: 68.717 | test 75.689] LR: 0.0001


 10%|████████▎                                                                          | 5/50 [00:04<00:41,  1.08it/s]

Message from callback (Early Stopping) counter: 1/5
no
EPOCH 005: Loss: [Train: 0.81978 | test 0.75311 ] Accuracy: [Train: 72.583 | test 77.420] LR: 0.0001


 12%|█████████▉                                                                         | 6/50 [00:05<00:40,  1.09it/s]

EPOCH 006: Loss: [Train: 0.72940 | test 0.68310 ] Accuracy: [Train: 75.935 | test 78.141] LR: 0.0001


 14%|███████████▌                                                                       | 7/50 [00:06<00:38,  1.11it/s]

Message from callback (Early Stopping) counter: 1/5
no
EPOCH 007: Loss: [Train: 0.66856 | test 0.63342 ] Accuracy: [Train: 77.958 | test 78.622] LR: 0.0001


 16%|█████████████▎                                                                     | 8/50 [00:07<00:37,  1.12it/s]

Message from callback (Early Stopping) counter: 2/5
no
EPOCH 008: Loss: [Train: 0.62259 | test 0.63164 ] Accuracy: [Train: 80.195 | test 78.990] LR: 1e-05


 18%|██████████████▉                                                                    | 9/50 [00:08<00:36,  1.12it/s]

Message from callback (Early Stopping) counter: 3/5
no
EPOCH 009: Loss: [Train: 0.61857 | test 0.62941 ] Accuracy: [Train: 80.442 | test 78.429] LR: 1e-05


 20%|████████████████▍                                                                 | 10/50 [00:09<00:36,  1.11it/s]

Message from callback (Early Stopping) counter: 4/5
no
EPOCH 010: Loss: [Train: 0.60936 | test 0.62494 ] Accuracy: [Train: 80.863 | test 79.151] LR: 1e-05


 20%|████████████████▍                                                                 | 10/50 [00:10<00:40,  1.00s/it]

Message from callback (Early Stopping) counter: 5/5
here
Training stopped -> Eearly Stopping Callback : test_loss: 0.6211988784563847

For the Shap package, deep explainer generally works fairly quick to nn.torch problems that arnt image classification

explainer = shap.DeepExplainer(model, shap.sample(X_test,25))
# Get the shap values from my test data
shap_values = explainer.shap_values(X_test)
%matplotlib inline

Functions to generate plots

Worth noting that the title option on shap plots is buggy, so we turn it off, add it manually matplotlib

def summary(feature):
    encoded = le.transform([feature])[0]
    shap.summary_plot(shap_values[encoded],X_test.cpu(), feature_names=feature_names, class_names=le.classes_,show=False)
    plt.title(feature,fontsize=16)
    plt.savefig(feature,bbox_inches='tight')
    plt.close()
    
def specific_example(feature,threshold,sample_num, save_fig = False):
    encoded = le.transform([feature])[0]
    sample_index = np.where(y_test.cpu().numpy()==encoded)[0][sample_num]
    shap_vals = shap_values[encoded][sample_index]
    use = shap_vals*[abs(shap_vals)>threshold][0]
    shap.bar_plot(use,feature_names=feature_names,max_display=(abs(use)>0).sum(),show=False)
    sentence  = df.loc[X_test_original.index[sample_index],'Question']
    plt.title(f'{feature} sample: "{sentence}"')
    if save_fig:
        plt.savefig(feature+'_barplot',bbox_inches='tight')
    plt.close()

Create Summary plot for all 4 classes

for feature in le.classes_[:]:
    summary(feature)

Note that in these shaply plots a word either exists or it doesnt, so blue just means no word, and red means words are in the sentance we are plotting.

ART

Here we can see that words such as “Ballet”, “Artist”, “Painter” and “Opera” tell the model that the question belongs to category ART. If it sees “Island” or “Novel”, it is a good indication that it is not art.

Geography

“Island”, “Gulf”, “Lake” and “Mile” tell the model that the question belongs to category Geography. If it sees “Wrote” or again “Novel”, it is a good indication that it is not geography.

History

History is an interesting case because the model is primarily defined by which words are not present. So if it sees “Novel” or “Artist” or “Wrote”, it knows its not history. Some words are still positive Indicators, like “War”.

Literature

“Novel”, “Wrote”, “Published” and “Book” tell the model that the question belongs to category Literature. If it sees “island” or “Country”, it is a good indication that it is not literature.

“Novel” ends up being important because theres a category that we combined into Literature called “Novels” so each question in that subcategory had the word Novel in it.

Picked some specific examples that are interesting to look at

specific_example('ART',.1,3, save_fig=True)
specific_example('HISTORY', .1, 1, save_fig=True)
specific_example('GEOGRAPHY', .1, 5, save_fig=True)
specific_example('LITERATURE', .1, 9, save_fig=True)

ART

“Opera” and “Composer” indicate its ART, but “California” and “West” indicate its not art! Still we can see the positives are much larger than the negatives and the model still likely choses art.

Geography

“Lake and “Part” tell the model its geography, but “Home” is a small hint it might not be (home probably belongs frequently to literature and history.

History

“Treaty” and “Country” tell the model it is likely History, but “French” says it may not be, probably because of many French Artists.

Literature

“Shakespear”, “Comedy”, “Love”, all hallmarks of literature, but so many literature questions contained the word “Novel” that the lack of this word makes the model uncertain

Lets Pick an interesting and very confliting case and see how the model evaluates the likelyhood of each category.

“Tradition says this country’s first emperor, Menelik 1, was the son of King Solomon and the Queen of Sheba” This is a history question.

for feature in le.classes_:
    encoded = le.transform(['HISTORY'])[0]
    sample_index = np.where(y_test.cpu().numpy() == encoded)[0][32]
    shap_vals = shap_values[le.transform([feature])[0]][sample_index]
    use = shap_vals*[abs(shap_vals) > .05][0]
    shap.bar_plot(use,feature_names = feature_names,max_display = (abs(use) > 0).sum(),show = False)
    sentence  = df.loc[X_test_original.index[sample_index],'Question']
    plt.title(f'{feature} sample: "{sentence}"')
    plt.savefig(feature+'_one_sample_mult_feat_',bbox_inches='tight')
    plt.show()

ART

The presence of words like “Country” and “Say” and “First”, as well as the absence of words like “Artist” and “Painter” all indicate this is not ART. Art is ruled out!

Geography

Geography sees words like country that give it some confidence, but “Say” and “King” are enough to know its not Geography. Geography is ruled out!

History

Here the model sees “King” and “Country” giving it confidence, but also “Say”, “First” and “Son” which makes it doubtful. Say usually belongs to literature, as does Son.

Literature

Here the model sees “Say”, “First” and “Son”, which are frequently found in literature, but it still has doubts given that it sees “Country” and “Emperor” and no “Novel”

print('True Label:',df.loc[X_test_original.index[sample_index]].Category)
probs = np.array(softmax(model(X_test[sample_index:sample_index+1]),dim=1).detach().cpu())[0]
print()
for category,prob in zip(le.classes_,probs):
    print(f'{round(prob*100,2)}% chance of being {category}')
print()
print('Modle Predicts: ', le.classes_[probs.argmax()])
True Label: HISTORY

14.71% chance of being ART
17.11% chance of being GEOGRAPHY
33.44% chance of being HISTORY
34.74% chance of being LITERATURE

Modle Predicts:  LITERATURE

The model is just barely wrong! With a little hyperparameter tuning, or simple tricks like adding in bi-grams, i’m sure it can figure it out!