Text Analytics For Suicidal Ideation Using NLP Techniques
An in-depth explanation of emotions and topic modelling for suicidal ideation
Suicidal ideation refers to thinking about, considering, or planning suicide. People who become suicidal often see suicide as the only solution when they feel overwhelmed by life's challenges. Risk factors believed to increase impulsive suicidal behaviour include a history of substance abuse, access to firearms, difficult life events, isolation from others, a history of mental illness, a history of physical or sexual abuse, chronic illness, and past suicide attempts.
With the rise of social media platforms, they have become "venting windows" that give users a space to utter their suicidal thoughts while remaining anonymous, since suicide remains a socially stigmatized topic to discuss openly. Suicidal notes can often be spotted by clear markers such as talk of killing themselves, their life having no purpose, feeling like a burden, feeling stuck, or not wanting to exist. Suicidal notes convey feelings, emotions, and behaviour. Some words carry an emotion at their semantic core; for example, "dejected" and "wistful" are associated with some amount of sadness. Other words may not denote an affect directly but are still associated with some degree of emotion; for example, "failure" and "death" are usually accompanied by sadness.
Objectives
We use NLP techniques to pinpoint the problems expressed in suicidal posts and, perhaps, inform prevention efforts. Specifically, we ask:
● What emotions do people express when they have suicidal thoughts?
● What unbearable problems do people face when they seek to end their lives?
● What are the keywords of each topic in the suicidal notes?
● Which emotion dominates each of the problems they face?
Methodology
We use various NLP and text mining techniques in this study. Figure 1 below depicts the methodology flow of the study.
Data Collection
To collect suicidal posts, we retrieve data from two sources. We web-scraped posts from the subreddit Suicidal_Thoughts, where users can express their thoughts in text, and we also downloaded a ready-made suicide notes dataset from Kaggle. After combining the data, we had a total of 1,490 posts. We then performed a first stage of cleaning, removing empty posts and duplicate posts.
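The first-stage cleaning can be sketched as follows. This is a minimal pure-Python illustration; the post strings are invented stand-ins for the scraped Reddit posts and the Kaggle notes.

```python
# First-stage cleaning: drop empty and duplicate posts, preserving order.
# The two lists below are illustrative stand-ins for the two data sources.
reddit_posts = ["I feel so alone", "", "I feel so alone", "nobody would notice"]
kaggle_notes = ["nobody would notice", "I can't keep going"]

combined = reddit_posts + kaggle_notes
seen = set()
cleaned = []
for post in combined:
    text = post.strip()
    if text and text not in seen:   # skip empties and exact duplicates
        seen.add(text)
        cleaned.append(text)

print(cleaned)
# -> ['I feel so alone', 'nobody would notice', "I can't keep going"]
```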
Data Cleaning
For the second stage of text data cleaning, we use the NLTK library to handle the text preprocessing tasks.
● Tokenization
For many NLP tasks, we need to access each word in a string. We first use Python's str.lower() to convert the text to lower case. To access each word, we tokenize the text into individual tokens using nltk's word_tokenize() function, which turns a string of text into a list of words.
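A minimal sketch of this step, using a simple regex as a stand-in for nltk's word_tokenize (the real tokenizer handles punctuation and clitics more carefully):

```python
import re

def tokenize(text):
    """Lowercase the text, then split it into word tokens
    (a regex stand-in for nltk.word_tokenize)."""
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize("I feel like a burden to everyone")
# -> ['i', 'feel', 'like', 'a', 'burden', 'to', 'everyone']
```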
● Cleaning (Noise Removal)
We then treat unwanted information such as punctuation, non-alphabetical tokens, special characters, and empty posts. We use the .sub() method in Python's regular expression (re) library to clean the noise. We also remove one-token posts, which do not make sense for text analysis.
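The noise-removal step can be sketched as below; the token lists are illustrative.

```python
import re

def remove_noise(tokens):
    """Strip non-alphabetic characters from each token and drop what's left empty."""
    stripped = [re.sub(r"[^a-z]", "", t) for t in tokens]
    return [t for t in stripped if t]

posts = [["i", "can't", "go", "on", "!!"], ["why"]]
cleaned_posts = [remove_noise(p) for p in posts]
cleaned_posts = [p for p in cleaned_posts if len(p) > 1]   # drop one-token posts
# -> [['i', 'cant', 'go', 'on']]
```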
● Normalization
Next, we normalize the text by removing stop-words and applying lemmatization. We use nltk's stopwords.words("english") function to remove the default stop-words. Lemmatization reduces inflected words to their lemma while ensuring the root word belongs to the language; for example, "run", "running" and "ran" all map to the lemma "run". We use the WordNet Lemmatizer package to look up lemmas.
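A minimal normalization sketch; the stop-word set and lemma table here are tiny hand-made stand-ins for nltk's stop-word list and the WordNet Lemmatizer.

```python
# Illustrative stand-ins for nltk's stopwords and WordNetLemmatizer.
STOPWORDS = {"i", "a", "the", "to", "am", "is", "was"}
LEMMAS = {"running": "run", "ran": "run", "friends": "friend"}

def normalize(tokens):
    """Drop stop-words, then map each remaining token to its lemma."""
    kept = [t for t in tokens if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in kept]

normalize(["i", "was", "running", "to", "my", "friends"])
# -> ['run', 'my', 'friend']
```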
However, even after lemmatization, the text is still not clean enough: spelling errors and contractions such as "idk" and "tbh" remain. We therefore extract these words and flag them as non-dictionary words. To process them, we first use the SnowballStemmer("english") package to stem the words, then apply Speller() to spell-check and autocorrect. With that, we managed to autocorrect 80% of the non-dictionary words. We manually annotated and corrected the remaining 20%: after inspecting the high-density non-dictionary words, we either (1) manually corrected and updated the word, (2) added the word directly to the dictionary, or (3) assigned the word as a stop-word.
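The treatment of non-dictionary words can be sketched as below. The dictionary, the manual-fix table, and the toy autocorrect function are all invented for illustration; the toy function stands in for the SnowballStemmer + Speller() pipeline.

```python
# Hand-made stand-ins for a real word list, manual annotations, and Speller().
DICTIONARY = {"run", "feel", "tired", "alone", "honest", "know"}
MANUAL_FIXES = {"idk": "know", "tbh": "honest"}   # contractions fixed by hand

def toy_speller(token):
    """Toy autocorrect: collapse a doubled final letter (e.g. 'tiredd' -> 'tired')."""
    return token[:-1] if len(token) > 1 and token[-1] == token[-2] else token

def fix_token(token):
    if token in DICTIONARY:
        return token                    # already a dictionary word
    if token in MANUAL_FIXES:
        return MANUAL_FIXES[token]      # manual correction and update
    return toy_speller(token)           # automatic spell correction

[fix_token(t) for t in ["tiredd", "idk", "alone"]]
# -> ['tired', 'know', 'alone']
```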
After the text is cleaned, we organize it into a Document-Term Matrix (DTM) using scikit-learn's CountVectorizer. In the DTM, every row represents a different document (post) and every column represents a different word.
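The DTM structure can be built by hand as below; this pure-Python sketch stands in for scikit-learn's CountVectorizer, and the two posts are illustrative.

```python
from collections import Counter

# Rows = posts, columns = vocabulary words, cells = word counts.
posts = [["feel", "alone", "feel"], ["alone", "tired"]]
vocab = sorted({t for post in posts for t in post})   # ['alone', 'feel', 'tired']
dtm = [[Counter(post)[word] for word in vocab] for post in posts]
# dtm -> [[1, 2, 0],
#         [1, 0, 1]]
```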
After that, we save the cleaned data and the document-term matrix as pickle files.
Data Exploratory Analysis
● Post’s Length
Figure 2 shows that the length of a post can range from 2 tokens to over 800 tokens, and that 94% of the posts are shorter than 200 tokens.
● Part-of-Speech (POS) tagging Frequency
Figure 3 shows the frequency of POS tags used in the suicidal notes; the top five POS tags are Nouns, followed by Verbs, Adjectives, Adverbs, and Interjections.
Figure 4 is a word cloud of the top five POS tags. People are most likely to express themselves in nouns; noun terms act as keywords or hints of the problems they faced, such as "friend", "family", "life", "thing", "school", etc. Verbs convey the feelings and thoughts they expressed, such as the swear word "fuck", "tired", "kill", "trying", etc.
● Wordcloud
Uni-gram
Bi-gram
Emotion Detection
Emotion detection is a subfield of sentiment analysis (SA). SA's core intent is to analyze polarity (positive, negative, or neutral), whereas emotion detection seeks to extract finer-grained emotions. It is important to identify actual emotions rather than just sentiment polarity: for example, "I am crying tonight" (sadness) and "I am furious, I shall have my revenge" (anger) are both classified as negative polarity, yet the two messages convey very different feelings.
Robert Plutchik was a psychologist who studied emotions; he created Plutchik's wheel of emotions, which classifies primary emotions into eight clearly distinguishable elements: joy, anger, sadness, fear, disgust, surprise, anticipation, and trust.
In this article, we use the National Research Council Canada (NRC) Word-Emotion Association Lexicon to identify, from the notes, the emotions people express when they have suicidal thoughts.
Figure 7 shows the percentage of post emotions by count. From the pie chart, we can observe that suicide attempters are predominantly in "Sadness" and "Fear" when expressing a suicidal post; together these take up 74% of the posts. "Anger" accounts for 17% and "Disgust" for 3%; surprisingly, "Surprise", at 6%, outweighs "Disgust".
Figure 8 shows the percentage of post emotions by score. It can be observed that the ranking of emotions stays the same, but the percentages differ: "Sadness" and "Fear" become even more dominant than in the by-count view.
This observation shows that emotions are complex and subtle. Every word can carry a different proportion of each emotion, and all of them contribute to the overall semantics of the text.
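The difference between labelling by count and by score can be illustrated with a toy lexicon. The words and intensity values below are invented stand-ins for the NRC lexicon entries; the point is that one intense word can outweigh several weak ones.

```python
from collections import Counter

# Toy lexicon standing in for the NRC Word-Emotion Association Lexicon;
# the intensity values are invented purely for illustration.
LEXICON = {
    "sad":  {"sadness": 0.2},
    "kill": {"fear": 0.9, "anger": 0.5},
}

def label_post(tokens):
    """Return the post's dominant emotion (by count, by score)."""
    by_count, by_score = Counter(), Counter()
    for t in tokens:
        for emotion, intensity in LEXICON.get(t, {}).items():
            by_count[emotion] += 1           # each matching word counts once
            by_score[emotion] += intensity   # weighted by emotion intensity
    return by_count.most_common(1)[0][0], by_score.most_common(1)[0][0]

label_post(["sad", "sad", "sad", "kill"])
# by count, "sadness" dominates (3 hits vs 1); by score, the single
# intense "kill" (0.9) outweighs the three weak "sad" (3 x 0.2 = 0.6),
# so the post is labelled "fear" -> ('sadness', 'fear')
```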
Topic Modeling
Topic modelling is an unsupervised machine learning technique that scans a set of documents, detects word and phrase patterns within them, and automatically clusters word groups and similar expressions that best characterize the documents. Building on the POS tagging approach, we tried several variants to find the best topic model. We use the Gensim Latent Dirichlet Allocation (LDA) Python library for topic modelling. We first transpose the document-term matrix into a term-document matrix, then convert it into Gensim's format to create a Gensim sparse matrix and a Gensim corpus. Gensim also requires a dictionary of all terms and their respective locations in the term-document matrix.

Several parameters can be specified when training a topic model, including the number of topics, alpha, beta, and the number of passes. As topic modelling is unsupervised, it is difficult to determine which model to choose. For each corpus (Full Text, Nouns, Nouns + Adjectives, Nouns + Verbs), a base model is trained with 5 topics, 50 passes, and default alpha and beta. The base model is then evaluated using the topic coherence measurement. Next, using 2 passes, the best combination of number of topics, alpha, and beta is searched for. The best model for each corpus is then retrained with 100 passes and evaluated again, and the overall best of the 4 models is chosen. Finally, each post is labelled with a topic using the chosen model.
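The corpus preparation can be sketched in pure Python: the document-term matrix becomes a bag-of-words corpus (lists of (term_id, count) pairs) plus an id-to-word dictionary, the two objects Gensim's LdaModel expects. The vocabulary and counts below are illustrative, and no Gensim call is executed here.

```python
vocab = ["alone", "feel", "tired"]
dtm = [[1, 2, 0],        # post 0: "alone" x1, "feel" x2
       [1, 0, 1]]        # post 1: "alone" x1, "tired" x1

id2word = dict(enumerate(vocab))      # {0: 'alone', 1: 'feel', 2: 'tired'}
corpus = [[(j, c) for j, c in enumerate(row) if c > 0] for row in dtm]
# corpus -> [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]

# An LDA model would then be trained with something like (note that Gensim
# names the beta prior "eta"):
# lda = gensim.models.LdaModel(corpus, id2word=id2word, num_topics=5,
#                              passes=50, alpha="symmetric", eta=0.01)
```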
Figure 9 shows a comparison among the different topic models. Each corpus is generated by filtering tokens with specific POS tags, and tuning is done on the number of topics, alpha, and beta to determine the best combination for that corpus. The highest coherence score, 0.7910, is achieved by the full-text corpus with 3 topics, symmetric alpha, and a beta of 0.01.
From the table, it can be seen that grid-search tuning does improve the coherence score of the topic model, but mostly by less than 0.05.
From Figure 10, topic 0 and topic 2 take up 92% of the posts. This can also be related to the highly similar word factors of topic 0 and topic 1. It is possible that there are really only 2 topics in the posts, but a 2-topic model was not considered in the grid-search tuning.
Relating Topics to Emotions
From Figure 11 and Figure 12, we can observe a slight difference in the proportion of emotions in each topic between the by-count and by-score methods. Labelling emotions by score shows higher sensitivity to topic differences than labelling by count. Notably, although topic 1 occupies only 9% of the suicidal posts, its "Anger" proportion is about 10% higher than in topic 0 and topic 2.
Conclusion
This article presents comprehensive methods and analysis of suicidal ideation. We observed that when people have suicidal thoughts, the majority are in a state of "Sadness" and "Fear" compared with the other negative emotions "Anger", "Disgust", or "Surprise". From the exploratory analysis, the high-frequency words "feel", "life", "people", "friend", and "family" could be keys to the abrupt changes and unbearable problems they could no longer stand. Three topics with unique keywords can be distinguished through topic modelling. The dominant emotions per topic are still "Sadness" and "Fear", which occupy the highest shares; however, "Anger" in topic 1 is about 10% greater than in topic 0 and topic 2.