Predicting Suicidality Using Natural Language Processing Concepts as Linear Regression Features


A research study conducted by the CDC, examining data from 1999 to 2016, found that suicide rarely has a single cause and that more than half of those who died by suicide had not been diagnosed with a mental health disorder or condition.1 This finding helped cement the fact that suicide is not tied to a single risk factor, which makes predicting this behavior incredibly difficult.

Adding to the difficulty of predicting suicidal behavior is the fact that, by many estimates, nearly eighty percent of a patient's medical record is in the form of unstructured data.2 Unstructured data contains vital information for diagnosing suicidal behaviors; it was important enough for a physician to dedicate the time to compose. However, this data becomes essentially inaccessible to future physicians due to the overwhelming volume of unstructured data, the limited time a physician has with each patient, and the lack of proper search mechanisms aimed at unlocking the hidden data in unstructured fields.2

Using Natural Language Processing (NLP) alongside Machine Learning (ML), specifically Linear Regression, makes possible an early-warning metric that can be calculated for each patient in the United States and distributed to clinicians and practitioners to help identify patients who may be susceptible to suicidality.

A bottom-up NLP process reveals hidden and overlooked information contained in any unstructured dataset. The goal is to extract key concepts, phrases, and terms and feed them into robust Linear Regression algorithms that can aid in the prediction of suicide or any similar ailment. As a proof of concept, posts and comments were analyzed across a variety of subreddits frequented by suicidal users of Reddit (Redditors) seeking help from their fellow Redditors.

Redditors who posted on the identified subreddits were included in the training dataset based on their posts or comments. The algorithm generated a Suicidal Index Rating (SIR) between 0 and 1 for each user, where 1 means extremely likely to attempt or commit suicide and 0 means extremely unlikely to do so within a given prediction period, which can be customized in the algorithm.

Problem Statement

In 2016 alone, nearly 45,000 Americans above the age of ten died by suicide, making suicide the tenth leading cause of death in the United States.3 Of the top ten causes of death, only three are on the rise, suicide being one of them. As a result of these staggering numbers, significant research and studies are being conducted to combat this rising epidemic.

In 2018, the U.S. Department of Veterans Affairs released a statistic3 stating that nearly twenty-two service members, active-duty and veterans combined, die by suicide daily. That amounts to approximately 7,500 service men and women lost annually. Non-standard solutions in the form of technology are needed to aid in this fight.

While conducting informal interviews with active clinicians, one issue nearly all of them expressed was not having enough time with each patient to catch red flags that may exist in the medical record. Technology has failed both these patients and their clinicians. The common complaint was that the data is there; however, it is not accessible promptly. Most physicians interviewed stated they have between five and fifteen minutes with each patient, with as little as five minutes of that spent looking at the patient's medical data.

Technology has, for the most part, successfully replaced the pen-and-paper charts of the past; however, the front-end search mechanisms attached to the databases driving Electronic Health Records (EHRs) have not kept pace. An EHR is not useful if the data it contains cannot be accessed by a practitioner in a matter of seconds. In the same way Google has made the web accessible, we must create similar applications, both web-based and mobile, for patient data.


Considering the percentage of information that is unstructured in a patient's medical record, the need for Natural Language Processing (NLP) becomes immediately apparent. All NLP analysis in this paper was done using Python and the Natural Language Toolkit (NLTK) offered at nltk.org4. All data used is from open-source platforms. The data has been de-identified in all ways possible, including replacing usernames with enumerated values, replacing email addresses with "@Email", replacing phone numbers with "###-###-####", replacing /u/SomeUser references with "@UserReference", and other similar approaches that do not change the context of the text.
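In Python, these de-identification passes can be sketched with a few regular expressions from the standard library (the `deidentify` helper and its exact patterns are illustrative, not the precise rules used in this study):

```python
import re

def deidentify(text: str) -> str:
    """Replace identifying tokens with neutral placeholders without altering context."""
    text = re.sub(r"/u/\w+", "@UserReference", text)                   # Reddit user references
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "@Email", text)      # email addresses
    text = re.sub(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b", "###-###-####", text)  # phone numbers
    return text

sample = "Contact /u/throwaway99 at sad@example.com or 555-123-4567."
print(deidentify(sample))
# Contact @UserReference at @Email or ###-###-####.
```

Replacing with fixed placeholders rather than deleting keeps sentence structure intact, so downstream tokenization and POS tagging are unaffected.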

Another important note: the data should not otherwise be cleaned in any way, shape, or form, as cleaning can change the intent behind the text. A robust bottom-up algorithm can handle poor punctuation, formatting, and misspellings; even simple substitutions can have dramatic effects on the outcome.

The first step is to break a text into paragraphs, then into sentences, and ultimately into words with their associated POS tags using tokenization. In my experience, breaking a text into paragraphs is relatively simple and can be done using the newline character(s) specific to the operating system. Tokenizing paragraphs into sentences and sentences into words is much more complicated due to poor grammar, punctuation, and slang. Generally, the Punkt Sentence Tokenizer works efficiently when splitting subtext into sentences, and the built-in word tokenizer works well for splitting sentences into words, although for the best results it is recommended to build custom tokenization routines tailored to the specific data being analyzed.

It is at this point that the approach can vary widely based on the text being analyzed. Social media posts are more likely to contain slang and jargon than a note or comment in a medical file; thus, a wide variety of approaches can and must be taken to extract as much meaningful information as possible from the text. A good starting point for any text is to extract noun phrases using an n-gram in the form of a regular expression. Each word has a POS tag associated with it, which makes regular expressions a rapid and efficient method for analyzing text and extracting concepts. Ultimately, each extracted concept will act as a feature in the Linear Regression algorithm.

The goal of any regular expression should be to capture two- to three-word phrases that are as clear and concise as possible, with as few filler words as possible. Slang and jargon are acceptable and can add a considerable amount of precision if appropriately captured. As concepts are captured, a table should be created along with a count of how many times each concept has been identified in the given text. A dictionary is suggested for this process, where the key is the extracted concept and the value is the number of times the concept has been identified. Dictionaries also allow simple logic to ensure values are added or updated as appropriate without losing much performance. The occurrence counts can be used if you wish to apply weights to features.
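This extract-and-count step can be sketched with NLTK's `RegexpParser`, a basic noun-phrase grammar, and a dictionary counter (the grammar and the hand-tagged example sentence are illustrative only; tagging itself is covered above):

```python
from collections import defaultdict
import nltk

# A simple noun-phrase grammar: optional determiner, any adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

# POS-tagged tokens, supplied by hand for this example.
tagged = [("I", "PRP"), ("have", "VBP"), ("dark", "JJ"), ("thoughts", "NNS"),
          ("about", "IN"), ("my", "PRP$"), ("father", "NN"), ("every", "DT"),
          ("night", "NN")]

concept_counts = defaultdict(int)
for subtree in chunker.parse(tagged).subtrees(filter=lambda t: t.label() == "NP"):
    concept = " ".join(word for word, tag in subtree.leaves()).lower()
    concept_counts[concept] += 1

print(dict(concept_counts))
# {'dark thoughts': 1, 'father': 1, 'every night': 1}
```

A `defaultdict(int)` gives exactly the add-or-update behavior described above without an explicit existence check.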

The final column added to the dataset indicates whether the concept was found in text associated with a suicidal record. This column is used as the forecast column and should be a float between 0 and 1, with 0 being a completely non-suicidal record, 1 a confirmed suicidal record, and 0.5 a neutral or unknown value. In the testing done here, all training-set records were assigned a value of 1, with smaller values assigned to training data that came from non-suicide-related subreddits.

Before a concept can be used as a feature, it must be enumerated and assigned a numeric value. Features, in my approach, must consist of numeric values only, and enumerating each concept has no impact on the results provided the algorithm does not scale the values. Scaling can easily be turned on and off in most Linear Regression implementations and should be disabled here to avoid improper training. The enumeration of values must remain consistent in the adopted approach so the results can be replicated.
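One way to keep the enumeration consistent is a first-seen-order ID table that is persisted alongside the trained model (a sketch; the `enumerate_concept` helper and the sample concepts are hypothetical):

```python
# Map each extracted concept to a stable integer ID so it can serve as a numeric
# feature. IDs are assigned in first-seen order; persisting this table with the
# model guarantees the same concept always maps to the same value across runs.
concept_ids = {}

def enumerate_concept(concept: str) -> int:
    if concept not in concept_ids:
        concept_ids[concept] = len(concept_ids)
    return concept_ids[concept]

for concept in ["dark thoughts", "father", "dark thoughts"]:
    enumerate_concept(concept)

print(concept_ids)  # {'dark thoughts': 0, 'father': 1}
```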

A training dataset can come from a range of sources, including the patient's medical profile, psychological exams, social networks like Twitter, Reddit, and Facebook, and even chat logs from support services like the Veterans Crisis Line (VCL). The goal should be to have as many potential data sources as possible and let the algorithm decide based on the historical patterns and trends it identifies. It is important to note that the training set should consist of individuals who have either attempted or committed suicide and who share roughly the same demographics; the algorithm will only be as good as the data used in the training set. A different set of n-grams should be used for each data source and will require tuning as trends develop in society. It may also be necessary to generate several regression lines, one for each dataset, and then use the combined results from each regression to calculate the overall SIR value.
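One simple way to combine several per-source regression outputs into a single SIR value is to clip each prediction into the 0-1 range and average them (purely a sketch under that assumption; the combination rule shown is not necessarily the one used in this study):

```python
def combined_sir(predictions):
    """Clip each per-source regression output into [0, 1] and average them."""
    clipped = [min(1.0, max(0.0, p)) for p in predictions]
    return sum(clipped) / len(clipped)

# Hypothetical outputs from three per-source regressions (Reddit, medical notes,
# chat logs); raw regression output can fall outside [0, 1], hence the clipping.
print(round(combined_sir([0.8, 1.2, 0.6]), 3))  # 0.8
```

A weighted average, with weights reflecting how much each source is trusted, would be a natural refinement.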

Additionally, the larger the training set, the more accurate the algorithm will become. The algorithm will continue to learn as more data becomes available over time, and more accurate training sets are created. The algorithm should be retrained as frequently as possible to ensure the newest data points are calculated into predictions.

A pickle file is created with the most recent training dataset results so it can quickly be loaded and retrieved when predictions are needed. This dramatically reduces the overall time necessary to create a prediction as the entire training process is not necessary each time a prediction is calculated. It is necessary to generate new pickle files as new training data becomes available.
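The save-and-load cycle looks like the following sketch, which assumes scikit-learn's `LinearRegression` and toy feature data (the paper does not name a specific regression library, so both are assumptions):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy feature matrix: each row holds enumerated-concept features for one record;
# y is the forecast column (0 = non-suicidal, 1 = confirmed). Values are hypothetical.
X = np.array([[0.0, 3.0], [1.0, 0.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([0.1, 0.9, 0.5, 0.2])

model = LinearRegression().fit(X, y)

# Persist the trained model so predictions do not require retraining each time.
path = os.path.join(tempfile.gettempdir(), "sir_model.pickle")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# At prediction time, load the pickled model instead of retraining.
with open(path, "rb") as fh:
    loaded = pickle.load(fh)

print(np.allclose(model.predict(X), loaded.predict(X)))  # True
```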

Ultimately, the goal is to provide an SIR value in the range 0 to 1, where 0 indicates no risk of suicide and 1 indicates an imminent risk that should be acted upon immediately by family, friends, or medical professionals, based on the identified training sets. The value is derived from the accuracy score generated by the Linear Regression algorithm and represents how confident the algorithm is given the input features. For the purposes of this paper, the algorithm was set to predict 30 days into the future, although that could easily be scaled with a larger dataset.


Using Reddit data ranging from 2010 to 2017, including all posts and comments, a Linear Regression algorithm was constructed using a best-fit approach that generates an SIR score for any user on the platform based on their post and comment history. Subreddits dedicated to helping those with suicidal thoughts and tendencies were gathered, and identifying features in the training set were replaced with the values discussed above to ensure anonymity. It was assumed that people who posted to the identified subreddits were, in fact, suicidal and thus were good candidates for the training dataset; these high-risk individuals were given a score of 1 in the forecast column. Users identified as moderators or regular contributors of a subreddit were excluded to avoid skewing the training set data.

Additional posts and comments were added to the training set from a wide range of subreddits that were low and medium risk. These posts and comments were given forecast values ranging from 0 to 0.95 depending on the source and type of content being discussed on the subreddit. It is important to give the training set as wide a range of data as possible to aid in the classification process.

The training set consisted of nearly one million posts and comments, and at the end of training the algorithm was able to consistently produce SIR values above 0.87 given non-training data from 2017 to 2019 from subreddits like /r/SuicideWatch, /r/Depression, and /r/SuicidePrevention. To further test the SIR value produced by this algorithm, a sample of nearly one million users was selected from subreddits on a range of topics chosen to represent non-suicidal risks. SIR values averaged 0.14 using comments and posts from low-risk subreddits like /r/Aww, /r/AdviceAnimals, and /r/funny.

Several different n-gram regular expressions were tested and implemented, ranging from ones that extract simple noun phrases to custom-tailored, sophisticated n-grams that pick out single words or entire sentences. Rather than relying on a single regular expression extracting only noun phrases, other n-gram regular expressions were used in conjunction, which helped improve scores and generate a more accurate SIR value. In testing, proper noun phrases in particular were prominent and increased the SIR value tremendously when appropriately captured. Each n-gram was used as an independent feature in the training set, which currently consists of six different n-gram regular expressions converted to numeric values and used as features.
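As a sketch of several chunk rules acting as independent feature extractors (the three rules below are illustrative, not the six actually used here), note that proper-noun phrases are chunked first so the broader noun-phrase rule cannot absorb them:

```python
import nltk

# Three chunk stages, applied in order; each label can feed its own feature.
grammar = r"""
  PNP: {<NNP>+}              # proper noun phrases, chunked first
  NP:  {<JJ>*<NN|NNS>+}      # plain noun phrases
  VP:  {<VB.*><RB>?}         # simple verb phrases
"""
chunker = nltk.RegexpParser(grammar)

tagged = [("John", "NNP"), ("cried", "VBD"), ("alone", "RB"),
          ("last", "JJ"), ("night", "NN")]
tree = chunker.parse(tagged)
labels = [t.label() for t in tree.subtrees() if t.label() != "S"]
print(labels)  # ['PNP', 'VP', 'NP']
```

Counting chunks per label, or enumerating each chunk's text, then yields one feature column per rule.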

Taking this technology a step further, it could become an opt-in website and mobile application that allows people to join the service, link their social profiles and other datasets, and have real-time alerts based on the monitored person's activity sent to them, their family members, and their healthcare physicians. I imagine a world where a physician or law enforcement officer could receive real-time alerts based on a patient's activity and act immediately to prevent a rash action or decision from becoming fatal.

Known Issues

The specific algorithms and POS regular expressions can become extremely complicated and must be tuned for each data source. As previously noted, a social network must account for slang and jargon and is much more likely to contain poorly formatted text. Missing punctuation can also become an issue, as sentences and words begin to run together, which can introduce inconsistent concepts; however, a weighting method can be used to tune this out of the algorithm.

There are many ways to communicate non-textually online; emojis are a great example. They are very difficult to capture in a natural language processing algorithm since they are represented in many different ways. Emojis are an excellent source of input, although they were excluded from the results above because of the inconsistencies they introduce.

A significant sample size of known suicidal patients must be obtained and confirmed before this technology can be rolled out in any substantial manner. Social media datasets alone are not enough to confirm whether a person is genuinely suicidal, although they can serve as members of the training dataset. Many assumptions are made based on the source of the data being analyzed, and those assumptions could introduce inconsistencies in the features and concepts.

The Reddit data used to train the Linear Regression algorithm is inherently age-biased: the average age of a Reddit user is reported to be between 18 and 29. Thus, Redditors' speech patterns and concepts will not apply well to a Vietnam veteran. Additionally, the top concepts from the Reddit dataset included terms like school, mother, father, and bullying; these are not the concepts we would find in a veteran's profile if we applied the same NLP techniques to their data.

Allowing an algorithm access to social media accounts, medical history, and other aspects of a person's life can seem intrusive. Privacy is crucial, and the wishes of the patient must be considered. An easy-to-use opt-out policy must be in place and properly enforced to ensure no violations of privacy occur.


With this technology, patients, physicians, and even family members could help monitor those around them via an easy-to-use web application or mobile app. Patients could "link" their social media profiles, and physicians could input important medical notes and comments, all of which could be used to generate and track a score over time. Given enough time and data, the SIR value could help reduce the suicide rate dramatically, especially within the U.S. Department of Veterans Affairs.

Given the limited scope of data available for this paper, I believe more research should be conducted by research groups and organizations like the VA to formalize the SIR value and incorporate it into a patient’s health record. Additional research should be focused on tuning n-gram regular expressions and exposing those algorithms to more accurate and research-approved data.

The VA Text Integration Utilities (TIU) would be an ideal dataset since it is vast, accessible, associated with individual patients, and able to make an immediate impact on how we treat veterans struggling with psychological and emotional disorders. An accurate SIR value could be the turning point in the battle against veteran suicide.

With the right amount of data and tuning effort, the SIR value could be applied to any ailment, particularly psychological ailments like Post-Traumatic Stress Disorder (PTSD), or to those with severe brain injuries or histories of concussion. In any of these situations, having a friend, family member, or physician alerted to a person's status in real time could be the deciding factor between life and death.

These extracted concepts could also be used to help identify trends among suicidal veterans. Concepts that are identified more frequently could be used to fast-track treatment for others who may have been exposed to similar situations in combat or in life. Distributing commonly extracted concepts from suicidal veterans' TIU notes to clinicians could also prove invaluable, as the concepts would be data-driven solutions to a real problem.


The information in this paper is not meant to imply that a skilled clinician or practitioner can be replaced or supplanted. The intended purpose of this prediction technology is to aid skilled clinicians and practitioners in their decision-making. The goal is to bring organization to disorganized information and present it in a manner that can be quickly interpreted with little to no training. This technology should be used as an aid only, not a replacement.


Parts of Speech (POS) refers to the grammatical categories of the words in a sentence: nouns, pronouns, adjectives, verbs, and so on. Each is represented by a one- to three-letter tag.6

POS Tagging is the act of tagging each word with the part of speech it represents. Think back to your first-grade English classes, where you had to draw a shape around each word in a sentence and identify whether it was a noun, verb, etc.6

Tokens or Tokenization refers to the act of splitting text into smaller subtexts: by page, paragraph, sentence, or even individual word.4

Chunking is the process of extracting a phrase from a given token based on a regular expression. For obvious reasons, the token must be a sentence, not an individual word.4,5

N-Gram is a contiguous sequence of N items from a given sample of text. In my implementation, regular expressions identify these contiguous sequences using the POS tag attached to each word.

Pickle is a serialized file generated after training an algorithm; it saves the training results so they can be loaded quickly without retraining the algorithm.

Linear Regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables, referred to as features. In this work, multiple linear regression is used since there are multiple features.

Bottom-Up NLP is parsing and recognition of a given text that builds progressively larger pieces of structure from its smallest parts; essentially, bottom-up NLP moves from concrete, low-level information to abstract, high-level information.


[1] CDC. Suicide rates rising across the U.S. 7 June 2018.

[2] Pak, Hon S., MD, MBA, Chief Medical Officer, 3M HIS. Unstructured Data in Healthcare. n.d.

[3] Wentling, Nikki. VA reveals its veteran suicide statistic included active-duty troops. 20 June 2018. Accessed 10 May 2019.

[4] NLTK. Natural Language Toolkit. 11 January 2019.

[5] D'Souza, Jocelyn. Learning POS Tagging & Chunking in NLP. 4 April 2018.

[6] Penn State. Penn Treebank Project. n.d.
