Relevance Words Scoring using TextRank
TextRank is a model for text processing that can be used to find the words relevances in text. This model is proposed by Rada Mihalcea and Paul Tarau on 2004 in a paper titled, “TextRank: Bringing Order into Texts”. As the inspiration of PageRank, TextRank also uses the graph model as its base.
In the paper, Mihalcea and Tarau represent that TextRank not only usable for extracting the most relevance words in text (they call them as keyword extraction), but also applicable for extracting most influence sentences in text (also, they call them sentence extraction). As for sentence extraction, TextRank has been many used in text summarization. But in this article we just dive on the keyword extraction.
As mentioned earlier, TextRank uses graph as its model structure to find the association between words. For words relevance, the vertices of the graph, as you have thought, are the words and the edges are the connection between the respective words. The basic idea of this model is the co-occurrence between two words in defined window. The model look for with whom one word connect with.
For the complete process is:
- Preprocessing: like other text processing, to get the result as optimal as possible, we need to preprocess the text. It includes:
- Filtering, filter the text so the text will only have the common characters. In python, string.printable will shows you what the characters are.
- Tokenize the text, split the text into unit of words
- Lemmatization, normalize the word into its base form
- Part-of-Speech (POS) tagger based filtering, tag the each words with its POS tag, then filter the words which is not an adjective, noun, or gerund. Of course it also could be custom, just did the best fit for your problem.
- Stopword removal, in the previous step we have removed words that most likely didn’t describe what the real content is. In this step, we will remove the complete stopwords.
This preprocessing list is not a ‘must-do-list’. Some steps may not be required for some problems and may be another steps besides the list is required. Just do the best fit for your problem by making some experiments. After all, just like John Backus (the inventor of Fortran) said, “You need the willingness to fail all the time. You have to generate many ideas and then you have to work very hard only to discover that they don’t work. And you keep doing that over and over until you find one that does work.”
- Model Building, after making ruckus with ‘Preprocessing’, we are now arrive in the core of TextRank, Graph Building. In the paper, Mihalcea and Tarau introduce the undirected and weighted graph. Each ‘descriptive word’ become the vertex of the graph and the connection between the words within a window of N words become the edges. After the graph is constructed, we initialize all weights of the edges by 1. That’s our kick-off graph!
- Relevance Scoring, with the model has been constructed, we are ready to train it to get us the relevance words in the text. TextRank also uses the same scoring method with PageRank, but with modification. The definition is:
Si = (1 – d) + d * (summation of ((Wij/Inoutj) * Sj))
Si is the score for vertex i. d is damping factor. It is say to be a probability of a user to click on links at random and is usually set to 0.85 according to Larry Page and Sergey Brin. Wij is the weight of the edge between vertex i and vertex j. Inoutj is the number of connected vertices with vertex j. Sj is the score of vertex j. The scores are updated iteratively, until the difference of current scores and previous scores is less than the threshold of 0.0001 (convergence). And here is the optional step. If you want to have the relevance score in percent, then just apply the normalize score at below:
NSi = (Si – min(S)) / (max(S) – min(S)) * 100
NSi is normalized score for vertex i. min(s) is the minimal value of list of scores S and max(S) is the maximal value of the list.
For the example case, I got an Indonesian news article. It is about the Indonesians worried to terrorism since the Surabaya bombings on May 2018. The text is:
Indonesians have become more concerned about terrorism since the Surabaya bombings, according to a recent survey. The Indonesian Survey Institute (LSI) found that 82 percent of Indonesians, regardless of gender, religion, income, education and political affiliation, are more concerned about terrorism as a result of the Surabaya suicide bombings in May. “We’ve witnessed a new development in terrorism where women and children are now involved [in terrorist attacks],” said LSI researcher Ardian Sopa on Tuesday in Jakarta. The Surabaya bombings shocked the nation as 10 minors were used in a string of bomb attacks. Only three of them survived. The survey, which interviewed 1,200 respondents above the voting age, also indicates that most Indonesians favor stronger antiterrorist measures and support the new Terrorism Law, which is described by some human rights groups as draconian and prone to abuse. More than 76 percent of respondents approved longer detention periods for suspected terrorists as stipulated in the new law, according to the survey. Furthermore, 70 percent of respondents also supported increasing the annual budget for the police in order to crack down on terrorism. However, the survey also showed that 53 percent of respondents felt that civil society, including the media, NGOs and religious organizations, had fallen short of raising awareness about terrorism. “The police have been quite successful in curbing terrorism but the people also want civil society to better raise awareness to support the police,” Ardian said.
And then, I applied the preprocessing above. Again, it’s not a must, if you have some other cool preprocessing-list just go ahead. The appearance of the text now is:
[‘indonesian’, ‘concerned’, ‘terrorism’, ‘surabaya’, ‘bombing’, ‘survey’, ‘indonesian’, ‘survey’, ‘institute’, ‘lsi’, ‘percent’, ‘indonesian’, ‘gender’, ‘religion’, ‘income’, ‘education’, ‘political’, ‘affiliation’, ‘concerned’, ‘terrorism’, ‘result’, ‘surabaya’, ‘suicide’, ‘bombing’, ‘may.weve’, ‘development’, ‘terrorism’, ‘woman’, ‘child’, ‘involved’, ‘terrorist’, ‘attack’, ‘lsi’, ‘researcher’, ‘ardian’, ‘sopa’, ‘tuesday’, ‘jakarta’, ‘surabaya’, ‘bombing’, ‘nation’, ‘minor’, ‘string’, ‘bomb’, ‘attack’, ‘survey’, ‘respondent’, ‘voting’, ‘age’, ‘indonesian’, ‘favor’, ‘strong’, ‘antiterrorist’, ‘measure’, ‘terrorism’, ‘law’, ‘human’, ‘group’, ‘draconian’, ‘prone’, ‘percent’, ‘respondent’, ‘long’, ‘detention’, ‘period’, ‘suspected’, ‘terrorist’, ‘law’, ‘survey’, ‘percent’, ‘respondent’, ‘increasing’, ‘annual’, ‘budget’, ‘police’, ‘order’, ‘terrorism’, ‘survey’, ‘percent’, ‘respondent’, ‘civil’, ‘society’, ‘including’, ‘medium’, ‘ngo’, ‘religious’, ‘organization’, ‘short’, ‘raising’, ‘awareness’, ‘terrorism’, ‘police’, ‘successful’, ‘curbing’, ‘terrorism’, ‘people’, ‘civil’, ‘society’, ‘good’, ‘raise’, ‘awareness’, ‘police’, ‘ardian’]
Maybe ‘text’ is not quite right term right now, because the ‘text’ has been chunk into its unit form. But, it still the text we know. It has been peeled here and there, but it still maintains its word sequence. It is useful for edges weight definition. And then, the distinct words of list above will be the vocabulary (the vertices).
After constructing the graph and counting the vertices scores, it resulted the top 13 relevance words to the text. They are:
As you can see, the model score the word relevance well. The result is what we really hope that the word ‘terrorism’ is one of the most relevance words in the reported Indonesians worried about Surabaya’s bombings article.
The advantage of this algorithm is the learning just happens in the individual text. It doesn’t need learning in another text. Moreover, the size of the text is not a concern for this model. In the paper, they showed that applying this model on abstract has similar result with on full-text paper.
 Mihalcea, R. & Tarau, P. (2004). TextRank: Bringing Order into Texts. Proceedings 2004 Conference on Empirical Methods in Natural Language Processing.