I hold a Master's degree in Financial and Actuarial Mathematics from the Vienna University of Technology (graduation with highest distinction) and have been working as a Quantitative Credit Risk Manager within an international banking group in Austria for more than four years. In 2020, I joined the National Bank of Austria as an Examiner for IRB credit risk models of significant financial institutions. I have excellent programming and database skills (Python, R, SAS, SQL, VBA) as well as strong statistical and mathematical skills, and I gained hands-on machine learning experience during various data science hackathons (among others, as winner of the 'Coding Challenge on Risk Management' hosted by the European Central Bank).
For this project I scraped news articles from Reuters (see my scraping tutorial here) and clustered them by topic using Latent Dirichlet Allocation (LDA).
Since Reuters does not provide a proper topic classification of its articles, my model helps the reader find articles relating to a specific topic. Furthermore, the model could also be used to classify future articles (see the sketch at the end of this post).
I used the scraper I described here to get all available Euro-Zone news from Reuters.
The data can be downloaded from here.
import logging
logging.captureWarnings(True)
# Data Processing
from collections import Counter
import datetime as dt
from dateutil.relativedelta import relativedelta
import numpy as np
import pandas as pd
# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import pyLDAvis
import pyLDAvis.sklearn
from wordcloud import WordCloud
# NLP
import gensim
from nltk.stem import SnowballStemmer
from nltk.util import ngrams
import spacy
# Modeling
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
In this step I read the data into a pandas DataFrame, drop missing values, and convert dates stored as strings to actual dates.
def read_data_from_csv(path_to_data, sep=";"):
df = pd.read_csv(path_to_data, sep=sep)
print(f"shape of df: {df.shape}")
print(f"columns in df: {list(df.columns)}")
return df
def drop_rows_with_missing_values(df):
rows_orig = df.shape[0]
df = df.dropna().reset_index(drop=True)
print(f"{rows_orig-df.shape[0]} rows dropped")
print(f"new shape of df: {df.shape}")
return df
def convert_date_ReutersNews(df, date_col):
df = df[df.scraper == "reuters_news"].copy()
month_map = {
"January":1, "February":2, "March":3,
"April":4, "May":5, "June":6,
"July":7, "August":8, "September":9,
"October":10, "November":11, "December":12
}
regex_date = r"([A-Z]\w+) (\d+), (\d+)"
mdy = df[date_col].str.extract(regex_date)
df["month"] = mdy[0].map(month_map)
df["day"] = mdy[1]
df["year"] = mdy[2]
df[date_col] = pd.to_datetime(df[['year', 'month', 'day']])
df = df.drop(['year', 'month', 'day'], axis=1).reset_index(drop=True)
return df
def convert_dates(df, date_col):
df_Reuters = convert_date_ReutersNews(df, date_col)
df_full = pd.concat([df_Reuters])
print(f"shape of df: {df_full.shape}")
print(f"columns in df: {list(df_full.columns)}")
return df_full
path_data = "../data/20200329_reuters_news_euro_zone.csv"
df = read_data_from_csv(path_data)
shape of df: (9773, 5)
columns in df: ['scraper', 'url', 'date', 'title', 'text']
df[["date", "title", "text"]].head(3)
| | date | title | text |
|---|---|---|---|
| 0 | July 24, 2018 / 8:08 PM / in 3 months | Reuters Leadership | Jane Barrett Rob Cox... |
| 1 | March 28, 2020 / 4:15 PM / Updated 14 hours ago | Coronavirus death toll in Italy's Lombardy ris... | MILAN (Reuters) - The death toll from an outbr... |
| 2 | March 28, 2020 / 10:01 PM / Updated 10 hours ago | Northern Ireland brings in tough measures to f... | BELFAST (Reuters) - Northern Ireland will intr... |
I decided to drop all rows with missing values.
df.isnull().sum()
scraper 0
url 0
date 54
title 54
text 1
dtype: int64
df = drop_rows_with_missing_values(df)
55 rows dropped
new shape of df: (9718, 5)
The scraped date has the following form: ‘Month DD, YYYY …’. Since this can’t be properly used as a date, I used RegEx to parse the date from these strings.
df.date.iloc[0]
'July 24, 2018 / 8:08 PM / in 3 months'
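As a quick illustration (not part of the pipeline), applying the regular expression from convert_date_ReutersNews to the string shown above extracts the month, day and year groups:
import re
re.search(r"([A-Z]\w+) (\d+), (\d+)", "July 24, 2018 / 8:08 PM / in 3 months").groups()
('July', '24', '2018')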
df = convert_dates(df, "date")
shape of df: (9718, 5)
columns in df: ['scraper', 'url', 'date', 'title', 'text']
df.date.iloc[0]
Timestamp('2018-07-24 00:00:00')
In this step I looked at the number of published articles over time and further cleaned the data based on the graphical results. This step is mainly necessary because some of the older news articles are no longer available online.
def keep_only_articles_published_after(df, date_col, date_min):
rows_orig = df.shape[0]
df = df[df[date_col] >= date_min].copy().reset_index(drop=True)
print(f"{rows_orig-df.shape[0]} rows dropped")
print(f"new shape of df: {df.shape}")
return df
df_plot = df.date.value_counts().sort_index().reset_index()
df_plot.columns = ["date", "articles"]
fig = plt.figure(figsize=(20,3))
plt.hist(df_plot.date, bins=(df_plot.date.max()-df_plot.date.min()).days+1)
plt.title('Dates w/o Publications', fontsize=24)
plt.xlabel('publication date')
plt.gca().axes.get_yaxis().set_visible(False)
plt.show()
This plot clearly shows that some articles published before 09/2018 are no longer available online. The next graphic confirms this.
fig = plt.figure(figsize=(20,3))
plt.plot(df_plot.date, df_plot.articles)
plt.title('Published Articles over Time', fontsize=24)
plt.xlabel('publication date')
plt.ylabel('articles')
plt.show()
I have decided to not consider articles before September 2018 for this project.
date_min = dt.datetime(2018,9,1)
df = keep_only_articles_published_after(df, "date", date_min)
2000 rows dropped
new shape of df: (7718, 5)
After all these steps we have reduced the number of articles by 2,055 (55 rows with missing values plus 2,000 articles published before the cut-off date) and are left with 7,718 articles covering a time period of about 1.5 years.
The following two functions handle feature engineering and text-data cleaning: removing the shortest/longest articles and preprocessing the raw text (tokenizing, lemmatizing/stemming, stop-word removal and n-gram creation).
def remove_shortest_longest_articles(tokens, remove_lowest, remove_highest):
orig_shape = tokens.shape
article_length = tokens.apply(lambda x: len(x))
article_length_vc = article_length.value_counts().sort_index().cumsum()/tokens.shape[0]
remove_len = article_length_vc[(article_length_vc <= remove_lowest) | (1-article_length_vc <= remove_highest)].index
tokens = tokens[~np.isin(article_length, remove_len)] # .reset_index(drop=True)
print(f"{orig_shape[0]-tokens.shape[0]} articles dropped.")
print(f"{tokens.shape[0]} articles remaining for modeling.")
return tokens
def preprocess_text(txt,
spacy_lang_obj,
min_word_length=3,
lemmatize=True,
stem=True,
ngrams_n_list=[2]):
"""
Preprocessing of text/titles
Input:
- txt: string - text/title
- spacy_lang_obj = spacy.load("en_core_web_XX")
- min_word_length: integer - keep only words with length >= min_word_length
- lemmatize: boolean - if True then lemmatize tokens
- stem: boolean - if True then stem tokens
- ngrams_n_list - list of integers, if ngrams not wanted, pass empty list []
"""
# spacy language tokenizer, tagger, parser, ner and word vectors
nlp = spacy_lang_obj
# replace special characters
special_chars = {"ö":"oe", "ä":"ae", "ü":"ue", "ß":"ss"}
for char in special_chars:
txt = txt.replace(char,special_chars[char])
# split text into tokens
tokens = gensim.utils.simple_preprocess(txt, deacc=True)
# remove pre-article information (mostly cities)
ind_reuters = tokens.index("reuters") if "reuters" in tokens else -1
if ind_reuters > 5:
ind_reuters = -1
tokens = tokens[ind_reuters+1:]
# lemmatize tokens using spacy
if lemmatize == True:
full_text = nlp(" ".join(tokens))
tokens = [token.lemma_ for token in full_text if token.pos_ in ['NOUN', 'ADJ', 'VERB', 'ADV']]
# stem tokens using SnowballStemmer from nltk
if stem == True:
stemmer = SnowballStemmer('english')
tokens = [stemmer.stem(token) for token in tokens]
# keep only tokens with length >= min_word_length
tokens = [token for token in tokens if len(token) >= min_word_length]
# remove stop-words from tokens
tokens_wo_stopwords = []
for token in tokens:
if (nlp.vocab[token].is_stop == False) & (~np.isin(token, ["reuters", "reuter"])):
tokens_wo_stopwords.append(token)
# build n-grams
ngram_list = []
if ngrams_n_list != []:
for n in ngrams_n_list:
if n > 1:
for ngram in ngrams(tokens_wo_stopwords, n):
ngram_list.append("_".join(str(i) for i in ngram))
# final list of tokens
tokens_final = tokens_wo_stopwords + ngram_list
return tokens_final
Load spaCy's pretrained statistical model for English.
nlp = spacy.load("en_core_web_lg")
For my model I decided to lemmatize (but not stem) the tokens, keep only tokens with at least three characters, and add bigrams:
tokens_final = df.text.apply(
lambda x: preprocess_text(
txt=x,
spacy_lang_obj=nlp,
min_word_length=3,
lemmatize=True,
stem=False,
ngrams_n_list=[2]
)
)
tokens_final.head(3)
0 [death, toll, outbreak, coronavirus, northern,...
1 [introduce, sweeping, new, power, combat, spre...
2 [team, minister, win, test, coronavirus, sympt...
Name: text, dtype: object
Note that these token counts include the bigrams as well.
tokens_len = tokens_final.apply(lambda x: len(x))
print(f"min. publication length: {min(tokens_len)}")
print(f"max. publication length: {max(tokens_len)}")
min. publication length: 7
max. publication length: 5281
Looking at the distribution of article lengths yields the following (the first two plots exclude articles with more than 1,000 tokens, the third shows only those):
fig = plt.figure(figsize=(8, 4))
tokens_len[tokens_len <= 1000].hist(bins=200)
plt.show()
fig = plt.figure(figsize=(8, 4))
tokens_len[tokens_len <= 100].hist(bins=40)
plt.show()
fig = plt.figure(figsize=(8, 4))
tokens_len[tokens_len > 1000].hist()
plt.show()
Based on the above and further analysis, I decided to exclude the shortest 1% and the longest 1% of articles. This leaves us with 7,580 publications for modeling.
tokens_final = remove_shortest_longest_articles(tokens_final, 0.01, 0.01)
138 articles dropped.
7580 articles remaining for modeling.
We now have a look at the vocabulary we will later pass to the model.
# split vocabulary into uni- and bi-grams
unigrams = [word for words in tokens_final for word in words if not "_" in word]
bigrams = [word for words in tokens_final for word in words if "_" in word]
print(f"total number of words: {len(unigrams)}")
print(f"total number of bigrams: {len(bigrams)}")
total number of words: 1027221
total number of bigrams: 1019657
# unique words and bigrams in our train-dataset
unigrams_unique = sorted(list(set(unigrams)))
bigrams_unique = sorted(list(set(bigrams)))
print(f"unique words: {len(unigrams_unique)}")
print(f"unique bigrams: {len(bigrams_unique)}")
unique words: 17957
unique bigrams: 577945
most_used_unigrams = pd.DataFrame.from_dict(Counter(unigrams), orient="index").reset_index()
most_used_unigrams.columns = ["word", "occurence"]
most_used_unigrams.sort_values(by="occurence", inplace=True)
most_used_unigrams.tail(5)
| | word | occurence |
|---|---|---|
| 1029 | percent | 4685 |
| 135 | tell | 4708 |
| 620 | deal | 5514 |
| 61 | government | 7004 |
| 203 | year | 10035 |
most_used_bigrams = pd.DataFrame.from_dict(Counter(bigrams), orient="index").reset_index()
most_used_bigrams.columns = ["word", "occurence"]
most_used_bigrams.sort_values(by="occurence", inplace=True)
most_used_bigrams.tail(5)
| | word | occurence |
|---|---|---|
| 199 | news_conference | 476 |
| 2370 | central_bank | 588 |
| 3211 | year_old | 656 |
| 8371 | interest_rate | 776 |
| 598 | tell_reporter | 862 |
wordcloud_unigrams = WordCloud(
width=3000,
height=1000,
background_color="white",
min_font_size = 12
).generate(" ".join(unigrams))
fig = plt.figure(figsize=(15, 5))
plt.imshow(wordcloud_unigrams)
plt.axis("off")
plt.tight_layout(pad=0)
corpus = tokens_final.apply(lambda x: " ".join(x))
# only count words that appear in more than 15 articles
# don't count words that appear in more than 75% of articles
cv = CountVectorizer(min_df=15, max_df=0.75)
document_term_matrix = cv.fit_transform(corpus)
lda_baseline = LatentDirichletAllocation(n_jobs=-1).fit(document_term_matrix)
print("------ Baseline Model ------")
# print(f"perplexity: {round(lda_baseline.perplexity(document_term_matrix),2)}") # see comment in 6.3.2
print(f"log_likelihood: {round(lda_baseline.score(document_term_matrix),2)}")
------ Baseline Model ------
log_likelihood: -8421569.0
topic_models_ll = GridSearchCV(
estimator=LatentDirichletAllocation(n_jobs=-1),
param_grid={"n_components": list(range(10, 41, 2)),
"learning_decay": [0.5, 0.7, 0.9]},
cv=[(slice(None), slice(None))], # no cv for unsupervised ml
n_jobs=-1,
verbose=20
)
topic_models_ll.fit(document_term_matrix)
print("---------------- Best Model (Log-Likelihood) ----------------")
print(f"model parameters: {topic_models_ll.best_params_}")
# print(f"perplexity: {round(topic_models_ll.best_estimator_.perplexity(document_term_matrix),2)}") # see comment in 6.3.2
print(f"log_likelihood: {round(topic_models_ll.best_estimator_.score(document_term_matrix),2)}")
---------------- Best Model (Log-Likelihood) ----------------
model parameters: {'learning_decay': 0.5, 'n_components': 28}
log_likelihood: -8385793.37
models_ll = pd.DataFrame(topic_models_ll.cv_results_)
models_ll = models_ll[["param_learning_decay", "param_n_components", "mean_test_score"]]
models_ll.rename(columns={'mean_test_score':'log_likelihood'}, inplace=True)
models_ll.sort_values(by="log_likelihood", ascending=False, inplace=True)
# top models (log likelihood)
models_ll.head(5)
| | param_learning_decay | param_n_components | log_likelihood |
|---|---|---|---|
| 9 | 0.5 | 28 | -8.360676e+06 |
| 24 | 0.7 | 26 | -8.362038e+06 |
| 8 | 0.5 | 26 | -8.365602e+06 |
| 22 | 0.7 | 22 | -8.367235e+06 |
| 38 | 0.9 | 22 | -8.369719e+06 |
df_plot = pd.DataFrame(topic_models_ll.cv_results_)
n_topics = list(range(10,41,2))
ld5 = df_plot[(df_plot.param_learning_decay==0.5)].mean_test_score.tolist()
ld7 = df_plot[(df_plot.param_learning_decay==0.7)].mean_test_score.tolist()
ld9 = df_plot[(df_plot.param_learning_decay==0.9)].mean_test_score.tolist()
plt.figure(figsize=(13, 5))
plt.plot(n_topics, ld5, label="0.5")
plt.plot(n_topics, ld7, label="0.7")
plt.plot(n_topics, ld9, label="0.9")
plt.xlabel("n_components")
plt.ylabel("mean_test_score (log-likelihood)")
plt.legend(title='learning_decay', loc='lower right')
plt.grid()
plt.show()
A possible extension would be to subclass the LDA estimator so that perplexity can be used as the scoring metric for the grid search. This is currently not feasible, as there seems to be a bug in scikit-learn's perplexity calculation.
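A minimal sketch of how this could look once the underlying issue is resolved (hypothetical helper, not used in this notebook): subclass LatentDirichletAllocation and override score() so that GridSearchCV effectively minimizes perplexity.
class LDAWithPerplexityScore(LatentDirichletAllocation):
    """Hypothetical wrapper: use (negative) perplexity as the grid-search score."""
    def score(self, X, y=None):
        # GridSearchCV maximizes the score, so return the negative perplexity
        return -self.perplexity(X)
# this class would then be passed as estimator=LDAWithPerplexityScore(n_jobs=-1) to GridSearchCV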
I decided not to go with the model suggested by the grid search and instead chose the model with n_components=22 and learning_decay=0.7 (the fourth-best model). The main reason is that the improvement in log-likelihood from using more topics (28 in the best case) is not significant enough to justify the additional complexity.
lda_final = LatentDirichletAllocation(
n_components=22,
learning_decay=0.7,
n_jobs=-1
).fit(document_term_matrix)
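To get a quick impression of the fitted topics before visualizing them, the highest-weighted terms per topic can be printed. This is a minimal sketch using lda_final and cv from above; note that older scikit-learn versions use cv.get_feature_names() instead of cv.get_feature_names_out().
# vocabulary of the fitted CountVectorizer as an array
feature_names = np.asarray(cv.get_feature_names_out())
# print the ten highest-weighted terms for each of the 22 topics
n_top_words = 10
for topic_idx, weights in enumerate(lda_final.components_, start=1):
    top_terms = feature_names[np.argsort(weights)[::-1][:n_top_words]]
    print(f"topic {topic_idx}: {', '.join(top_terms)}")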
I used pyLDAvis for visualization purposes. I have saved the results as HTML and they can be viewed here.
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_final, document_term_matrix, cv, mds='tsne')
pyLDAvis.save_html(panel, 'lda_final.html')
# assign each article its dominant topic (argmax of the document-topic distribution, shifted to start at 1)
topic = np.argmax(lda_final.transform(document_term_matrix), axis=1)+1
tok_cor_top = pd.DataFrame(
data={"corpus": corpus, "tokens": tokens_final, "topic": topic},
index=corpus.index
)
df_topics = df.merge(tok_cor_top, left_index=True, right_index=True, how="inner").reset_index(drop=True)
df_topics[["date", "title", "text", "topic"]].head()
| | date | title | text | topic |
|---|---|---|---|---|
| 0 | 2020-03-28 | Coronavirus death toll in Italy's Lombardy ris... | MILAN (Reuters) - The death toll from an outbr... | 12 |
| 1 | 2020-03-28 | Northern Ireland brings in tough measures to f... | BELFAST (Reuters) - Northern Ireland will intr... | 12 |
| 2 | 2020-03-27 | British ministers won't be tested for virus un... | LONDON (Reuters) - Boris Johnson’s top team of... | 12 |
| 3 | 2020-03-27 | Johnson's central role to UK virus response en... | LONDON (Reuters) - People with a central role ... | 12 |
| 4 | 2020-03-27 | Foreigners face suspicion in China as coronavi... | SHENZHEN, China/SHANGHAI (Reuters) - Francesca... | 12 |
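As mentioned at the beginning, the fitted model could also be used to classify future articles. A minimal sketch, reusing preprocess_text, nlp, cv and lda_final from above (the article text here is a made-up placeholder, not taken from the dataset):
# hypothetical new article to be assigned to one of the 22 topics
new_article = "The European Central Bank left interest rates unchanged on Thursday."
new_tokens = preprocess_text(
    txt=new_article,
    spacy_lang_obj=nlp,
    min_word_length=3,
    lemmatize=True,
    stem=False,
    ngrams_n_list=[2]
)
# transform with the already-fitted CountVectorizer and LDA model
new_dtm = cv.transform([" ".join(new_tokens)])
new_topic = np.argmax(lda_final.transform(new_dtm), axis=1) + 1
print(f"assigned topic: {new_topic[0]}")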