Text Mining

Prof. Gianluca Moro, Eng. Nicola Piscaglia
Module of the Master in Data Science
of the Bologna Business School

Module Contents

The text mining module focuses on knowledge discovery from large corpora of unstructured text, which is fundamental to many natural language processing (NLP) tasks, such as text representation, text indexing and classification, named entity extraction, topic analysis, semantic similarity search, explanation of phenomena of interest (a.k.a. descriptive text mining), sentiment analysis and opinion mining, text summarisation, and chatbot and digital assistant design.

The learning outcome of the module is the ability to define and implement text mining tasks: from text processing and representation, first with traditional approaches and then with novel neural language models, up to knowledge discovery with data science methods and machine and deep learning algorithms. The sources considered include tweets, Facebook posts, reviews, web pages, emails, loan requests, legal cases, news and documents in general. The module introduces non-contextual language models based on word embeddings, such as GloVe and word2vec, and memory-based neural networks that are particularly effective for textual data, namely recurrent neural networks such as LSTM, GRU and BiLSTM. It then introduces the attention mechanism and the transformer architecture, with contextual word embeddings based on BERT applied to text classification, summarisation and translation. Last but not least, the module illustrates the transfer learning paradigm, which exploits and fine-tunes existing models in target domains that are semantically different from their source training domains; this is particularly useful for overcoming the lack of labeled data in the target domain.

The laboratory activities, which are carried out with WEKA, R and Python, mainly using Google Colab, cover the following case studies:

  • identifying, from unstructured technical reports on aircraft accidents, the factors that contribute to serious accidents
  • classification of documents by topic with several machine learning algorithms
  • sentiment analysis and opinion mining of unlabeled text sets from Twitter and labeled text sets from TripAdvisor, Edmunds and Amazon
  • language models, deep neural networks and transfer learning in opinion mining and sentiment analysis
  • (optional) text summarisation and applications to real legal cases with state-of-the-art deep learning solutions.
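
To give a flavour of the "classification of documents by topic" case study with a traditional text representation, here is a minimal Python sketch, not taken from the lab materials. It assumes a plain TF-IDF bag-of-words representation and a nearest-neighbour rule based on cosine similarity; the function names (tokenize, tfidf_vectors, cosine, classify) and the toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would also
    # handle punctuation, stop words and stemming.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(query, train_docs, train_labels):
    # Vectorise the query together with the training documents so that
    # all vectors share the same IDF weights (an illustrative shortcut).
    vectors = tfidf_vectors(train_docs + [query])
    query_vec = vectors[-1]
    scores = [cosine(query_vec, v) for v in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return train_labels[best]


# Hypothetical toy corpus, for illustration only.
TRAIN_DOCS = [
    "the engine failed during takeoff",
    "the pilot reported engine smoke",
    "the hotel room was clean and quiet",
    "great hotel with friendly staff",
]
TRAIN_LABELS = ["aviation", "aviation", "travel", "travel"]

if __name__ == "__main__":
    print(classify("engine smoke during landing", TRAIN_DOCS, TRAIN_LABELS))
```

In practice, the labs would more likely rely on library implementations of the same ideas; the sketch only makes the representation and similarity steps explicit.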

Lectures and Labs

Readings

Slides, lab materials and papers will be supplied by the teacher.

Suggested Readings:

Assessment

The exam consists of an assessed team lab project and an individual lab exercise with WEKA, both assigned by the teacher on one or more topics included in the syllabus.

Class timetable

Syllabus

Online Text Sets

Here is a non-exhaustive list of text sets for optional practice:

Contact and student meeting

Prof. Gianluca Moro
16:30-18:00 each working day; please send an email to schedule a meeting