Elective subject 1: Quantitative Text Analysis in R (QTAR)

General information

Course code: MNM3SEEC-PDW1-Z18
Erasmus / ISCED code: (no data) / (no data)
Course name: Elective subject 1: Quantitative Text Analysis in R (QTAR)
Unit: Akademia Finansów i Biznesu Vistula
Groups:
ECTS credits and other: 3.00 OR 4.00 (depending on the programme)
Basic information on the rules for assigning ECTS credits:
  • the annual student workload required to achieve the intended learning outcomes for a given stage of studies is 1500-1800 h, which corresponds to 60 ECTS;
  • the weekly student workload is 45 h;
  • 1 ECTS credit corresponds to 25-30 hours of student work needed to achieve the intended learning outcomes;
  • the weekly student workload required to achieve the intended learning outcomes yields 1.5 ECTS;
  • the workload required to pass a course assigned 3 ECTS constitutes 10% of the student's semester workload.

Language of instruction: English
Short description: (in English only)

The objective of the course is to introduce students to the most popular methods and models used in text mining. Students will write programs in R to collect large volumes of text data (by scraping and crawling websites) and to automatically analyze thousands of pages of text.

Prerequisites: knowledge of R at an introductory to intermediate level (or a free online course taken before the QTAR course) and basic programming skills.

Full description: (in English only)

The era of well-structured data fed into business analytics software is over. Today the vast majority of data about people, clients, companies, or events is unstructured: text, images, or video. Companies that analyze unstructured data (such as customer opinions about their products) can operate much more efficiently than those that do not. Financial institutions apply unstructured data analysis to predict market behavior or to score loan applicants, governments use it to check the quality of their services, and in the healthcare sector it helps to provide better treatment. It is also used in politics, as the scandal surrounding the Trump election, Cambridge Analytica, and Facebook shows. This course will show how to apply various automated quantitative text analysis methods (also known as text mining) to text data with the help of R. While the course focuses on applications in finance, economics, marketing, and political science, the text mining models presented are also used by academic researchers in other fields, such as journalism, language studies, and history. The course will discuss models that belong to the so-called “bag-of-words” category.

The course will be case-based. After each model is introduced, participants will work through a prepared case in R and then apply the model to their own text data (of any type: call logs, product site comments, emails, large corpora of reports or news, etc.). Cases will be of various types, for example: checking the popularity of prominent politicians in the news; analyzing the brands of selected companies; automatically detecting the topics covered in a large number of texts; predicting future macroeconomic data based on news; and analyzing new technology trends. Cases can be selected according to the interests and backgrounds of participants. Cases can be in English, Polish, or Russian.

After completing the course, participants should be able to use R to collect text data from the Internet and to analyze large corpora of texts in 2-3 languages in various areas, including business and politics.

The course will consist of several parts:

1. Data acquisition.

In this part we will show how to crawl (scrape) various websites to collect large amounts of text data. Various techniques will be shown, from basic scraping to simulating human behavior with RSelenium. As preparation for effective crawling, we will introduce XPath and regular expressions.
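As an illustration of how XPath and regular expressions work together in scraping, here is a minimal sketch. The course itself works in R; this example uses only Python's standard library, and the page fragment, class names, and the function `extract_articles` are hypothetical. An XPath expression selects the relevant nodes, and a regular expression then pulls a date out of free text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a downloaded page.
PAGE = """
<html>
  <body>
    <div class="article">
      <h2>Central bank holds rates</h2>
      <p class="date">Published 2018-10-03</p>
    </div>
    <div class="article">
      <h2>Retail sales rise</h2>
      <p class="date">Published 2018-10-05</p>
    </div>
    <div class="advert">
      <h2>Buy now</h2>
    </div>
  </body>
</html>
"""


def extract_articles(page: str) -> list[tuple[str, str]]:
    """Return (title, ISO date) pairs for every div with class 'article'."""
    root = ET.fromstring(page.strip())
    date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
    results = []
    # XPath: any <div> in the tree whose class attribute equals 'article'.
    for node in root.findall(".//div[@class='article']"):
        title = node.findtext("h2", default="").strip()
        date_text = node.findtext("p[@class='date']", default="")
        # Regular expression: pick the date out of the surrounding text.
        match = date_pattern.search(date_text)
        results.append((title, match.group(0) if match else ""))
    return results


articles = extract_articles(PAGE)
for title, date in articles:
    print(date, title)
```

Real pages are rarely well-formed XML, so in practice a tolerant HTML parser is needed; the XPath and regular expression logic stays the same.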

2. Data cleaning and structuring

Standard techniques for cleaning text data will be introduced (such as removing punctuation and stopwords). The text corpus will be transformed into a document-term matrix, and two word normalization methods will be shown: stemming and lemmatization.
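The cleaning steps and the document-term matrix can be sketched as follows. This is a toy illustration in standard-library Python rather than the R workflow used in class; the stopword list, the example documents, and the function `naive_stem` are assumptions made for the example, and `naive_stem` is a deliberately crude suffix stripper, not a real stemming algorithm.

```python
import re
from collections import Counter

# Toy stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "of", "and", "is", "are", "in"}


def naive_stem(token: str) -> str:
    """Strip a few English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean(text: str) -> list[str]:
    """Lowercase, remove punctuation, drop stopwords, stem the rest."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [naive_stem(t) for t in text.split() if t not in STOPWORDS]


def document_term_matrix(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return the sorted vocabulary and one row of term counts per document."""
    counts = [Counter(clean(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


DOCS = ["The banks are raising rates.", "Rates of the bank, rising!"]
vocab, matrix = document_term_matrix(DOCS)
print(vocab)
for row in matrix:
    print(row)
```

Stemming maps "banks" and "bank" to the same column, but it also leaves "raising" and "rising" as the distinct non-words "rais" and "ris"; lemmatization, which maps words to dictionary forms, avoids such artifacts at the cost of needing a language-specific lexicon.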

3. Supervised text mining models

These models require initial human input before a large corpus of texts is analyzed automatically, such as formulating a dictionary (for sentiment analysis, large sentiment lexicons for Polish and Russian will be provided) or reading some texts and indicating their position on a scale. Examples of such models are sentiment analysis, topic coding, and Wordscores.
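A dictionary-based sentiment score, the simplest of these approaches, can be sketched as below. This is a minimal illustration in standard-library Python, not the course's R code; the two tiny lexicons and the example texts are invented for the example, and the scoring rule (positive minus negative matches, divided by all matches) is one common convention among several.

```python
import re

# Hypothetical toy lexicons; real sentiment lexicons hold thousands of entries.
POSITIVE = {"good", "strong", "growth"}
NEGATIVE = {"bad", "weak", "crisis"}


def sentiment_score(text: str) -> float:
    """Score in [-1, 1]: (positive - negative) / matched words, 0.0 if none match."""
    tokens = re.findall(r"\w+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched


TEXTS = [
    "Strong growth and good results.",
    "Weak demand deepens the crisis.",
    "Strong exports, weak imports.",
    "The committee met on Tuesday.",
]
scores = [sentiment_score(t) for t in TEXTS]
for text, score in zip(TEXTS, scores):
    print(f"{score:+.1f}  {text}")
```

The human input here is the lexicon itself: once it is fixed, any number of texts can be scored without further reading.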

4. Unsupervised text mining models

These models do not require human input before the analysis; all the work is done by the computer. However, some work is needed when interpreting the results. Examples of unsupervised models are Wordfish and Expressed Agenda Models.

Total: 30 hours

Literature: (in English only)

Text analysis in R. http://kenbenoit.net/pdfs/text_analysis_in_R.pdf

Assessment methods and criteria: (in English only)

Classes will be organized as follows:

- students will self-organize into teams of about 3 students each

- first I will discuss the theoretical foundations of each model, then present a case in R (we will use the free RStudio software)

- teams will be required to apply the learned model/method to the data sets provided (or collected by students); randomly selected team(s) will give a short presentation at the beginning of the following class

- during the last class(es) all teams will present their capstone projects

Each presentation (short and final) will be graded as follows: half of the grade will be given by the class in an anonymous poll, and half will be given by me.

Criteria for judging the project presentations:

- clearly and properly stated project goal

- use of an appropriate model to achieve the goal

- application of appropriate data science techniques and correct R code for model estimation

- correct interpretation of model results

- correct connection between model results and the project goal

- clarity and quality of presentation (slides and team performance)

- whether the presentation was exciting or boring

The final course grade will be calculated as follows:

Short presentation of a given model (team) – 30%

Presentation of the final capstone project (team) – 70%
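The weighting can be illustrated with a short worked calculation. This sketch assumes a hypothetical 0-100 point scale and made-up scores (the syllabus does not state the scale); only the 50/50 split between the class poll and the instructor and the 30/70 split between the two presentations come from the rules above.

```python
def presentation_grade(class_poll: float, instructor: float) -> float:
    """Half of the grade from the anonymous class poll, half from the instructor."""
    return (class_poll + instructor) / 2


def final_grade(short: float, capstone: float) -> float:
    """Short presentation counts for 30%, capstone project for 70%."""
    return (30 * short + 70 * capstone) / 100


# Hypothetical scores on an assumed 0-100 scale.
short = presentation_grade(class_poll=80, instructor=90)
capstone = presentation_grade(class_poll=70, instructor=80)
total = final_grade(short, capstone)
print(short, capstone, total)
```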

Classes in the "Winter semester 2018/2019" cycle (ended)

Period: 2018-10-01 - 2019-02-01
Class type: Lecture, 30 hours
Coordinators: Krzysztof Rybiński
Group instructors: Krzysztof Rybiński
Assessment: Course - Exam/graded credit/pass on the zal-std2 scale
Lecture - Exam/graded credit/pass on the zal-std2 scale
Course descriptions in USOS and USOSweb are protected by copyright.
The copyright owner is Akademia Finansów i Biznesu Vistula.