
Analysing online review data – Part 2

May 9, 2019 Python Text Analysis

Previously, we developed a module that takes care of retrieving review data from Tripadvisor or Yelp and returning it as a DataFrame. Now we want to do some analysis on this data. In this part of the series, we will do some topic modeling using Latent Dirichlet Allocation (LDA) and create a word cloud.

Steps

  1. Imports
  2. Silence Deprecation Warnings
  3. Get The Review Data
  4. Get A List Of English Stop Words
  5. Clean And Generate List Representations Of The Reviews
  6. Look for Bigrams and Trigrams
  7. Prepare For The LDA Model
  8. Apply And Visualise The LDA Model
  9. Create A Word Cloud Of The LDA Model
  10. Putting It All Together
  11. Supplementary Material

0-Requirements

import platform
print('Python version: {}'.format(platform.python_version()))
Python version: 3.6.4 

1-Imports

First we will import the required libraries that will be useful during the analysis. We will also import the module developed in Part 1. The module can also be found here. Simply take the WebScraper.py file and import it into a project as

import WebScraper

We import the rest of the libraries we need

import nltk # For getting stopwords
nltk.download('stopwords') # Only needs to be run once on the machine
import numpy as np # Not required but may be useful
import pandas as pd # For DataFrames
import gensim # For LDA and finding Bigrams and Trigrams
import WebScraper # The module from Part 1
from wordcloud import WordCloud # For generating a word cloud

import matplotlib.pyplot as plt # General plotting

import pyLDAvis # For visualising the Topics
import pyLDAvis.gensim # For visualising the Topics

import warnings # So that we can override the deprecation warning

2-Silence Deprecation Warnings

It turns out that pyLDAvis gives a deprecation warning which is repeated, probably because of a loop within the library. It is not visually pleasing to have hundreds of deprecation warnings displayed. This is what a deprecation warning looks like

warnings.warn('hey',category = DeprecationWarning)
This is a Deprecation Warning

We can specify to ignore this category of warnings

warnings.filterwarnings("ignore",category=DeprecationWarning)

warnings.warn('hey',category=UserWarning)
warnings.warn('hey',category=DeprecationWarning)
This is a User Warning. The Deprecation Warning is now silenced
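Should we ever want deprecation warnings back, the filters can be reset (a quick sketch; we keep them silenced for the rest of this analysis)

# Restore the default warning behaviour
warnings.resetwarnings()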

3-Get The Review Data

This is where we use the module created in Part 1. Since we have imported the module into the project, we can create the WebScraper object and gather the data in only a few lines

# Define the urls to the site of interest
url1 = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews"
url2 = "-The_House_of_Dionysus-Paphos_Paphos_District.html"

# Create the WebScraper object
ms = WebScraper.WebScraper(site='tripadvisor',url1=url1,
                      url2=url2,increment_string1="-or",increment_string2="",
                      total_pages=20,increment=10,silent=False)

# Get the review data from all the pages
ms.fullscraper()

# Store the review data
review_data = ms.all_reviews
The progress from the web scraping activity

We can now view the review information in DataFrame form

review_data.head()
The DataFrame stored on the WebScraper object in the WebScraper module after the scraping activity is performed

4-Get A List Of English Stop Words

Stop words are very common words that give little to no information about a piece of text in relation to the investigation we are carrying out. In English, these are words such as ‘a’, ‘the’, ‘and’ and so on. The NLTK library we imported already comes with a list of stop words for the English language. This makes analysing the text in our ‘fullreview’ column much cleaner and easier. Let’s obtain the list of stop words and display the first 10

stopwords = nltk.corpus.stopwords.words('english')
stopwords[0:10]
 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"] 

5-Clean And Generate List Representations Of The Reviews

We will use the gensim library to clean the ‘fullreview’ column of our DataFrame by removing punctuation. A quick example shows how this can be done using the gensim.utils.simple_preprocess method to clean a review and represent it in the form of a list

gensim.utils.simple_preprocess("Where's my dog?",deacc = True)
 ['where', 'my', 'dog'] 

The deacc = True option also applies the gensim.utils.deaccent method in order to remove accent characters. Here’s an example of the deaccent method taken from the documentation

gensim.utils.deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
 'Sef chomutovskych komunistu dostal postou bily prasek' 

Let’s see how this works on a particular review from our DataFrame. The review we want to clean is

review_data.iloc[0,0]
 'This is awesome and is everything books etc say about the place.  Allow around 2 hours to get around everything and the park.  Take plenty of water and wear a hat!' 

After cleaning this review, we have

gensim.utils.simple_preprocess(review_data.iloc[0,0],deacc = True)
 ['this',  'is',  'awesome',  'and',  'is',  'everything',  'books',  'etc',  'say',  'about',  'the',  'place',  'allow',  'around',  'hours',  'to',  'get',  'around',  'everything',  'and',  'the',  'park',  'take',  'plenty',  'of',  'water',  'and',  'wear',  'hat'] 

We can then remove stopwords from this review/document (notice the removal of ‘this’, ‘is’, ‘and’ and similar words.)

[word for word in gensim.utils.simple_preprocess(review_data.iloc[0,0],deacc = True) if word not in stopwords]
['awesome',  'everything',  'books',  'etc',  'say',  'place',  'allow',  'around',  'hours',  'get',  'around',  'everything',  'park',  'take',  'plenty',  'water',  'wear',  'hat'] 

Let’s create a function to package this up for us

def cleanDocument(x, stopwords):
    return [word for word in gensim.utils.simple_preprocess(x,deacc = True) if word not in stopwords]

We can then create a new column for the list representation of the document for each document in our DataFrame. Let’s call the column ‘List’

review_data['List'] = review_data['fullreview'].apply(lambda x: cleanDocument(x,stopwords))

review_data.head()
The gensim.utils.simple_preprocess method is applied to the fullreview column to produce the List column

Finally, we have a column which is a list representation of each review with punctuation, accents and stop words removed.

6-Look for Bigrams and Trigrams

Bigrams are pairs of words that often occur together. Similarly, trigrams are sets of three words that frequently occur together. This can be extended to a larger number of words occurring together (n-grams). Some examples include ‘New York’, ‘Text Analysis’, ‘European Union’, ‘bear in mind’ and so on. We can catch these n-grams in a particular text using gensim.models.Phrases

# Create bigrams
bigrams = gensim.models.Phrases(review_data['List'], min_count=3, threshold=50)
bigrams_Phrases= gensim.models.phrases.Phraser(bigrams)

# Create trigrams
trigrams = gensim.models.Phrases(bigrams_Phrases[list(review_data['List'])], min_count=3, threshold=50) 
trigram_Phrases = gensim.models.phrases.Phraser(trigrams)

The min_count argument specifies the minimum number of times a bigram (trigram) should appear before it is accepted as a bigram (trigram). The threshold determines how difficult it is for a pair of words to be classified as a bigram. To give a clearer example, suppose we have the following list of reviews

a = []

a.append('new york is amazing')
a.append("Yeah I know, it's all about new york")
a.append("What about the tower in new york?")
a.append("new york is the place to be apparently")
a.append("Some more words and new york some more words")
a.append("I loved the show")
a.append("specially in new york")

We first apply the clean function defined above

a_list = list(map(lambda x: cleanDocument(x,stopwords),a))

We then find the bigrams

bigram = gensim.models.Phrases(a_list,min_count=1, threshold=1)
bigram_phraser = gensim.models.phrases.Phraser(bigram)

Now we can use bigram_phraser to find bigrams in a particular text

# Clean the review
aReview = cleanDocument('Is new york the best place or what?',stopwords)

# Apply bigrams
print(bigram_phraser[aReview])
 ['new_york', 'best', 'place'] 

We successfully identified new york as a bigram. We can write a function to do this work for us

def createGrams(ls):
    """
    This function expects a list (or series) of lists of words, each being a list representation of a document.
    It returns the documents with bigrams applied and the documents with trigrams applied.
    """
    # Create bigrams
    bigrams = gensim.models.Phrases(ls, min_count=3, threshold=50)
    bigrams_Phrases= gensim.models.phrases.Phraser(bigrams)

    # Create trigrams
    trigrams = gensim.models.Phrases(bigrams_Phrases[list(ls)], min_count=3, threshold=50) 
    trigram_Phrases = gensim.models.phrases.Phraser(trigrams)
    
    return [bigrams_Phrases[i] for i in list(ls)],[trigram_Phrases[i] for i in list(ls)]

We can then simply pass the ‘List’ column of our DataFrame to this function

createGrams(review_data['List'])
 ([['awesome',    'everything',    'books',    'etc',    'say',    'place',    'allow',    'around',    'hours',    'get',    'around',    'everything',    'park',    'take',    'plenty',    'water',    'wear',    'hat',    'spectacular'],   ['increadible',    'mosiacs',    'large',    'site', ...  
... , 'people',    'would',    'lived',    'centuries',    'ago',    'much',    'see',    'mosaics',    'particular',    'interest',    'whole',    'experience',    'enhanced',    'spring',    'flowers',    'stepping',    'back',    'time']]) 

While we’re at it, we might as well combine the createGrams function with the cleanDocument function into another function

def cleanAndCreateGrams(ls,stopwords):
    return(createGrams(ls.apply(lambda x: cleanDocument(x,stopwords)))[0])

and create a new column with this applied to it

review_data['GramList'] = cleanAndCreateGrams(review_data['fullreview'],stopwords)

review_data.head()
The DataFrame after cleaning, conversion to list, removal of stop words and detection of bigrams and trigrams have been applied

7-Prepare For The LDA Model

Prior to this section, the preparation of the text was about cleaning the documents and transforming them into list representations. The LDA model we will be using in the next section, part of the gensim package, expects a corpus and an id2word dictionary. To create the id2word dictionary, we use the gensim.corpora.Dictionary method, which takes a list of documents in list representation (the ‘GramList’ column we created above) and assigns a unique integer id to each word. To create the corpus, where each document is represented as a list of (word id, frequency) pairs and a word keeps the same id over all documents, we use the id2word.doc2bow method

# Create Dictionary
id2word = gensim.corpora.Dictionary(review_data['GramList'])

# Create Corpus
texts = review_data['GramList']

# Term Frequency in Document
corpus = [id2word.doc2bow(text) for text in texts]

id2word is a dictionary and bow stands for Bag Of Words. We can look up the frequency of a word within a particular document by using its id

print(f"The frequency of '{id2word[1]}' is {corpus[0][1][1]}")
The frequency of 'around' is 2 

and here is an entire document in id, frequency representation

texts[100]
['great',  'place',  'walk',  'around',  'see',  'fantastical',  'well',  'preserved',  'floor',  'mosaics',  'little',  'shade',  'avoid',  'midday',  'hot',  'small',  'vending',  'machine',  'area',  'drinks',  'hours',  'saw',  'area',  'fantastic',  'mosaics'] 
id2word.doc2bow(texts[100])
[(1, 1),  (8, 1),  (10, 1),  (25, 1),  (32, 1),  (47, 2),  (175, 1),  (184, 1),  (203, 1),  (233, 1),  (258, 2),  (277, 1),  (299, 1),  (493, 1),  (532, 1),  (568, 1),  (627, 1),  (693, 1),  (694, 1),  (695, 1),  (696, 1),  (697, 1),  (698, 1)] 

We can transform the ids back into the original words

# from id to word
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:2]]
[[('allow', 1),   ('around', 2),   ('awesome', 1),   ('books', 1),   ('etc', 1),   ('everything', 2),   ('get', 1),   ('hat', 1),   ('hours', 1),   ('park', 1),   ('place', 1),   ('plenty', 1),   ('say', 1),   ('spectacular', 1),   ('take', 1),   ('water', 1),   ('wear', 1)],  [('along', 1),   ('also', 1),   ('anyone', 1),   ('bargain', 1),   ('beach', 1),   ('coral', 1),   ('cost', 1),   ('euro', 1),   ('great', 1),   ('increadible', 1),   ('large', 1),   ('mosiacs', 1),   ('must', 1),   ('paphos', 2),   ('reccoment', 1),   ('see', 1),   ('site', 1),   ('towards', 1),   ('views', 1),   ('visiting', 1),   ('would', 1)]] 

The above shows 2 reviews/documents.

8-Apply And Visualise The LDA Model

LDA assumes that each document is composed of a collection of topics with varying probabilities and that each topic is a collection of words with varying probabilities. Now that we have the requirements for running the LDA model (the id2word dictionary and the corpus), let’s go ahead and apply the gensim.models.ldamodel.LdaModel method

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=4, 
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

Strictly speaking, the corpus used above is the training data the LDA model will use to estimate the parameters of the Dirichlet distribution inherent in the model. Here, we have specified the number of topics to be 4. This is the number of topics the model will be looking to extract. The per_word_topics option specifies that we want a list of the most likely topics for each word, which we can inspect as in the sketch below.
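As a minimal sketch (using the lda_model and corpus defined above), we can inspect the topic mixture of the first review and, because per_word_topics was requested, the most likely topics for each word id

# Topic mixture and per-word topic assignments for the first review
doc_topics, word_topics, word_phis = lda_model.get_document_topics(corpus[0], per_word_topics=True)

print(doc_topics)       # (topic id, probability) pairs for this document
print(word_topics[:5])  # (word id, [most likely topic ids]) for the first few words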

Due to the way LDA is implemented using prior distributions (in this case it is possible to specify the hyperparameters alpha and eta), the model can be updated with further training data using the lda_model.update method.
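As a minimal sketch, assuming new_reviews is a hypothetical list of additional reviews that have already been cleaned into list representation, an update could look like

# Hypothetical further training data converted to bag-of-words using the existing dictionary
new_corpus = [id2word.doc2bow(doc) for doc in new_reviews]

# Continue training the existing model on the new documents
lda_model.update(new_corpus)

We can print the key words in the 4 topics to see how much weighting each word contributes to a topic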

# Print the Keyword in the 4 topics
print(lda_model.print_topics())
[(0, '0.008*"wonderful" + 0.007*"real" + 0.007*"site" + 0.007*"especially" + 0.007*"restored" + 0.007*"visit" + 0.006*"around" + 0.006*"visited" + 0.006*"work" + 0.006*"first"'), (1, '0.027*"mosaics" + 0.022*"site" + 0.014*"ruins" + 0.013*"see" + 0.013*"well" + 0.013*"roman" + 0.012*"interesting" + 0.011*"park" + 0.010*"buildings" + 0.010*"visit"'), (2, '0.021*"mosaics" + 0.019*"see" + 0.017*"history" + 0.016*"must" + 0.016*"good" + 0.016*"site" + 0.016*"well" + 0.016*"place" + 0.016*"visit" + 0.015*"worth"'), (3, '0.039*"mosaics" + 0.028*"visit" + 0.020*"interesting" + 0.018*"paphos" + 0.018*"park" + 0.017*"site" + 0.017*"see" + 0.016*"archaeological" + 0.016*"well" + 0.015*"place"')] 

One way to assess the performance of the model (in particular, whether we have chosen a sensible number of topics) is the coherence score

# Compute Coherence Score
coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=review_data['GramList'], dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: {}'.format(coherence_lda))
Coherence Score:  0.2894762136456679 

Since we have a way of measuring the performance of the model, we can loop over a range of numbers of topics and choose the one with the best coherence score

max_coherence_score = 0
best_n_topics = -1
best_model = None
for i in range(2,6): 
    # Build LDA model
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                               id2word=id2word,
                                               num_topics=i, 
                                               chunksize=100,
                                               alpha='auto',
                                               per_word_topics=True)
    # Compute Coherence Score
    coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=review_data['GramList'], dictionary=id2word, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    
    if max_coherence_score < coherence_lda:
        max_coherence_score = coherence_lda
        best_n_topics = i
        best_model = lda_model
        
    print('\n The Coherence Score with {} topics is {}'.format(i,coherence_lda))
The Coherence Score with 2 topics is 0.2323484892131155  

The Coherence Score with 3 topics is 0.21912753596868087  

The Coherence Score with 4 topics is 0.250791284519631  

The Coherence Score with 5 topics is 0.2402133610574262 

Now that we have the best LDA model, let’s visualise it

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(best_model, corpus, id2word)
vis
This is a screenshot from an interactive visualisation thanks to the pyLDAvis library. Each circle represents a topic and selecting a topic displays the most important words that make up that topic

Let’s consolidate all of this into a function

def ldaModel(x):
    # Create Dictionary
    id2word = gensim.corpora.Dictionary(x)

    # Create Corpus
    texts = x

    # Term Document Frequency
    corpus = [id2word.doc2bow(text) for text in texts]

    max_coherence_score = 0
    best_n_topics = -1
    best_model = None
    for i in range(2,6): 
        # Build LDA model
        lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                   id2word=id2word,
                                                   num_topics=i, 
                                                   random_state=100,
                                                   update_every=1,
                                                   chunksize=100,
                                                   passes=10,
                                                   alpha='auto',
                                                   per_word_topics=True)
        # Compute Coherence Score
        coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=x, dictionary=id2word, coherence='c_v')
        coherence_lda = coherence_model_lda.get_coherence()

        if max_coherence_score < coherence_lda:
            max_coherence_score = coherence_lda
            best_n_topics = i
            best_model = lda_model

        print('\n The Coherence Score with {} topics is {}'.format(i,coherence_lda))

    # Visualize the topics
    pyLDAvis.enable_notebook()
    vis = pyLDAvis.gensim.prepare(best_model, corpus, id2word)

    return best_model, vis

Better yet, let’s create a function that does the cleaning as well as fitting the LDA model

def ldaFromReviews(x, stopwords):
    cleanedReviewsAsLists = cleanAndCreateGrams(x,stopwords)
    return ldaModel(cleanedReviewsAsLists)

And this is how we would apply it

model,ldavis = ldaFromReviews(review_data['fullreview'],stopwords)
The Coherence Score with 2 topics is 0.21609545591563656  

The Coherence Score with 3 topics is 0.3024847517693075  

The Coherence Score with 4 topics is 0.2894762136456679  

The Coherence Score with 5 topics is 0.2692166142993114 

model now contains our trained LDA model and ldavis contains the visualisation.
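If we want to keep the interactive visualisation outside the notebook, one option is pyLDAvis' save_html (a quick sketch; the file name is our own choice)

# Save the interactive topic visualisation to a standalone HTML file
pyLDAvis.save_html(ldavis, 'lda_topics.html')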

9-Create A Word Cloud Of The LDA Model

Another way to represent the popular words in a corpus is with a word cloud. Let’s create one big dictionary mapping each word to its frequency throughout the corpus

freq_dict = []
[freq_dict.extend(i) for i in corpus[:]]

frequency_dict = dict()
for i,j in freq_dict:
    key = id2word[i]
    if key in frequency_dict:
        frequency_dict[key] += j
    else:
        frequency_dict[key] = j

Now we can use the wordcloud library to visualise the words and their prevalence

wordcloud = WordCloud(background_color = 'white',
                          relative_scaling = 1.0
                          ).generate_from_frequencies(frequency_dict)

wordcloud.to_image()
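Since matplotlib was imported earlier, the word cloud can also be displayed (or saved) as a regular figure. A minimal sketch

# Render the word cloud with matplotlib
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()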

And of course we can consolidate this into a couple of functions which do this for us

def generate_wordcloud_from_freq(frequency_dict):
    """A function to create a wordcloud according to the text frequencies as well as the text itself"""
    wordcloud = WordCloud(background_color = 'white',
                          relative_scaling = 1.0
                          ).generate_from_frequencies(frequency_dict)

    return wordcloud

def generate_wordcloud(id2word,corpus):
    """A function to build a corpus-wide word-frequency dictionary and create a wordcloud from it"""
    freq_dict = []
    [freq_dict.extend(i) for i in corpus[:]]

    
    frequency_dict = dict()
    for i,j in freq_dict:
        key = id2word[i]
        if key in frequency_dict:
            frequency_dict[key] += j
        else:
            frequency_dict[key] = j
            
    return generate_wordcloud_from_freq(frequency_dict)

All we have to do then is to call this function

wc = generate_wordcloud(id2word,corpus)

wc.to_image()

10-Putting It All Together

At the bottom of this article is a module which incorporates what we’ve seen above (it can also be found on GitHub: https://github.com/TanselArif-21/Topic-Modeling ). Here’s a demonstration of the usage:

The main page we’re interested in is https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews-The_House_of_Dionysus-Paphos_Paphos_District.html

We first import the WebScraper.py module from here or from Part 1 into our script.

import WebScraper

If we click on the next page on Tripadvisor for this url, we see a pattern. The url of the next page is

https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews-or10-The_House_of_Dionysus-Paphos_Paphos_District.html

We immediately see 4 parts to the url:

  1. https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews
  2. -or
  3. 10
  4. -The_House_of_Dionysus-Paphos_Paphos_District.html

We can use the WebScraper to scrape 20 pages (that’s 200 reviews), incrementing by 10 each time

url1 = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews"
url2 = "-The_House_of_Dionysus-Paphos_Paphos_District.html"
increment_string1="-or"
total_pages=20
increment=10
 
myScraper = WebScraper.WebScraper(site='tripadvisor',url1=url1,url2=url2,
increment_string1=increment_string1,
total_pages=total_pages,increment=increment,silent=False)
 
myScraper.fullscraper()

review_data = myScraper.all_reviews

Now we can import the TopicModeling module from here or using the code in the next section

import TopicModeling

Then we can create the TopicModeling object and the visualisations

myTopicModel = TopicModeling.TopicModeling(review_data)
myTopicModel.ldaFromReviews()
myTopicModel.generate_wordcloud()
The Coherence Score with 2 topics is 0.17248626808884526  

The Coherence Score with 3 topics is 0.19218773360003868  

The Coherence Score with 4 topics is 0.2017723166654421  

The Coherence Score with 5 topics is 0.22960310254410485 

The visualisation concerning different topics can be obtained with myTopicModel.ldavis and the wordcloud can be visualised with the method myTopicModel.showWordCloud()
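For example, in a Jupyter notebook

# The interactive topic visualisation renders directly in the notebook
myTopicModel.ldavis

# Display the word cloud as an image
myTopicModel.showWordCloud()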

11-Supplementary Material

import nltk
nltk.download('stopwords')
import numpy as np
import pandas as pd
import gensim
from wordcloud import WordCloud
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
import warnings

print('Filtering Deprecation Warnings!')
warnings.filterwarnings("ignore",category=DeprecationWarning)

class TopicModeling:
    '''
    This class can be used to carry out LDA and generate word clouds
    for visualisation.

    Example Usage (review_data is a dataframe with a column called 'fullreview'):
    import TopicModeling
    myTopicModel = TopicModeling.TopicModeling(review_data)
    myTopicModel.ldaFromReviews()
    myTopicModel.generate_wordcloud()
    '''

    def __init__(self, df, review_column = 'fullreview'):
        '''
        Constructor.
        :param df: this is a dataframe with a column containing reviews
        :param review_column: the name of the review column in the passed in df
        '''

        # Get the stopwords
        self.stopwords = nltk.corpus.stopwords.words('english')

        # Attach a copy of the dataframe to this object
        self.df = df.copy()

        # Save the column name to be used for the reviews
        self.review_column = review_column

        # This will be the corpus
        self.corpus = None

        # This will be the ids of the words
        self.id2word = None

    def cleanDocument(self, x):
        '''
        This method takes a document (single review), cleans it and turns
        it in to a list of words
        :param x: a document (review) as a string
        '''

        return [word for word in gensim.utils.simple_preprocess(x,deacc = True)
                if word not in self.stopwords]

    def createGrams(self, ls):
        """
        This method expects a list (or series) of lists of words, each being a
        list representation of a document. It returns the documents with bigrams
        applied and the documents with trigrams applied.
        :param ls: a list (or series) of a list of words
        """
        
        # Create bigrams (i.e. train the bigrams)
        bigrams = gensim.models.Phrases(ls, min_count=3, threshold=50)
        bigrams_Phrases= gensim.models.phrases.Phraser(bigrams)

        # Create trigrams (i.e. train the trigrams)
        trigrams = gensim.models.Phrases(bigrams_Phrases[list(ls)], min_count=3, threshold=50) 
        trigram_Phrases = gensim.models.phrases.Phraser(trigrams)

        # Return each document's list representation while considering n-grams
        return [bigrams_Phrases[i] for i in list(ls)],[trigram_Phrases[i] for i in list(ls)]

    def cleanAndCreateGrams(self, ls):
        '''
        This method takes a list (or series) of list representations of documents and cleans each
        one while finding n-grams (bigrams and trigrams)
        :param ls: a list (or series) of a list of words
        '''
        
        return(self.createGrams(ls.apply(lambda x: self.cleanDocument(x)))[0])

    def prepdf(self):
        '''
        This method prepares the review dataframe attached to this object by cleaning
        each review and transforming it into list representation
        '''
        
        self.df['prepped'] = self.cleanAndCreateGrams(self.df[self.review_column])

    def ldaModel(self, x = None):
        '''
        This method runs the LDA model on the column containing the reviews in list
        representation. If the reviews column has not already been prepared, this
        method will prepare it. Optionally, the user can feed in an already prepped
        column to run LDA on.
        :param x: a list of lists of words. Each list is expected to have been prepped
        by removing stopwords and finding n-grams

        :returns: a tuple of the best lda model and the visualisation
        '''

        # if x hasn't been provided, use the prepped column of the dataframe attached to this object
        if x is None:

            # if this dataframe has not been prepared, prepare it
            if 'prepped' not in self.df.columns:
                self.prepdf()

            x = self.df['prepped']

        # Create Dictionary
        self.id2word = gensim.corpora.Dictionary(x)

        # Term Document Frequency
        self.corpus = [self.id2word.doc2bow(text) for text in x]

        # These are to store the performance and the best model
        max_coherence_score = 0
        best_n_topics = -1
        best_model = None

        # Loop through each topic number and check if it has improved the performance
        for i in range(2,6): 
            # Build LDA model
            lda_model = gensim.models.ldamodel.LdaModel(corpus=self.corpus,
                                                       id2word=self.id2word,
                                                       num_topics=i, 
                                                       random_state=100,
                                                       update_every=1,
                                                       chunksize=100,
                                                       passes=10,
                                                       alpha='auto',
                                                       per_word_topics=True)
            
            # Calculate Coherence Score
            coherence_model_lda = gensim.models.CoherenceModel(model=lda_model,
                    texts=x, dictionary=self.id2word, coherence='c_v')
            coherence_lda = coherence_model_lda.get_coherence()

            # If this has the best coherence score so far, save it
            if max_coherence_score < coherence_lda:
                max_coherence_score = coherence_lda
                best_n_topics = i
                best_model = lda_model

            # Print progress
            print('\n The Coherence Score with {} topics is {}'.format(i,coherence_lda))

        # Visualize the topics
        pyLDAvis.enable_notebook()
        vis = pyLDAvis.gensim.prepare(best_model, self.corpus, self.id2word)

        return best_model, vis

    def ldaFromReviews(self):
        '''
        A method to run the LDA model on the reviews dataframe. If the dataframe
        has been prepared for the LDA already, the model is directly run. Otherwise
        the dataframe is prepared first. The resulting model and visualisation is
        attached to this object.
        '''

        # If the dataframe hasn't yet been prepped, prep it
        if 'prepped' not in self.df.columns:
            self.prepdf()

        # Save the model and the visualisation to this object    
        self.ldamodel,self.ldavis = self.ldaModel()

    def generate_wordcloud_from_freq(self): 
        """
        A method to create a wordcloud according to the text frequencies
        attached to this object. Takes into account the stopwords variable
        of this object.
        """
        
        wordcloud = WordCloud(background_color = 'white',
                              relative_scaling = 1.0,
                              stopwords = self.stopwords
                              ).generate_from_frequencies(self.frequency_dict)

        return wordcloud

    def generate_wordcloud(self):
        '''
        This method gets the frequency dictionary from the corpus that
        has already been formed and creates a wordcloud. The corpus is
        an id-frequency list for each document. The resulting frequency
        dictionary is an id-frequency list for the entire corpus.
        '''

        # If there isn't a corpus, run lda
        if self.corpus is None:
            self.ldaFromReviews()

        # Get a frequency list of (id, frequency) tuples for each document in the corpus
        self.freq_list = []
        [self.freq_list.extend(i) for i in self.corpus[:]]

        # Now create a single dictionary with id-frequency key value pairs for all docs
        self.frequency_dict = dict()
        for i,j in self.freq_list:
            key = self.id2word[i]
            if key in self.frequency_dict:
                self.frequency_dict[key] += j
            else:
                self.frequency_dict[key] = j

        # Save wordcloud to the object        
        self.wordCloud = self.generate_wordcloud_from_freq()

    def showWordCloud(self):
        '''
        A method to display the wordcloud
        '''
        return self.wordCloud.to_image()
    

Tagged: Latent Dirichlet Allocation, LDA, Text Analytics
