A Data Science Central Community
With text analytics, various burning questions around the ‘why’ and ‘what’ of a piece or group of content can be answered. Social media chatter around a brand, for example, can have a rapidly spiraling impact (remember the post which showed a Kentucky man being violently removed from his United Airlines seat on an overbooked flight, and how it led to a social media disaster for the airline?). Hence there need to be ways to make sense of unstructured data from diverse sources.
This is where text analytics steps in.
Text analytics takes in such unstructured data, extracts relevant information, and structures it for further actions or decision making. In addition to social media data, other examples include e-mail messages, call center notes, and customer records. It helps extract different types of information like:
Which tools/algorithms are most popular for text analytics?
Text analytics has gone mainstream because more than a handful of tools and applications are available today to derive value from it. Let’s have a look at a few popular ones:
Decision Trees
This type of classifier repeatedly splits data into smaller groups or classes. It comes in handy for tasks like classification or regression. Popular algorithms in decision trees include:
Naive Bayes
This is a popular technique to classify text and documents based on a category (for example, whether to classify a document as Sport or as Political based on the occurrence of certain words). It is a simple way to assign class or category labels to instances or cases.
Rather than being a single distinct algorithm, it is a set of algorithms that work on one underlying principle: “the value of a given feature is independent of the value of any other feature”, given the class label.
Practical applications include marking email as spam or not spam, and assessing a piece of content as positive or negative. It is also used in facial recognition software.
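To make the independence principle concrete, here is a minimal sketch of a word-count classifier written from scratch. The training sentences, the function names `train_naive_bayes` and `predict`, and the choice of add-one smoothing are all assumptions made for illustration; they are not taken from any particular library.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs is a list of (text, label) pairs; returns the fitted counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def predict(text, model):
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # log P(class) plus the sum of log P(word | class): each word is
        # treated as independent of the others once the class is fixed
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen during training
            # add-one smoothing so an unseen (word, class) pair is not zero
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set
training = [
    ("the team won the match", "Sport"),
    ("the striker scored a goal", "Sport"),
    ("the senate passed the bill", "Political"),
    ("the minister won the election", "Political"),
]
model = train_naive_bayes(training)
print(predict("the striker won the match", model))  # Sport
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer documents.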
Support Vector Machines
This is a supervised machine learning algorithm that can be applied to classification and regression problems. Its essential component is the kernel trick, which handles data that is not linearly separable by replacing the inner products between feature vectors with a kernel function; this implicitly maps the data into a higher-dimensional space where a linear boundary can separate it. It is used in hypertext categorization, classification of images, and facial recognition applications.
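The kernel trick is easier to see with numbers. The sketch below is an illustration only, assuming a degree-2 polynomial kernel and two arbitrary 2-D points. It checks that the kernel, computed in the original space, gives the same value as an ordinary dot product taken after explicitly mapping the points into three dimensions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2, computed in the original 2-D space."""
    return dot(x, y) ** 2

def explicit_map(x):
    """The 3-D feature map that the kernel above corresponds to."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

# Two arbitrary example points
x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(x, y)
mapped_value = dot(explicit_map(x), explicit_map(y))

# Both values agree up to floating-point rounding
print(kernel_value, mapped_value)
```

An SVM only ever needs these pairwise values, so it never has to build the higher-dimensional vectors itself.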
K Nearest Neighbors
k-NN is used in search applications where you are looking for items similar to a given one. You determine similarity by creating a vector representation of the items and then comparing how similar or dissimilar they are using a distance metric like Euclidean distance.
The best example of k-NN’s prowess is an e-commerce site’s product recommendation feature. You can also utilize k-NN to do Concept Search (finding semantically similar documents).
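As a rough sketch of the idea, the snippet below ranks a handful of items by Euclidean distance to a query vector and returns the closest ones. The product names and feature values are hypothetical, chosen only to show the mechanics.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbors(query, items, k):
    """Return the names of the k items whose vectors are closest to the query."""
    ranked = sorted(items, key=lambda name: euclidean(query, items[name]))
    return ranked[:k]

# Hypothetical product vectors: (price in hundreds, screen size in inches, weight in kg)
products = {
    "laptop_a": (9.0, 15.6, 2.1),
    "laptop_b": (9.5, 15.6, 2.0),
    "tablet_a": (4.0, 10.1, 0.5),
    "phone_a": (7.0, 6.1, 0.2),
}
query = (9.2, 15.0, 1.9)
print(nearest_neighbors(query, products, k=2))  # ['laptop_a', 'laptop_b']
```

In practice the features would usually be scaled to comparable ranges first, since a feature with large raw values otherwise dominates the distance.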
Artificial Neural Networks
ANNs are primarily used for classification with non-linear decision boundaries. Much like the working of the human brain, an ANN operates on hidden units (which correspond to the neurons in the brain). The following three forms of algorithms can help in training an ANN:
Image compression, handwriting analysis, and stock exchange movement prediction are some areas where ANNs prove useful. They examine a huge volume of information and help make quick decisions.
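To illustrate why hidden units matter for non-linear boundaries, here is a tiny sketch of a network that computes XOR, a function no single straight line can separate. The weights are picked by hand for the illustration rather than learned by a training algorithm, and the step activation is a simplifying assumption.

```python
def step(value):
    """Threshold activation: the unit fires (1) only for positive input."""
    return 1 if value > 0 else 0

def forward(x1, x2):
    # Hidden layer: one unit behaves like OR, the other like AND
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output unit: fires when OR is on but AND is off, which is XOR
    return step(1.0 * h_or - 1.0 * h_and - 0.5)

outputs = {(a, b): forward(a, b) for a in (0, 1) for b in (0, 1)}
print(outputs)  # {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

Each hidden unit draws one straight line through the input space; the output unit combines the two lines into a boundary that is no longer linear.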
Fuzzy C-Means Clustering
This is a useful form of clustering that can add value when there are items that can be a part of more than one cluster. It works on the principle that after the clustering is over, all items in a cluster are as similar as possible to each other. Additionally, they will be as dissimilar to items in other clusters as possible.
It comprises the steps below (similar to k-means clustering):
Disciplines like bioinformatics, healthcare, and economics make use of fuzzy c-means with great success. In image analysis too, it overcomes the limitations of traditional k-means clustering (noise, shadowing, camera variations, etc.) to do better image processing.
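The following is a minimal sketch of fuzzy c-means on one-dimensional data, using the standard membership and center update formulas. The data points, the starting centers, and the fuzziness value `m=2.0` are assumptions for the illustration. Unlike k-means, every point receives a degree of membership in every cluster.

```python
def memberships_for(points, centers, m):
    """Degree (0 to 1) to which each point belongs to each cluster; rows sum to 1."""
    exponent = 2.0 / (m - 1.0)
    table = []
    for x in points:
        distances = [abs(x - c) for c in centers]
        if 0.0 in distances:
            # A point sitting exactly on a center belongs fully to that center
            row = [1.0 if d == 0.0 else 0.0 for d in distances]
            total = sum(row)
            row = [value / total for value in row]
        else:
            row = [
                1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
                for d_i in distances
            ]
        table.append(row)
    return table

def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    centers = list(centers)
    for _ in range(iterations):
        table = memberships_for(points, centers, m)
        # Move each center to the membership-weighted mean of all points
        for i in range(len(centers)):
            weights = [row[i] ** m for row in table]
            centers[i] = sum(w * x for w, x in zip(weights, points)) / sum(weights)
    return centers, memberships_for(points, centers, m)

# Hypothetical data: two tight groups and one point halfway between them
points = [0.8, 1.0, 1.2, 3.0, 4.8, 5.0, 5.2]
centers, table = fuzzy_c_means(points, centers=(0.0, 6.0))
for x, row in zip(points, table):
    print(x, [round(value, 2) for value in row])
```

The point at 3.0 ends up split roughly evenly between the two clusters, which is exactly the situation a hard assignment in k-means cannot express.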
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) treats each document as a mixture of topics and each topic as a distribution over words, which helps to discover the topics that run through a collection of text. A small example of how LDA helps in topic clustering is as below:
Suppose there are three separate sentences
With LDA, topic clustering for these 3 sentences is done as follows:
Sentence 1 = 100% Topic B
Sentence 2 = 100% Topic A
Sentence 3 = 33% Topic A and 67% Topic B
Now we classify based on the words in the sentence. We can propose that there are two clusters – Pets (Topic A) and Food (Topic B).
This example finally boils down to the below steps:
This helps in ensuring coherent topic clustering.
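The percentages in the example can be read as the share of a sentence's words assigned to each topic. The sketch below shows only that final bookkeeping step, not LDA inference itself. The words and their topic assignments are hypothetical, picked so that the result matches the 33% / 67% split quoted for Sentence 3.

```python
from collections import Counter

def topic_mixture(word_topics):
    """Turn per-word topic assignments into each topic's share of the sentence."""
    counts = Counter(topic for _, topic in word_topics)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}

# Hypothetical assignments for a sentence with three content words:
# one word linked to Topic A (Pets) and two linked to Topic B (Food)
sentence_3 = [("kitten", "A"), ("eats", "B"), ("fish", "B")]
mixture = topic_mixture(sentence_3)
print({topic: round(share * 100) for topic, share in mixture.items()})  # {'A': 33, 'B': 67}
```

A full LDA implementation would arrive at the per-word assignments iteratively, revisiting each word based on how the rest of the corpus is currently assigned.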
With these tools, your text analytics objectives can be met with favorable outcomes. Do let us know which one is your favorite text analytics tool in the comments box below.
Article contributed by PromptCloud, a pioneer in large-scale and custom web data extraction services.