in Artificial Intelligence(AI) & Machine Learning by Goeduhub's Expert (2.1k points)
In this article, we look at what word2vec is and why it matters in text preprocessing: it converts text into vectors that machine learning and deep learning applications can work with. Finally, we implement word2vec in Python.


2 Answers

by Goeduhub's Expert (2.1k points)
Best answer

Prerequisites: Word2Vec theory, NLP

Documentation: GENSIM

GENSIM: GENSIM is an open-source Python library that implements various models and algorithms, including word2vec.

In this tutorial, we will train word2vec embeddings (word2vec is a family of algorithms) on a corpus. The corpus is a passage on the history of India from Wikipedia.

Importing Libraries and models 

# Importing some important libraries
import re

import nltk
from nltk.corpus import stopwords

# Importing word2vec
from gensim.models import Word2Vec


You are probably already familiar with the libraries imported above. Let's take a brief look at each of them.

NLTK: nltk stands for Natural Language Toolkit. It is a widely used NLP library for preprocessing text data (human language) into a format that a machine can process further.

RegEx (re): A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Word2Vec: Converts a word into a vector (embedding) that a machine can use for various purposes (translation, image captioning, etc.).

Stopwords: stopwords is an nltk corpus of very common words that carry little meaning on their own but help a sentence read logically (for example: the, are, of, it, from). We use it to filter such words out of the text.
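As a rough illustration of what the regular-expression and stopword steps do, here is a minimal sketch that uses only the standard library. The sample string and the tiny SAMPLE_STOPWORDS set are made up for this example; the tutorial itself relies on nltk's full English stopword list and tokenizers.

```python
import re

# Hypothetical sample text and a tiny hand-made stopword list
# (assumptions for illustration, not nltk's actual list)
sample = "The Vedas[1] are large   collections of hymns."
SAMPLE_STOPWORDS = {"the", "are", "of", "it", "from"}

# Remove bracketed citation markers such as [1], then collapse repeated whitespace
cleaned = re.sub(r"\[[0-9]*\]", " ", sample)
cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()

# Naive whitespace tokenization, stripping trailing punctuation from each token
tokens = [w.strip(".,") for w in cleaned.split()]

# Keep only the tokens that are not stopwords
filtered = [w for w in tokens if w and w not in SAMPLE_STOPWORDS]

print(cleaned)
print(filtered)
```

Splitting on whitespace is only a stand-in here; nltk's tokenizers handle punctuation and abbreviations far more carefully.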

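To use the "punkt" and "stopwords" packages of nltk, we first have to download them by calling nltk.download('punkt') and nltk.download('stopwords').

Note: An error will occur if you skip this step, so make sure to download these nltk packages. Recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'; the error message names the resource that is missing.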

#loading data (corpus)

corpus = """According to consensus in modern genetics anatomically

 modern humans first arrived on the Indian subcontinent from Africa 

between 73,000 and 55,000 years ago.However, the earliest known 

human remains in South Asia date to 30,000 years ago. 

Settled life, which involves the transition from foraging to farming

 and pastoralism, began in South Asia around 7,000 BCE.

At the site of Mehrgarh, Balochistan, Pakistan, presence can be 

documented of the domestication of wheat and barley, rapidly 

followed by that of goats, sheep, and cattle.

By 4,500 BCE, settled life had spread more widely, and began to

 gradually evolve into the Indus Valley Civilization, an early 

civilization of the Old world, which was contemporaneous with 

Ancient Egypt and Mesopotamia. 

This civilisation flourished between 2,500 BCE and 1900 BCE in what

 today is Pakistan and north-western India, and was noted for its

 urban planning, baked brick houses, elaborate drainage, and water 


In early second millennium BCE persistent drought caused the 

population of the Indus Valley to scatter from large urban centres

 to villages. 

Around the same time, Indo-Aryan tribes moved into the Punjab from

 regions further northwest in several waves of migration.

The resulting Vedic period was marked by the composition of the 

Vedas, large collections of hymns of these tribes whose postulated 

religious culture, through synthesis with the preexisting religious

 cultures of the subcontinent, gave rise to Hinduism. 

The caste system, which created a hierarchy of priests, warriors, 

and free peasants, but which excluded indigenous peoples by labeling

 their occupations impure, arose later during this period.

 Towards the end of the period, around 600 BCE, after the pastoral 

and nomadic Indo-Aryans spread from the Punjab into the Gangetic 

plain, large swaths of which they deforested to pave way for 

agriculture, a second urbanisation took place. 

 The small Indo-Aryan chieftaincies, or janapadas, were 

consolidated into larger states, or mahajanapadas.

This urbanisation was accompanied by the rise of new ascetic 

movements in Greater Magadha, including Jainism and Buddhism, 

which opposed the growing influence of Brahmanism and the primacy

 of rituals, presided by Brahmin priests, that had come to be 

associated with Vedic religion,and gave rise to new religious

"""


Note: In this block we just loaded our data set (history of India from Wikipedia).

by Goeduhub's Expert (2.1k points)

# Preprocessing of raw text
text = re.sub(r'\[[0-9]*\]', ' ', corpus)  # drop citation markers such as [1]
text = re.sub(r'\s+', ' ', text)           # collapse whitespace
text = text.lower()
text = re.sub(r'\d', ' ', text)            # drop digits
text = re.sub(r'\s+', ' ', text)

# Sentence tokenizing
sentences = nltk.sent_tokenize(text)

# Word tokenizing
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# Removing stopwords (the list is loaded once instead of once per word)
stop_words = set(stopwords.words('english'))
sentences = [[word for word in sentence if word not in stop_words]
             for sentence in sentences]

# Printing a sentence (here, the first one)
print(sentences[0])





The printed sentence shows the preprocessed data. In the above block, we first removed bracketed citation markers, digits, and extra whitespace from the text. Note that punctuation such as commas and periods is not removed by these steps, so it still appears as separate tokens after tokenizing.

We also converted all characters to lowercase to avoid unnecessary vocabulary entries.

For example, India and india are the same word, but without lowercasing the vocabulary would treat them as two different entries, which we do not want.

After this cleaning, we split the text into sentences, split each sentence into words, and dropped the stopwords.
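The effect of lowercasing on vocabulary size can be seen in a small sketch. The word list below is invented for illustration and is not taken from the tutorial's corpus.

```python
# Hypothetical mini word list (assumption for illustration)
words = ["India", "india", "Indus", "indus", "valley"]

# Case-sensitive vocabulary: "India" and "india" count as different entries
vocab_raw = set(words)

# Vocabulary after lowercasing: the duplicates collapse into one entry each
vocab_lower = {w.lower() for w in words}

print(len(vocab_raw), len(vocab_lower))
```

The lowercased vocabulary is smaller, so each remaining entry is backed by more training examples.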

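# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1)

# Printing vocab
# (gensim 4.x exposes the vocabulary as key_to_index; older releases used model.wv.vocab)
vocab = list(model.wv.key_to_index)
print(vocab)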





Here, we passed our preprocessed data to word2vec to convert the words into vectors. After training, we printed the vocabulary recognized by word2vec (consensus, modern, and more when you run the program).

These are the unique words that word2vec found in the text.

min_count (int, optional) – Ignores all words with total frequency lower than this.
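To make the effect of min_count concrete, here is a minimal sketch that mimics the documented behaviour with a plain frequency count. It is not gensim's internal code, and the toy sentences and the kept_words helper are hypothetical names invented for this example.

```python
from collections import Counter

# Hypothetical tokenized sentences (assumption for illustration)
toy_sentences = [
    ["indus", "valley", "civilization"],
    ["indus", "valley", "urban", "planning"],
    ["vedic", "period"],
]

def kept_words(sentences, min_count):
    """Return the words whose total frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return sorted(word for word, n in counts.items() if n >= min_count)

print(kept_words(toy_sentences, 1))  # every word survives
print(kept_words(toy_sentences, 2))  # only words seen at least twice survive
```

With min_count=1, as in this tutorial, no word is discarded, which suits a very small corpus; on larger corpora a higher value drops rare words that have too few examples to learn from.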

To see the vector of a word, run the code below. For example, here we look up the vector of the word "jainism".

# Printing the vector of a word
vector = model.wv['jainism']
print(vector)




Note: As you can see, the vector generated by word2vec is dense and has a fixed dimension. This is what I mentioned in the theory article as the solution to the sparse-matrix and overfitting problems of TF-IDF and BOW (bag of words).
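The contrast between a sparse and a dense representation can be sketched as follows. The vocabulary and the dense numbers are invented for illustration; real embedding values come from training, and gensim's Word2Vec uses 100 dimensions by default.

```python
# Hypothetical vocabulary (assumption for illustration)
toy_vocab = ["buddhism", "hymns", "jainism", "religion", "vedas"]

def one_hot(word, vocab):
    """Bag-of-words style vector for one word: one slot per vocabulary entry."""
    return [1 if w == word else 0 for w in vocab]

# Sparse: the length equals the vocabulary size and almost every entry is zero
sparse = one_hot("jainism", toy_vocab)
print(sparse)

# Dense: a fixed, small length no matter how large the vocabulary grows.
# These four numbers are made up, not real word2vec output.
dense = [0.12, -0.48, 0.33, 0.05]
print(len(sparse), len(dense))
```

Adding a word to the vocabulary lengthens every sparse vector, while the dense vectors keep the same length.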

# Similar words of a word
similar_words = model.wv.most_similar('jainism')
print(similar_words)





We know that word2vec captures relations between words, so here is an example. In this block of code we printed the words most similar to "jainism". The similar words returned were: religion (Jainism is a religion), bce (Before Common Era, that is, before AD), and hymns (old songs in praise of God). With a corpus this small, the neighbours can differ from run to run.
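most_similar ranks words by the cosine similarity between their vectors. The sketch below shows that measure in plain Python; the three-dimensional vectors are invented for illustration, and cosine_similarity is a helper written for this example rather than a gensim function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-dimensional vectors (assumption for illustration)
toy_vectors = {
    "jainism": [1.0, 2.0, 0.0],
    "buddhism": [2.0, 4.0, 0.0],  # same direction as "jainism"
    "wheat": [0.0, 0.0, 3.0],     # orthogonal to "jainism"
}

# Rank every other word by its similarity to the query word
query = toy_vectors["jainism"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "jainism"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Vectors pointing in the same direction score close to 1 and unrelated (orthogonal) vectors score 0, which is why "buddhism" ranks above "wheat" in this toy setup.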
