Python : Day 10 – Lesson 10


In [1]:

import tensorflow as tf
print(tf.__version__)

2.0.0


In [2]:

!pip install nltk

Processing c:\users\win10\appdata\local\pip\cache\wheels\de\5e\42\64abaeca668161c3e2cecc24f864a8fc421e3d07a104fc8a51\nltk-3.5-py3-none-any.whl
Collecting tqdm
  Downloading tqdm-4.49.0-py2.py3-none-any.whl (69 kB)
Collecting regex
  Using cached regex-2020.7.14-cp36-cp36m-win_amd64.whl (268 kB)
Collecting joblib
  Using cached joblib-0.16.0-py3-none-any.whl (300 kB)
Collecting click
  Using cached click-7.1.2-py2.py3-none-any.whl (82 kB)
Installing collected packages: tqdm, regex, joblib, click, nltk
Successfully installed click-7.1.2 joblib-0.16.0 nltk-3.5 regex-2020.7.14 tqdm-4.49.0


import csv
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import nltk
# nltk.download('stopwords')  # run once if the stopwords corpus is not already installed
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

In [3]:


Put the hyperparameters at the top like this to make them easier to change and edit.


vocab_size = 5000
embedding_dim = 64
max_length = 200
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'
training_portion = .8

In [4]:


First, let's define two lists containing the articles and their labels. While reading the data in, we also remove the stopwords.


articles = []
labels = []

with open("bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)
    for row in reader:
        labels.append(row[0])
        article = row[1]
        for word in STOPWORDS:
            token = ' ' + word + ' '
            article = article.replace(token, ' ')
            article = article.replace('  ', ' ')  # collapse the double spaces left behind
        articles.append(article)
print(len(labels))
print(len(articles))

In [5]:


2225

2225


There are only 2,225 articles in the data. Next we split them into a training set and a validation set according to the parameter we set earlier: 80% for training and 20% for validation.


train_size = int(len(articles) * training_portion)


train_articles = articles[0: train_size]
train_labels = labels[0: train_size]

validation_articles = articles[train_size:]
validation_labels = labels[train_size:]

print(train_size)
print(len(train_articles))
print(len(train_labels))
print(len(validation_articles))
print(len(validation_labels))

In [6]:


1780

1780

1780

445

445


Tokenizer does all the heavy lifting for us. When tokenizing our articles, it keeps the 5,000 most common words. oov_token puts in a special value whenever an unseen word is encountered; this means we want '<OOV>' to be used for words that are not in the word index. fit_on_texts will go through all the text and create a dictionary like this:


tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_articles)

word_index = tokenizer.word_index

In [7]:


You can see that '<OOV>' is number 1, 'said' is number 2, 'mr' is number 3, and so on.


In [8]:

dict(list(word_index.items())[0:10])

Out[8]:

{'<OOV>': 1,

'said': 2,

'mr': 3,

'would': 4,

'year': 5,

'also': 6,

'people': 7,

'new': 8,

'us': 9,

'one': 10}


This process also cleans up our text: it lowercases everything and removes punctuation.
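
As a quick illustration (added here, not part of the original notebook), the toy example below shows the Tokenizer defaults at work: text is lowercased, punctuation is stripped, and unseen words map to the '<OOV>' index.

demo_tokenizer = Tokenizer(num_words=10, oov_token='<OOV>')
demo_tokenizer.fit_on_texts(["The cat sat on the mat.", "The dog SAT!"])
print(demo_tokenizer.word_index)   # e.g. {'<OOV>': 1, 'the': 2, 'sat': 3, ...}
print(demo_tokenizer.texts_to_sequences(["The bird sat."]))   # 'bird' is unseen, so it becomes 1 (<OOV>)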


After tokenization, the next step is to turn those tokens into lists of sequences.


train_sequences = tokenizer.texts_to_sequences(train_articles)

In [9]:


This is the 11th article in the training data after it has been turned into a sequence.


In [10]:

print(train_sequences[10])

[2431, 1, 225, 4996, 22, 642, 587, 225, 4996, 1, 1, 1663, 1, 1, 2431,
 22, 565, 1, 1, 140, 278, 1, 140, 278, 796, 823, 662, 2307, 1, 1144,
 1694, 1, 1721, 4997, 1, 1, 1, 1, 1, 4738, 1, 1, 122, 4514, 1, 2, 2874,
 1505, 352, 4739, 1, 52, 341, 1, 352, 2172, 3962, 41, 22, 3795, 1, 1,
 1, 1, 543, 1, 1, 1, 835, 631, 2366, 347, 4740, 1, 365, 22, 1, 787,
 2367, 1, 4302, 138, 10, 1, 3666, 682, 3531, 1, 22, 1, 414, 823, 662, 1,
 90, 13, 633, 1, 225, 4996, 1, 599, 1, 1694, 1021, 1, 4998, 808, 1864,
 117, 1, 1, 1, 2974, 22, 1, 99, 278, 1, 1608, 4999, 543, 493, 1, 1443,
 4741, 778, 1320, 1, 1861, 10, 33, 642, 319, 1, 62, 478, 565, 301, 1506,
 22, 479, 1, 1, 1666, 1, 797, 1, 3066, 1, 1365, 6, 1, 2431, 565, 22,
 2971, 4735, 1, 1, 1, 1, 1, 850, 39, 1825, 675, 297, 26, 979, 1, 882,
 22, 361, 22, 13, 301, 1506, 1343, 374, 20, 63, 883, 1096, 4303, 247]




When we train neural networks for NLP, the sequences need to be the same size; that is why we use padding. Our max_length is 200, so we use pad_sequences to make all of our articles the same length of 200. That is why you see that the 1st article, which was 425 tokens long, becomes 200, the 2nd article, which was 192 tokens long, becomes 200, and so on.
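
To make the 'post' behaviour concrete, here is a small sketch (added for illustration, not from the original notebook) that runs pad_sequences on two toy sequences: the short one is padded with zeros at the end, the long one is cut off at the end.

demo_seqs = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10]]
print(pad_sequences(demo_seqs, maxlen=5, padding='post', truncating='post'))
# [[ 1  2  3  0  0]
#  [ 4  5  6  7  8]]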


In [11]:

train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)


print(len(train_sequences[0]))
print(len(train_padded[0]))

print(len(train_sequences[1]))
print(len(train_padded[1]))

print(len(train_sequences[10]))
print(len(train_padded[10]))

In [12]:


425

200

192

200

186

200


In addition, there is a padding type and a truncating type, and both are set to 'post'. This means, for example, that the 11th article, which was 186 tokens long, is padded to 200 by adding 14 zeros at the end.
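
A quick way to verify this (a small check added here, not part of the original output) is to count the zeros in the padded 11th article; since real token ids start at 1, every zero comes from padding.

print(int(np.sum(train_padded[10] == 0)))   # expected: 200 - 186 = 14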


In [13]:

print(train_sequences[10])

[2431, 1, 225, 4996, 22, 642, 587, 225, 4996, 1, 1, 1663, 1, 1, 2431,
 22, 565, 1, 1, 140, 278, 1, 140, 278, 796, 823, 662, 2307, 1, 1144,
 1694, 1, 1721, 4997, 1, 1, 1, 1, 1, 4738, 1, 1, 122, 4514, 1, 2, 2874,
 1505, 352, 4739, 1, 52, 341, 1, 352, 2172, 3962, 41, 22, 3795, 1, 1,
 1, 1, 543, 1, 1, 1, 835, 631, 2366, 347, 4740, 1, 365, 22, 1, 787,
 2367, 1, 4302, 138, 10, 1, 3666, 682, 3531, 1, 22, 1, 414, 823, 662, 1,
 90, 13, 633, 1, 225, 4996, 1, 599, 1, 1694, 1021, 1, 4998, 808, 1864,
 117, 1, 1, 1, 2974, 22, 1, 99, 278, 1, 1608, 4999, 543, 493, 1, 1443,
 4741, 778, 1320, 1, 1861, 10, 33, 642, 319, 1, 62, 478, 565, 301, 1506,
 22, 479, 1, 1, 1666, 1, 797, 1, 3066, 1, 1365, 6, 1, 2431, 565, 22,
 2971, 4735, 1, 1, 1, 1, 1, 850, 39, 1825, 675, 297, 26, 979, 1, 882,
 22, 361, 22, 13, 301, 1506, 1343, 374, 20, 63, 883, 1096, 4303, 247]


print(train_padded[10])

In [14]:


[2431    1  225 4996   22  642  587  225 4996    1    1 1663    1    1
 2431   22  565    1    1  140  278    1  140  278  796  823  662 2307
 ...
  883 1096 4303  247    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]


And the 1st article, which was 425 tokens long, is truncated to 200, with the truncation applied at the end.
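
Because truncating is also 'post', the padded first article should simply be the first 200 tokens of its original sequence. A small sanity check (added here, not in the original notebook):

print((train_padded[0] == np.array(train_sequences[0][:max_length])).all())   # True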


In [15]:

print(train_sequences[0])

[91, 160, 1141, 1106, 49, 979, 755, 1, 89, 1304, 4289, 129, 175, 3654,
 1214, 1195, 1578, 42, 7, 893, 91, 1, 334, 85, 20, 14, 130, 3262, 1215,
 2421, 570, 451, 1376, 58, 3378, 3521, 1661, 8, 921, 730, 10, 844,
 ...
 146, 1, 400, 7, 71, 1749, 1107, 767, 910, 118,
 584, 3380, 1316, 1579, 1, 1602, 7, 893, 77, 77]


print(train_padded[0])

In [16]:


[  91  160 1141 1106   49  979  755    1   89 1304 4289  129  175 3654
 ...
 1717    1  294  756]


Then we do the same for the validation sequences. Note that we should expect more out-of-vocabulary words in the validation articles, because the word index was derived from the training articles only.


In [17]:

validation_sequences = tokenizer.texts_to_sequences(validation_articles)
validation_padded = pad_sequences(validation_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

print(len(validation_sequences))
print(validation_padded.shape)


445

(445, 200)
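
With both splits tokenized, we can quantify the claim about out-of-vocabulary words with a small check (added for illustration, not part of the original notebook): it computes the share of tokens mapped to the <OOV> index in each split.

oov_id = word_index[oov_tok]
train_oov_share = sum(seq.count(oov_id) for seq in train_sequences) / sum(len(seq) for seq in train_sequences)
val_oov_share = sum(seq.count(oov_id) for seq in validation_sequences) / sum(len(seq) for seq in validation_sequences)
print(train_oov_share, val_oov_share)   # the validation share should be noticeably higher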


Now we are going to look at the labels. Because our labels are text, we will tokenize them as well. When training, the labels are expected to be NumPy arrays, so we will turn the list of labels into NumPy arrays like so:


print(set(labels))

In [18]:


{'politics', 'entertainment', 'tech', 'sport', 'business'}


label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)

training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels))

In [19]:



print(training_label_seq[0])
print(training_label_seq[1])
print(training_label_seq[2])
print(training_label_seq.shape)

print(validation_label_seq[0])
print(validation_label_seq[1])
print(validation_label_seq[2])
print(validation_label_seq.shape)

In [20]:


[4]

[2]

[1]

(1780, 1)

[5]

[4]

[3]

(445, 1)
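
The integers above come from label_tokenizer.word_index, which starts counting at 1 (index 0 is never assigned). To see which number corresponds to which category, we can print the mapping; this inspection step is an addition, not part of the original notebook.

print(label_tokenizer.word_index)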


Before training the deep neural network, we want to explore what our original article and the article after padding look like. Running the following code on the 11th article, we can see that some words become '<OOV>' because they did not make it into the top 5,000.


In [22]:

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_article(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_article(train_padded[10]))
print('---')
print(train_articles[10])

berlin <OOV> anti nazi film german movie anti nazi <OOV> <OOV> drawn <OOV> <OOV> berlin film festival <OOV> <OOV> final days <OOV> final days member white rose movement <OOV> 21 arrested <OOV> brother hans <OOV> <OOV> <OOV> <OOV> <OOV> tyranny <OOV> <OOV> director marc <OOV> said feeling responsibility keep legacy <OOV> going must <OOV> keep ideas alive added film drew <OOV> <OOV> <OOV> <OOV> trial <OOV> <OOV> <OOV> east germany secret police discovery <OOV> behind film <OOV> worked closely <OOV> relatives including one <OOV> sisters ensure historical <OOV> film <OOV> members white rose <OOV> group first started <OOV> anti nazi <OOV> summer <OOV> arrested dropped <OOV> munich university calling day <OOV> <OOV> <OOV> regime film <OOV> six days <OOV> arrest intense trial saw <OOV> initially deny charges ended <OOV> appearance one three german films <OOV> top prize festival south african film version <OOV> <OOV> opera <OOV> shot <OOV> town <OOV> language also <OOV> berlin festival film entitled u <OOV> <OOV> <OOV> <OOV> <OOV> story set performed 40 strong music theatre <OOV> debut film performance film first south african feature 25 years second nominated golden bear award ? ? ? ? ? ? ? ? ? ? ? ? ? ?
---
berlin cheers anti-nazi film german movie anti-nazi resistance heroine drawn loud applause berlin film festival. sophie scholl - final days portrays final days member white rose movement. scholl 21 arrested beheaded brother hans 1943 distributing leaflets condemning abhorrent tyranny adolf hitler. director marc rothemund said: feeling responsibility keep legacy scholls going. must somehow keep ideas alive added. film drew transcripts gestapo interrogations scholl trial preserved archive communist east germany secret police. discovery inspiration behind film rothemund worked closely surviving relatives including one scholl sisters ensure historical accuracy film. scholl members white rose resistance group first started distributing anti-nazi leaflets summer 1942. arrested dropped leaflets munich university calling day reckoning adolf hitler regime. film focuses six days scholl arrest intense trial saw scholl initially deny charges ended defiant appearance. one three german films vying top prize festival. south african film version bizet tragic opera carmen shot cape town xhosa language also premiered berlin festival. film entitled u-carmen ekhayelitsha carmen khayelitsha township story set. performed 40-strong music theatre troupe debut film performance. film first south african feature 25 years second nominated golden bear award.


Now we can implement the LSTM model. In the code below I build a tf.keras.Sequential model and start with an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices into sequences of vectors. After training, words with similar meanings often have similar vectors.

Next is the LSTM itself. The Bidirectional wrapper is used with an LSTM layer; it propagates the input forwards and backwards through the LSTM layer and then concatenates the outputs. This helps the LSTM learn long-term dependencies. We then feed that into a dense neural network to do the classification.


model = tf.keras.Sequential([
    # Add an Embedding layer expecting an input vocab of size 5000, and an output embedding dimension of 64.
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),
    # tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    # Use ReLU in place of the tanh function since it is a very good alternative.
    tf.keras.layers.Dense(embedding_dim, activation='relu'),
    # Add a Dense layer with 6 units and softmax activation.
    # With multiple output units, softmax converts the outputs into a probability distribution.
    tf.keras.layers.Dense(6, activation='softmax')
])

model.summary()

In [39]:



Model: "sequential_3"


Layer (type)                 Output Shape              Param #
=================================================================
embedding_3 (Embedding)      (None, None, 64)          320000
bidirectional_3 (Bidirection (None, 128)               66048
dense_6 (Dense)              (None, 64)                8256
dense_7 (Dense)              (None, 6)                 390
=================================================================

Total params: 394,694

Trainable params: 394,694

Non-trainable params: 0



In our model summary, we have the embedding layer and the Bidirectional layer containing the LSTM, followed by two dense layers. The output from the Bidirectional layer is 128 because it doubles what we put into the LSTM. We could also stack LSTM layers, but I found the results worse; a sketch of that variant follows below.
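
For reference, a stacked-LSTM variant would look like the sketch below (illustrative only, not the model trained here). The first Bidirectional LSTM needs return_sequences=True so that it emits one output per timestep for the second LSTM to consume.

stacked_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    # return_sequences=True outputs the whole sequence of hidden states,
    # which the next recurrent layer expects as its input.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(embedding_dim, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax')
])
stacked_model.summary()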


In [40]:

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

num_epochs = 10
history = model.fit(train_padded, training_label_seq, epochs=num_epochs,
                    validation_data=(validation_padded, validation_label_seq), verbose=2)

In [41]:



Train on 1780 samples, validate on 445 samples
Epoch 1/10
1780/1780 - 7s - loss: 1.5802 - accuracy: 0.2955 - val_loss: 1.3228 - val_accuracy: 0.4337
Epoch 2/10
1780/1780 - 5s - loss: 1.0225 - accuracy: 0.5798 - val_loss: 0.8686 - val_accuracy: 0.5820
Epoch 3/10
1780/1780 - 5s - loss: 0.5797 - accuracy: 0.7831 - val_loss: 0.5539 - val_accuracy: 0.8944
Epoch 4/10
1780/1780 - 5s - loss: 0.1793 - accuracy: 0.9646 - val_loss: 0.2454 - val_accuracy: 0.9416
Epoch 5/10
1780/1780 - 5s - loss: 0.1457 - accuracy: 0.9567 - val_loss: 0.3868 - val_accuracy: 0.8494
Epoch 6/10
1780/1780 - 5s - loss: 0.0972 - accuracy: 0.9691 - val_loss: 0.2848 - val_accuracy: 0.9124
Epoch 7/10
1780/1780 - 5s - loss: 0.0431 - accuracy: 0.9848 - val_loss: 0.2873 - val_accuracy: 0.9169
Epoch 8/10
1780/1780 - 5s - loss: 0.0226 - accuracy: 0.9927 - val_loss: 0.2457 - val_accuracy: 0.9416
Epoch 9/10
1780/1780 - 5s - loss: 0.0179 - accuracy: 0.9949 - val_loss: 0.2983 - val_accuracy: 0.9191
Epoch 10/10
1780/1780 - 5s - loss: 0.0094 - accuracy: 0.9983 - val_loss: 0.2793 - val_accuracy: 0.9281


In [42]:

!pip install matplotlib

Requirement already satisfied: matplotlib in c:\users\win10\.conda\envs\tensorflow2\lib\site-packages (3.3.2)
Requirement already satisfied: certifi>=2020.06.20 in c:\users\win10\.conda\envs\tensorflow2\lib\site-packages (from matplotlib) (2020.6.20)
Requirement already satisfied: numpy>=1.15 in c:\users\win10\.conda\envs\tensorflow2\lib\site-packages (from matplotlib) (1.19.1)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\win10\.conda\envs\tensorflow2\lib\site-packages (from matplotlib) (2.8.1)
Requirement already satisfied: cycler>=0.10 in c:\users\win10\.conda\envs\tensorflow2\lib\site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\win10\.conda\envs\tensorflow2\lib\site-packages (from matplotlib) (1.2.0)
Requirement already satisfied: pillow>=6.2.0 in c:\users\win10\.conda\envs\tensorflow2\lib\site-packages (from matplotlib) (7.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in c:\users\win10\.conda\envs\tensorflow2\lib\site-packages (from matplotlib) (2.4.7)
Requirement already satisfied: six>=1.5 in c:\users\win10\.conda\envs\tensorflow2\lib\site-packages (from python-dateutil>=2.1->matplotlib) (1.15.0)


In [ ]:


from matplotlib import pyplot as plt

def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()

plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

In [43]:




txt = ["A WeWork shareholder has taken the company to court over the n seq = tokenizer.texts_to_sequences(txt)

padded = pad_sequences(seq, maxlen=max_length) pred = model.predict(padded)

pred




In [44]:


Out[44]: array([[0.11773098, 0.11342432, 0.05740624, 0.43609414, 0.12342227,

0.15192208]], dtype=float32)


In [48]:

txt = ["A WeWork shareholder has taken the company to court over the n seq = tokenizer.texts_to_sequences(txt)

padded = pad_sequences(seq, maxlen=max_length) pred = model.predict(padded)

labels = ['sport', 'bussiness', 'politics', 'tech', 'entertainment','u print(pred, labels[np.argmax(pred)])


[[0.11773098 0.11342432 0.05740624 0.43609414 0.12342227 0.15192208]]

tech


np.argmax(pred)

In [49]:

Out[49]: 3
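
Because label_tokenizer assigns category indices starting at 1, a more robust way to turn a prediction into a category name is to invert label_tokenizer.word_index rather than hard-coding a list; a minimal sketch (added here for illustration):

id_to_label = {v: k for k, v in label_tokenizer.word_index.items()}
print(id_to_label.get(np.argmax(pred), '<unused>'))   # index 0 never corresponds to a category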


In [ ]: