Determining Gender of a Name with 80% Accuracy Using Only Three Features

Introduction

I thought an easy project for learning machine learning would be guessing the gender of a name from characteristics of the name. After playing around with different features built from the characters of the name, I discovered that you only need THREE features for 80% accuracy, which is pretty impressive. I am by no means an expert at machine learning, so if you see any errors, feel free to point them out.

Example:

```
Name Actual Classified
shea F F
lucero F M
damiyah F F
nitya F F
sloan M M
porter F M
jalaya F F
aubry F F
mamie F F
jair M M
```

Dataset

The names come from the Social Security Administration's (SSA) baby names dataset for the year 2014:
https://www.ssa.gov/oact/babynames/names.zip

Methodology

I took all the baby names from the dataset that were given to at least 20 babies of a given gender, since I found that rarely used name/gender pairs were low quality (for example, there are a few boys named Amy born in 2014).

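```
import numpy as np
# sklearn.cross_validation is the module name in older scikit-learn releases;
# newer ones provide these functions in sklearn.model_selection
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

# Columns: name, gender, count; names are lowercased on load
my_data = np.genfromtxt('names/yob2014.txt', delimiter=',', dtype=[('name','S50'), ('gender','S1'),('count','i4')], converters={0: lambda s:s.lower()})
# Keep only rows with a count of at least 20
my_data = np.array([row for row in my_data if row[2]>=20])
# name_count (defined in the sections below) maps a name to its feature array
name_map = np.vectorize(name_count, otypes=[np.ndarray])
Xlist = name_map(my_data['name'])
X = np.array(Xlist.tolist())
y = my_data['gender']
```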

X is an np.array of shape N * M, where N is the number of names and M is the number of features.
y holds the gender label (M or F) for each name.
name_map is a function that converts a name (string) to an array of features.
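
For readers on Python 3, the loading and filtering step can be sketched with only the standard library. This is a sketch under assumptions rather than the notebook's code: `load_names` and `MIN_COUNT` are names introduced here, and the sample rows are made up for illustration (only the `name,gender,count` layout comes from the dataset).

```python
import csv
import io

MIN_COUNT = 20  # same threshold as the filtering step described above


def load_names(text, min_count=MIN_COUNT):
    """Parse 'name,gender,count' rows and keep the sufficiently common ones.

    Returns a list of (lowercased name, gender) pairs.
    """
    rows = []
    for name, gender, count in csv.reader(io.StringIO(text)):
        if int(count) >= min_count:
            rows.append((name.lower(), gender))
    return rows


# Made-up sample rows in the dataset's layout (counts are illustrative only).
sample = "Emma,F,500\nAmy,M,5\nNoah,M,400\n"
names = load_names(sample)
print(names)  # [('emma', 'F'), ('noah', 'M')]
```

The rare `Amy,M` row is dropped by the threshold, which is the kind of low-quality entry the filter is meant to remove.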

Fitting and Validation

We will split the data randomly into training and testing sets (67%/33%), repeat this five times, and use RandomForest for classification since it generally performs well at classifying data.
```
for x in xrange(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(Xtr, ytr)
    print np.mean(clf.predict(Xte) == yte)
```
By default, RandomForestClassifier sets max_features (the number of features to consider when looking for a split) to sqrt(n_features), which is the recommended value for classification problems (http://scikit-learn.org/stable/modules/ensemble.html#parameters). We will use n_estimators (number of trees) of 100 and min_samples_split (the minimum number of samples required to split an internal node) of 2, both of which we will tune once we have determined a good feature set.
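
What the split-and-score loop measures can be spelled out without scikit-learn. The sketch below is a simplified stand-in written for Python 3, not the library's implementation: `holdout_split` and `accuracy` are hypothetical helpers that mimic a random 67/33 split and the `np.mean(predicted == actual)` score.

```python
import random


def holdout_split(items, test_size=0.33, seed=None):
    """Shuffle a copy of items and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


train, test = holdout_split(range(100), test_size=0.33, seed=0)
print(len(train), len(test))  # 67 33
print(accuracy(["F", "M", "F", "F"], ["F", "M", "M", "F"]))  # 0.75
```

Because each run draws a different random split, the five accuracy values reported in each section differ slightly from one another.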

Picking Features

Character Frequency

My first attempt at features was the frequency of each character:
```
def name_count(name):
    arr = np.zeros(52)
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
    return arr
```
Example: aaabd
freq: [a:3, b:1, d:1]
* Note that we encode freq as an array indexed by letter, e.g. [3, 1, 0, 1, 0, 0, ..., 0]. Most of the array will be zeroes.
Accuracy:
0.690232056125
0.692390717755
0.693739881274
0.688073394495
0.694819212089
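
The frequency encoding can be checked by hand on the `aaabd` example. The sketch below is a plain-Python 3 restatement, not the notebook's code: `char_frequency` is a name introduced here, and it uses a 26-element list rather than the 52-element NumPy array above.

```python
def char_frequency(name):
    """Count occurrences of each letter a-z in a lowercase name."""
    counts = [0] * 26
    for ch in name:
        counts[ord(ch) - ord("a")] += 1
    return counts


vec = char_frequency("aaabd")
print(vec[:5])  # [3, 1, 0, 1, 0]
```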

Character Frequency + Order

Second attempt at features is frequency + ordering:
```
def name_count(name):
    arr = np.zeros(52)
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    return arr
```
Example: aaabc
freq: [a:3, b:1, c:1]
ord: [a:6, b:4, c:5]

We can combine these encodings into one array by storing the order values after the frequency values, at an offset of 26.

Accuracy:
0.766864543983
0.760388559093
0.766864543983
0.76740420939
0.759848893686
We are getting somewhere!
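
To make the frequency + order layout concrete, here is a plain-Python 3 sketch of the same encoding applied to `aaabc`. The function name `frequency_and_order` is an assumption for illustration; the logic mirrors the `name_count` above.

```python
def frequency_and_order(name):
    """Return 52 values: letter counts (0-25), then summed 1-based positions (26-51)."""
    arr = [0] * 52
    for position, ch in enumerate(name, start=1):
        letter = ord(ch) - ord("a")
        arr[letter] += 1
        arr[letter + 26] += position
    return arr


vec = frequency_and_order("aaabc")
print(vec[0:3])    # [3, 1, 1]
print(vec[26:29])  # [6, 4, 5]
```

The order of `a` is 1 + 2 + 3 = 6 because `a` occupies the first three positions.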

Character Frequency + Order + 2-grams

Let’s try adding all the 2-grams in the name as features to see if we can get more info.
```
def name_count(name):
    arr = np.zeros(52+26*26)
    # Iterate each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate every 2 characters
    for x in xrange(len(name)-1):
        # Offset by 52 so 2-gram counts land after the frequency and order features
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    return arr
```
Example: aaabc
freq: [a:3, b:1, c:1]
ord: [a:6, b:4, c:5]
2-gram: [ aa: 2, ab: 1, bc: 1]
We can encode each 2-gram as a two-digit base-26 number, e.g. aa = 0*26 + 0 = 0 and bc = 1*26 + 2 = 28.

Accuracy:

```
0.78548300054
0.771451699946
0.783864004317
0.777388019428
0.77172153265
```

We get a slight increase in accuracy, but I think we can do better.
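
The base-26 encoding of 2-grams can be verified with a short Python 3 sketch. `bigram_index` and `bigram_counts` are hypothetical helper names, and the indices here are relative to the start of the 2-gram block, without any offset into a larger feature array.

```python
def bigram_index(pair):
    """Map a two-letter string to 0..675 by reading it as a base-26 number."""
    first = ord(pair[0]) - ord("a")
    second = ord(pair[1]) - ord("a")
    return first * 26 + second


def bigram_counts(name):
    """Count each adjacent letter pair in a lowercase name."""
    counts = [0] * (26 * 26)
    for i in range(len(name) - 1):
        counts[bigram_index(name[i:i + 2])] += 1
    return counts


print(bigram_index("aa"), bigram_index("bc"))  # 0 28
counts = bigram_counts("aaabc")
print(counts[bigram_index("aa")], counts[bigram_index("ab")])  # 2 1
```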

Character Frequency + Order + 2-grams + Heuristics

Examining the names in more depth, I hypothesized that the length of the name and its last and second-to-last characters could be important.
```
def name_count(name):
    arr = np.zeros(52+26*26+3)
    # Iterate each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate every 2 characters
    for x in xrange(len(name)-1):
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    # Last character
    arr[-3] = ord(name[-1])-ord('a')
    # Second Last character
    arr[-2] = ord(name[-2])-ord('a')
    # Length of name
    arr[-1] = len(name)
    return arr
```
Example: aaabc
freq: [a:3, b:1, c:1]
ord: [a:6, b:4, c:5]
2-gram: [ aa: 2, ab: 1, bc: 1]
last_char: 2 (c, counting from a = 0 as in the code above)
second_last_char: 1 (b)
length: 5
Accuracy:
```
0.801672962763
0.804641122504
0.803022126282
0.801672962763
0.805450620615
```

Fine-tuning

After playing around with n_estimators and min_samples_split, I found good values:
clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
which gives the accuracy:

0.814085267134
0.821370750135
0.818402590394
0.825148407987
0.82245008095

This gives us a small increase in accuracy.

Feature Reduction

Let’s look at the 10 most important features as given by clf.feature_importances_ (one row per run):
```
[728  26 729   0  40  50  30 390  39  37]
[728  26 729  50   0  40  37  30  34 390]
[728  26 729  50  40   0  37  30  39 390]
[728  26 729   0  50  40  30  37 390  39]
[728  26 729   0  50  40  30  37  39  34]
```

These numbers are feature indices, ordered from most to least important:

728 – Last character
26 – Order of a
729 – Second last character
0 – Number of a’s
50 – Order of y
40 – Order of o

It looks like these 6 features are consistently good.
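
The mapping from index to meaning follows from the layout of the 731-element array built in the heuristics version of `name_count`: counts at 0-25, orders at 26-51, 2-grams at 52-727, then last character, second last character and length. A small Python 3 sketch that decodes an index, assuming exactly that layout (`describe_feature` is a hypothetical helper, not part of the notebook):

```python
import string

LETTERS = string.ascii_lowercase


def describe_feature(index):
    """Translate an index in the 731-element feature vector into words."""
    if index < 26:
        return "count of " + LETTERS[index]
    if index < 52:
        return "order of " + LETTERS[index - 26]
    if index < 52 + 26 * 26:
        first, second = divmod(index - 52, 26)
        return "2-gram " + LETTERS[first] + LETTERS[second]
    return ["last character", "second last character", "length"][index - 728]


for i in [728, 26, 729, 0, 50, 40, 390]:
    print(i, describe_feature(i))
```

Under this layout, index 390, which also shows up in several of the top-10 lists, decodes to the 2-gram `na`.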
Let’s see how good the top feature is
```
def name_count(name):
    arr = np.zeros(1)
    arr[0] = ord(name[-1])-ord('a')+1
    return arr
```

Accuracy:

```
0.771451699946
0.7536427415
0.753912574204
0.7536427415
0.760658391797
```

Wow! We actually get 75% accuracy! This means the last letter of a name is really important in determining the gender.
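
To get a feel for why the last letter carries so much signal, here is a deliberately simpler stand-in for the random forest: a majority vote per final letter, written in plain Python 3. `train_last_letter` and `predict_last_letter` are hypothetical helpers, the training rows are the ten sampled names shown in this post, and `maria` and `jaxon` are made-up test names.

```python
from collections import Counter, defaultdict


def train_last_letter(rows):
    """Count genders per final letter; rows are (name, gender) pairs."""
    votes = defaultdict(Counter)
    for name, gender in rows:
        votes[name[-1]][gender] += 1
    return votes


def predict_last_letter(votes, name, default=None):
    """Predict the majority gender seen for the name's final letter."""
    seen = votes.get(name[-1])
    if not seen:
        return default  # letter never seen in training
    return seen.most_common(1)[0][0]


# The ten sampled names from this post, with their actual genders.
rows = [("shea", "F"), ("lucero", "F"), ("damiyah", "F"), ("nitya", "F"),
        ("sloan", "M"), ("porter", "F"), ("jalaya", "F"), ("aubry", "F"),
        ("mamie", "F"), ("jair", "M")]
votes = train_last_letter(rows)
print(predict_last_letter(votes, "maria"))  # F
print(predict_last_letter(votes, "jaxon"))  # M
```

Ten rows are far too few to say anything about accuracy; the point is only that a single lookup on the final letter already separates many names.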

Let’s take the top three features (last character, second last character, and order of a’s) and see how well they do on their own. (But if you already read the title of this blog post, you should know what to expect.)

```
def name_count(name):
    arr = np.zeros(3)
    arr[0] = ord(name[-1])-ord('a')+1
    arr[1] = ord(name[-2])-ord('a')+1
    # Order of a's
    for ind, x in enumerate(name):
        if x == 'a':
            arr[2] += ind+1
    return arr
```

Accuracy:

```
0.798165137615
0.794117647059
0.795736643281
0.801133297356
0.803561791689
```

I would say 80% accuracy from 3 features is pretty good for determining the gender of a name. That's about the same accuracy as a mammogram detecting cancer in a 45-49 year old woman!
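
For reference, this is what the three-feature encoding produces on two short inputs. It is a plain-Python 3 sketch of the same logic: `three_features` is a name introduced here, and like the original it assumes lowercase names of at least two letters.

```python
def three_features(name):
    """Return [last letter, second last letter, summed 1-based positions of 'a']."""
    last = ord(name[-1]) - ord("a") + 1
    second_last = ord(name[-2]) - ord("a") + 1
    order_of_a = sum(pos for pos, ch in enumerate(name, start=1) if ch == "a")
    return [last, second_last, order_of_a]


print(three_features("aaabc"))  # [3, 2, 6]
print(three_features("shea"))   # [1, 5, 4]
```

Note that in this version the letters are numbered from a = 1, unlike the 731-feature version where a = 0.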

Sample Example

We can sample random datapoints to see how well our model is performing:
```
def name_count(name):
    arr = np.zeros(3)
    arr[0] = ord(name[-1])-ord('a')+1
    arr[1] = ord(name[-2])-ord('a')+1
    # Order of a's
    for ind, x in enumerate(name):
        if x == 'a':
            arr[2] += ind+1

    return arr

my_data = np.genfromtxt('names/yob2014.txt',
                        delimiter=',',
                        dtype=[('name','S50'), ('gender','S1'),('count','i4')],
                        converters={0: lambda s:s.lower()})
my_data = np.array([row for row in my_data if row[2]>=20])
name_map = np.vectorize(name_count, otypes=[np.ndarray])
Xname = my_data['name']
Xlist = name_map(Xname)
X = np.array(Xlist.tolist())

y = my_data['gender']

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
clf.fit(Xtr, ytr)

# Pick 10 random names and compare actual vs. predicted gender
idx = np.random.choice(np.arange(len(Xlist)), 10, replace=False)
xs = Xname[idx]
ys = y[idx]
pred = clf.predict(X[idx])

for a,b, p in zip(xs,ys, pred):
    print a,b, p
```

Output:

```
Name Actual Classified
shea F F
lucero F M
damiyah F F
nitya F F
sloan M M
porter F M
jalaya F F
aubry F F
mamie F F
jair M M
```

Conclusion

Many features are good, but finding the important features is better.
If you are unsure of the gender of a name, just look at the last letter, which gives you about a 75% chance of getting it right.

I hope you learned as much from reading this blog post as I did from writing it! (Source: IPython Notebook)

• huba baba
April 5, 2016

Awesome! Great work

• ayoungprogrammer
April 5, 2016

Thanks!

• Anton Eicher
April 5, 2016

Nice work.

Looks like the random forest is basically examining the last 2 letters of the name. If it ends in e.g. 'my' or 'da' (e.g. Tammy or Linda) it is female, and if it ends in e.g. 're' or 'on' (e.g. Andre or Anton) it is male. It's an interesting finding, but I think this makes sense, given that our names mostly derive from Latin, which assigns a gender to each word. For example, words ending in 'o' (like Angelo) are usually masculine, and words ending in 'a' (like Linda) are usually feminine.

• Nelson
April 6, 2016

Nicely done, and thanks for sharing your Python notebook! Your post inspired me to go off on a tangent about what it would mean to share classifiers like the one you trained: https://nelsonslog.wordpress.com/2016/04/06/a-library-of-trained-machine-learning-models/

• Zoom.Quiet
April 22, 2016

NameGenderClassification/Gender Classification of Names.ipynb at master · ayoungprogrammer/NameGenderClassification
https://github.com/ayoungprogrammer/NameGenderClassification/blob/master/Gender%20Classification%20of%20Names.ipynb

• Wilson
May 17, 2017

Nice example and thanks for your work. I am trying to do a similar project but failing. The accuracy is only 65%

• Aniket
October 6, 2017

giving NameError: name ‘xrange’ is not defined in Jupyter notebook

• ayoungprogrammer
December 21, 2017

The notebook is written in Python 2. In Python 3, you can fix this by replacing xrange with range.

• César
October 11, 2017

Hi, I didn't understand the 3rd field in your files, for example Mary,F,7065
In this case, what is “7065”? Regards

• Olah Data Excel dengan Python – BasangData
October 12, 2018

[…] You want to do more with the data you have, for example using artificial intelligence to predict gender from a name. […]

• Domingo Buquet
October 13, 2018