Determining Gender of a Name with 80% Accuracy Using Only Three Features

April 4, 2016

Introduction

I thought guessing the gender of a name from its characters would be an easy project for learning machine learning. After experimenting with different ways of encoding a name's characters as features, I discovered that only THREE features are needed for about 80% accuracy, which is pretty impressive. I am by no means an expert at machine learning, so if you see any errors, feel free to point them out.

Example:

Name     Actual  Classified
shea     F       F
lucero   F       M
damiyah  F       F
nitya    F       F
sloan    M       M
porter   F       M
jalaya   F       F
aubry    F       F
mamie    F       F
jair     M       M

(Click here for Source: IPython Notebook)

Dataset

The names come from the Social Security Administration's (SSA) baby names dataset for the year 2014.

https://www.ssa.gov/oact/babynames/names.zip

Methodology

I kept every name and gender entry from the dataset that had at least 20 babies, since I found that rarely used names were low quality (for example, there are a few boys named Amy born in 2014).

Loading

Code for loading data from dataset into numpy arrays ready for machine learning

import numpy as np
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

my_data = np.genfromtxt('names/yob2014.txt',
                        delimiter=',',
                        dtype=[('name','S50'), ('gender','S1'),('count','i4')],
                        converters={0: lambda s:s.lower()})
my_data = np.array([row for row in my_data if row[2]>=20])
name_map = np.vectorize(name_count, otypes=[np.ndarray])
Xlist = name_map(my_data['name'])
X = np.array(Xlist.tolist())
y = my_data['gender']

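X is an np.array of shape N × M, where N is the number of names and M is the number of features.
y holds the gender label of each name, either M or F.
name_map is a vectorized version of name_count, the function that converts a name (string) to an array of features. The different versions of name_count are defined in the sections below.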
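For readers on Python 3, the same load-and-filter step can be sketched with only the standard library. This is a minimal sketch and not the notebook's code: the helper name load_names and the sample rows are made up for illustration (the counts are invented), and it returns plain lists instead of numpy arrays.

```python
import csv
import io

def load_names(lines, min_count=20):
    """Parse 'name,gender,count' rows and keep entries with count >= min_count."""
    names, genders = [], []
    for name, gender, count in csv.reader(lines):
        if int(count) >= min_count:
            names.append(name.lower())  # lowercase, like the converter above
            genders.append(gender)
    return names, genders

# Hypothetical rows in the same layout as yob2014.txt (counts are invented).
sample = io.StringIO("Emma,F,500\nAmy,M,5\nNoah,M,450\n")
names, genders = load_names(sample)
print(names, genders)
```

With the real file, something like load_names(open('names/yob2014.txt')) should behave the same way, assuming the file keeps the name,gender,count layout.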

Fitting and Validation

We will repeatedly split the data at random into a training set and a held-out test set (one third of the names), and classify with a random forest (RandomForestClassifier), since it usually performs well on this kind of data with little tuning.

for x in xrange(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(Xtr, ytr)
    print np.mean(clf.predict(Xte) == yte)

By default, RandomForestClassifier sets max_features (the number of features to consider when looking for the best split) to sqrt(n_features), which is the recommended value for classification problems (http://scikit-learn.org/stable/modules/ensemble.html#parameters). We will use n_estimators (number of trees) of 100 and a min_samples_split (the minimum number of samples required to split an internal node) of 2, and tune both once we have settled on a good feature set.
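To make the evaluation procedure concrete, here is a library-free Python 3 sketch of one "shuffle, hold out a third, fit, score" round. It is only an illustration: holdout_accuracy and fit_majority are hypothetical helpers, the toy data is invented, and the stand-in classifier simply predicts the most common training label rather than growing a random forest.

```python
import random

def holdout_accuracy(X, y, fit, test_size=0.33, seed=None):
    """Shuffle, hold out a fraction for testing, fit on the rest, return accuracy."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = int(round(len(X) * test_size))
    test, train = idx[:n_test], idx[n_test:]
    predict = fit([X[i] for i in train], [y[i] for i in train])
    hits = sum(predict(X[i]) == y[i] for i in test)
    return hits / n_test

def fit_majority(Xtr, ytr):
    """Stand-in classifier (assumption): always predict the most common label."""
    label = max(set(ytr), key=ytr.count)
    return lambda x: label

# Toy data (invented): 9 of 12 labels are 'F'.
X = [[i] for i in range(12)]
y = ['F'] * 9 + ['M'] * 3
scores = [holdout_accuracy(X, y, fit_majority, seed=s) for s in range(5)]
print(scores)
```

Each score is the fraction of held-out names that were classified correctly, which is the same quantity the np.mean(...) line above prints.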

Picking Features

Character Frequency

My first attempt at features was the frequency of each character:

def name_count(name):
    arr = np.zeros(52)
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
    return arr

Example:

aaabd
freq: [a:3, b:1, d:1]

* Note that we encode freq as an array indexed by letter (a = 0, b = 1, ...), e.g. [3, 1, 0, 1, 0, 0, ..., 0]. Most of the array will be zeroes.
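The example above can be checked with a few lines of Python 3. This is a small sketch using a plain list of 26 counters instead of a numpy array; char_frequency is a name chosen for illustration, and it assumes the name contains only lowercase a-z.

```python
def char_frequency(name):
    """Count how often each letter a-z occurs in a lowercase name."""
    counts = [0] * 26
    for ch in name:
        counts[ord(ch) - ord('a')] += 1
    return counts

features = char_frequency("aaabd")
print(features[:5])  # counts for a, b, c, d, e -> [3, 1, 0, 1, 0]
```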

Accuracy:
0.690232056125
0.692390717755
0.693739881274
0.688073394495
0.694819212089

Not bad for simple features.

Character Frequency + Order

Second attempt at features is frequency + ordering:

def name_count(name):
    arr = np.zeros(52)
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    return arr

Example: aaabc
freq: [a:3, b:1, c:1]
ord: [a:6, b:4, c:5]

We can combine these encodings in a single array by concatenating them: the frequency counts fill indices 0-25, and the order sums are offset by 26 so they fill indices 26-51.
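Here is the aaabc example worked through in Python 3. It is a sketch with plain lists rather than numpy, and frequency_and_order is an illustrative name; positions are 1-based, matching the ind+1 in the code above.

```python
def frequency_and_order(name):
    """Return 26 letter counts followed by 26 sums of 1-based letter positions."""
    counts = [0] * 26
    order = [0] * 26
    for pos, ch in enumerate(name, start=1):
        i = ord(ch) - ord('a')
        counts[i] += 1
        order[i] += pos
    return counts + order  # counts at 0-25, order sums at 26-51

feats = frequency_and_order("aaabc")
print(feats[0:3], feats[26:29])  # [3, 1, 1] [6, 4, 5]
```

The a's sit at positions 1, 2 and 3, so their order value is 1 + 2 + 3 = 6.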

Accuracy:
0.766864543983
0.760388559093
0.766864543983
0.76740420939
0.759848893686

We are getting somewhere!

Character Frequency + Order + 2-grams

Let’s try adding all the 2-grams in the name as features to see if we can get more info.

def name_count(name):
    arr = np.zeros(52+26*26)
    # Iterate each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate every 2 characters
    for x in xrange(len(name)-1):
        # Offset by 52 so 2-grams do not overwrite the frequency and order slots
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    return arr

Example: aaabc

freq: [a:3, b:1, c:1]
ord: [a:6, b:4, c:5]
2-gram: [ aa: 2, ab: 1, bc: 1]

We can encode each 2-gram as a two-digit base-26 number (a = 0, ..., z = 25), e.g. aa = 0*26 + 0 = 0 and bc = 1*26 + 2 = 28.
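The base-26 encoding and the 2-gram counts for aaabc can be reproduced with this Python 3 sketch. The helper names bigram_index and bigram_counts are made up for illustration, and the index shown is the raw base-26 value before any offset into a larger feature array.

```python
def bigram_index(pair):
    """Encode a two-letter string as a base-26 number (a = 0, ..., z = 25)."""
    first, second = (ord(c) - ord('a') for c in pair)
    return first * 26 + second

def bigram_counts(name):
    """Count every pair of adjacent letters in the name."""
    counts = {}
    for i in range(len(name) - 1):
        pair = name[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    return counts

print(bigram_index("aa"), bigram_index("bc"), bigram_index("zz"))  # 0 28 675
print(bigram_counts("aaabc"))  # {'aa': 2, 'ab': 1, 'bc': 1}
```

Since zz maps to 675, the 26*26 = 676 slots are exactly enough for every possible pair.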

Accuracy:

0.78548300054
0.771451699946
0.783864004317
0.777388019428
0.77172153265

We get a slight increase in accuracy, but I think we can do better.

Character Frequency + Order + 2-grams + Heuristics

Examining the names in more depth, I hypothesized that the length of the name and its last and second-to-last characters could be important.

def name_count(name):
    arr = np.zeros(52+26*26+3)
    # Iterate each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate every 2 characters
    for x in xrange(len(name)-1):
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    # Last character
    arr[-3] = ord(name[-1])-ord('a')
    # Second last character
    arr[-2] = ord(name[-2])-ord('a')
    # Length of name
    arr[-1] = len(name)
    return arr

Example: aaabc

freq: [a:3, b:1, c:1]
ord: [a:6, b:4, c:5]
2-gram: [ aa: 2, ab: 1, bc: 1]
last_char: 2 (c)
second_last_char: 1 (b)
length: 5

Accuracy:

0.801672962763
0.804641122504
0.803022126282
0.801672962763
0.805450620615

Fine-tuning

After playing around with n_estimators and min_samples_split, I found good values:

clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)

which gives the accuracy:

0.814085267134
0.821370750135
0.818402590394
0.825148407987
0.82245008095

This is a small increase in accuracy.

Feature Reduction

Let’s look at the 10 most important features as given by clf.feature_importances_:

[728  26 729   0  40  50  30 390  39  37]
[728  26 729  50   0  40  37  30  34 390]
[728  26 729  50  40   0  37  30  39 390]
[728  26 729   0  50  40  30  37 390  39]
[728  26 729   0  50  40  30  37  39  34]

These numbers are feature indices; each row lists them from most to least important.

728 – Last character

26 – Order of a

729 – Second last character

0 – Number of a’s

50 – order of y

40 – order of o

It looks like these 6 features are consistently good.
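Decoding indices by hand gets tedious, so here is a Python 3 sketch that maps an index back to a description and ranks features by importance. It assumes the 731-slot layout of the heuristics version above (26 counts, 26 order sums, 676 2-grams, then last character, second last character and length); describe_feature and top_k are hypothetical helpers, not part of scikit-learn.

```python
import string

LETTERS = string.ascii_lowercase

def describe_feature(index):
    """Map an index in the assumed 731-slot feature array to a description."""
    if index < 26:
        return "count of '%s'" % LETTERS[index]
    if index < 52:
        return "order of '%s'" % LETTERS[index - 26]
    if index < 52 + 26 * 26:
        first, second = divmod(index - 52, 26)
        return "2-gram '%s%s'" % (LETTERS[first], LETTERS[second])
    return ["last character", "second last character", "length"][index - 728]

def top_k(importances, k=10):
    """Indices of the k largest importances, most important first."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return ranked[:k]

for i in [728, 26, 729, 0, 50, 40, 390]:
    print(i, describe_feature(i))
```

Under that layout, index 390 decodes to the 2-gram 'na', and top_k(clf.feature_importances_) would give a ranking like the rows shown above.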

Let’s see how good the top feature is

def name_count(name):
 arr = np.zeros(1)
 arr[0] = ord(name[-1])-ord('a')+1
 return arr

Accuracy:

0.771451699946
0.7536427415
0.753912574204
0.7536427415
0.760658391797

Wow! We actually get 75% accuracy! This means the last letter of a name is really important in determining the gender.
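One way to build intuition for why a single letter carries so much signal is a plain lookup table: for each last letter, remember which gender was more common in the training names. Note that this is a different and much simpler technique than the random forest used in this post, the sample names and labels below are invented for illustration, and fit_last_letter is a hypothetical helper.

```python
from collections import Counter, defaultdict

def fit_last_letter(names, genders):
    """For each last letter, record the most common gender among the names."""
    votes = defaultdict(Counter)
    for name, gender in zip(names, genders):
        votes[name[-1]][gender] += 1
    return {letter: c.most_common(1)[0][0] for letter, c in votes.items()}

# Invented sample, not taken from the SSA data.
names = ["linda", "emma", "sofia", "anton", "jason", "megan", "noah"]
genders = ["F", "F", "F", "M", "M", "F", "M"]
table = fit_last_letter(names, genders)
print(table["a"], table["n"])  # F M
```

In this toy sample megan is misclassified, since most names ending in n are labeled M. Errors of the same kind are to be expected from any model that only sees the last letter.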

Let’s take the top three features (last and second last character and order of a’s) and see the importance of these. (But if you already read the title of this blog post, you should know what to expect.)

def name_count(name):
    arr = np.zeros(3)
    arr[0] = ord(name[-1])-ord('a')+1
    arr[1] = ord(name[-2])-ord('a')+1
    # Order of a's
    for ind, x in enumerate(name):
        if x == 'a':
            arr[2] += ind+1
    return arr
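As a quick worked example, here is the same three-feature encoding in Python 3 with plain lists. It is a sketch only; three_features is an illustrative name, and like the function above it assumes lowercase names of at least two letters.

```python
def three_features(name):
    """Last letter, second last letter (both 1-based), and order of a's."""
    last = ord(name[-1]) - ord('a') + 1
    second_last = ord(name[-2]) - ord('a') + 1
    order_of_a = sum(pos for pos, ch in enumerate(name, start=1) if ch == 'a')
    return [last, second_last, order_of_a]

print(three_features("linda"))  # [1, 4, 5]
print(three_features("anton"))  # [14, 15, 1]
```

For linda, the last letter a encodes to 1, the second last letter d to 4, and the only a sits at position 5.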

Accuracy:

0.798165137615
0.794117647059
0.795736643281
0.801133297356
0.803561791689

I would say 80% accuracy from 3 features is pretty good for determining the gender of a name. That's about the same accuracy as a mammogram detecting cancer in a 45-49 year old woman!

Sample Example

We can sample random datapoints to see how well our model is performing:

def name_count(name):
    arr = np.zeros(3)
    arr[0] = ord(name[-1])-ord('a')+1
    arr[1] = ord(name[-2])-ord('a')+1
    # Order of a's
    for ind, x in enumerate(name):
        if x == 'a':
            arr[2] += ind+1
    return arr

my_data = np.genfromtxt('names/yob2014.txt', 
 delimiter=',', 
 dtype=[('name','S50'), ('gender','S1'),('count','i4')],
 converters={0: lambda s:s.lower()})
my_data = np.array([row for row in my_data if row[2]>=20])
name_map = np.vectorize(name_count, otypes=[np.ndarray])
Xname = my_data['name']
Xlist = name_map(Xname)
X = np.array(Xlist.tolist())

y = my_data['gender']

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
clf.fit(Xtr, ytr)

idx = np.random.choice(np.arange(len(Xlist)), 10, replace=False)
xs = Xname[idx]
ys = y[idx]
pred = clf.predict(X[idx])

for a,b, p in zip(xs,ys, pred):
 print a,b, p

Output:

Name     Actual  Classified
shea     F       F
lucero   F       M
damiyah  F       F
nitya    F       F
sloan    M       M
porter   F       M
jalaya   F       F
aubry    F       F
mamie    F       F
jair     M       M

Conclusion

Many features are good, but finding important features is better.

If you are unsure of the gender of a name, just look at the last letter, which gives you about a 75% chance of getting it right.

I hope you learned as much from reading this blog post as I did from writing it!

(Click here for Source: IPython Notebook)

huba baba

April 5, 2016

Awesome! Great work.

ayoungprogrammer

April 5, 2016


Thanks!

Anton Eicher

Nice work.

Looks like the random forest is basically examining the last 2 letters of the name. If it ends in e.g. 'my' or 'da' (Tammy or Linda) it is female, and if it ends in e.g. 're' or 'on' (Andre or Anton) it is male. It's an interesting finding, but I think this makes sense, given that our names mostly derive from Latin, which assigns a gender to each word. For example, words ending in 'o' (like Angelo) are usually masculine, and words ending in 'a' (like Linda) are usually feminine.

Nelson

April 6, 2016

Nicely done, and thanks for sharing your Python notebook! Your post inspired me to go off on a tangent about what it would mean to share classifiers like the one you trained: https://nelsonslog.wordpress.com/2016/04/06/a-library-of-trained-machine-learning-models/

Zoom.Quiet

April 22, 2016

Click here for Source: IPython Notebook

The link is lost; I guess it is:
NameGenderClassification/Gender Classification of Names.ipynb at master · ayoungprogrammer/NameGenderClassification
https://github.com/ayoungprogrammer/NameGenderClassification/blob/master/Gender%20Classification%20of%20Names.ipynb

Wilson

May 17, 2017

Nice example and thanks for your work. I am trying to do a similar project but failing. The accuracy is only 65%.

Aniket

October 6, 2017

I am getting NameError: name 'xrange' is not defined in Jupyter notebook

ayoungprogrammer

December 21, 2017


The notebook is written in Python 2. In Python 3, you can fix this by replacing xrange with range.

César

October 11, 2017

Hi, I didn't understand the 3rd feature in your files, for example Mary,F,7065
In this case, what is “7065”? Regards

ayoungprogrammer

December 21, 2017


7065 is the encoded trigram of letters in base 26. Example: https://www.minus40.info/sky/alphabetcountdec.html

asha

November 28, 2018


Can you please explain the above encoding for Mary?

Olah Data Excel dengan Python – BasangData

October 12, 2018

[…] You want to do more with the data you have, for example using Artificial Intelligence to predict gender from a name. […]

Domingo Buquet

October 13, 2018

Thanks for this article, keep it up!
