A Natural Language Query Engine without Machine Learning

NLPOctober 7, 2016

What is this?

NLQuery is a natural language engine that will answer questions asked in natural language form.

Demo: http://nlquery.ayoungprogrammer.com

Source: http://ayoungprogrammer.github.com/nlquery

Example:

Input: Who is Obama married to?

Output: Michelle Obama

More examples:

Who is Obama? 44th President of the United States
How tall is Yao Ming? 2.286m
Where was Obama born? Kapiolani Medical Center for Women and Children
When was Obama born? August 04, 1961
Who did Obama marry? Michelle Obama
Who is Obama's wife? Michelle Obama
Who is Barack Obama's wife? Michelle Obama
Who was Malcolm Little known as? Malcolm X
What is the birthday of Obama? August 04, 1961
What religion is Obama? Christianity
Who did Obama marry? Michelle Obama
How many countries are there? 196
Which countries have a population over 1000000000? People's Republic of China, India
Which books are written by Douglas Adams? The Hitchhiker's Guide to the Galaxy, ...
Who was POTUS in 1945? Harry S. Truman
Who was Prime Minister of Canada in 1945? William Lyon Mackenzie King
Who was CEO of Apple Inc in 1980? Steve Jobs

Why no machine learning?

Because a labelled dataset for search queries is hard to find and I wanted to see how well my matching library would work. There are finite amount of grammar rules even though there are an infinite amount of queries and we can build a system that matches these rules. It works surprisingly well and is able to handle many different types of queries, however there were some slight hacks I needed to do handle some queries.

How does it work?

The engine first converts the natural language query to a parse tree, interprets the query into a context and then uses the context to perform a SPARQL query on WikiData. Below is an example of the whole flow:

Raw Input

Example of the raw input query string from a user:

"Who is Obama's wife?"

We can do some simple preprocessing to add punctuation and capitalization to the raw input to make it easier to parse in the next step.

Parse Tree

We take the preprocessed string and get the parse tree of the sentence from the Stanford CoreNLP Parser:

(SBARQ
  (WHNP (WP Who))
  (SQ (VBZ is) (NP (NP (NNP Obama) (POS 's)) (NN wife)))
  (. ?))

This parse tree represents the grammatical structure of the sentence and from this we can match the grammar rules to extract the context.

Context

We can convert the grammar parse tree to context parameters by matching the tree with rules. We can doing this using my library for matching parse trees: Lango.

{
    "( SQ ( VP ( VBZ/VBD/VBP:action-o ) ( NP:subj_t ) ) )": {
        subj_t: "( NP ( NP:subject-o ( NNP ) ( POS ) ) ( NN/NNS:prop-o )"
    }
}

This grammar rule matches the parse tree and we can extract some context from the corresponding symbols in the rule.

{
  "prop":"wife",
  "qtype":"who",
  "subject":"obama"
}

We have the subject “Obama”, the property “wife” and the question type “who”. Once we have the contextual parameters of the query, we can construct a SPARQL query to query the WikiData database.

WIkidata SPARQL Query

Wikidata is a free and open knowledge base that can be read and edited by both humans and bots that stores structured data. It uses a graph database to store the data and has an endpoint for a SPARQL graph query. In the high level, entities are represented as nodes and properties of the entities as edges. Every relationship is stored as a triple e.g. (entity:Q76 property:26 entity:13133). This triple represents the relation that entity:Q76 (Obama) has property:26 (spouse) with entity:13133 (Michelle Obama). So if we are querying for the entity that is Obama’s spouse, we are looking for triple of the form (entity:Q76 property:26 ?x) where ?x the unknown entity we are looking for. The SPARQL syntax is beyond the scope of this blog post and if you are interested, you can learn more about the WikiData SPARQL here.

For this application, we will consider two types of SPARQL queries:

finding property of an entity (e.g. Who is Obama’s wife?)
1. We can search for the property that matches the entity (e.g.entity:Obama property:spouse ?x)
finding instances of entities with given properties (e.g. Which POTUS died from laryngitis?)
1. We can search for entities that are instances of the type we want that match the properties. E.g. which books are written by Douglas Adams: (?x property:instanceOf entity:book AND ?x property:writtenBy entity:DouglasAdams)
2. There are some extra cases needed to handle for this such as “positions held” that are a type of entity but is not an instance of. (?x property:positionHeld entity:POTUS AND ?x property:causeOfDeath entity:laryngitis)

Our SPARQL query for the example:

SELECT ?valLabel ?type
WHERE {
{
        wd:Q76 p:P26 ?prop . 
        ?prop ps:P26 ?val .
        OPTIONAL {
            ?prop psv:P26 ?propVal .
            ?propVal rdf:type ?type .
        }
    }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en"} 
}

Result

End result from querying WikiData:

{
    head: {
        vars: [
            "valLabel",
            "type"
        ]
    },
    results: {
        bindings: [
            {
                valLabel: {
                    xml:lang: "en",
                    type: "literal",
                    value: "Michelle Obama"
                }
            }
        ]
    }
}

Thus we get the final answer as “Michelle Obama”.

What else will you add?

Some ideas I have to extend this further would be to:

Add other data sources (e.g. DBPedia)
Spell check in preprocessing

This is cool! How can I help?

The code is relatively short and simple (~1000 lines with comments) and it should be easy to dive in and make your own pull request!