GSoC-21 Complete Journey with DBpedia.
Final official Blog ✨
Introduction: 🙋♂️
Hello! Guten Tag! Bonjour! Hola!!! 🙏
I am Ashutosh Kumar, a third-year undergraduate student in Computer Science. I was selected for Google Summer of Code 2021 under DBpedia for the project “DBpedia Live Neural Question Answering Chatbot — GSoC2021”.
From 17th May to 22nd August 2021, it has been a wonderful journey. These 10 weeks were packed with learnings and many firsts.
During this period, I had the opportunity to work with some very cool and helpful mentors -
Project Description:✍
The project was to build a live chatbot version of the DBpedia Neural Question Answering dataset (DBNQA) using Google Dialogflow, connected to a webhook Flask server that performs entity linking and fetches answers from the DBpedia endpoint. We use the DBNQA templates as intents: the dataset is converted to CSV format and uploaded to the Dialogflow knowledge base.
FLOW-
So whenever a question is asked in the Dialogflow bot,
Eg-
Question- "Give me the total number of architects of the buildings whose one of the architects was Stanford white?"
Dialogflow triggers an intent from the knowledge base and returns the closest query to the question-
BOT-"SELECT DISTINCT COUNT(?uri) where { ?x dbp:architect “tok A” . ?x dbp:architect ?uri }"
where “tok A” stands for the entity “Stanford White” in the question. As soon as this response is triggered, Dialogflow makes its webhook call, which is responsible for entity linking.
The Flask-based web app then accepts the request, extracts the entities from the question, and performs entity linking. After this, the query is -
Query- "SELECT DISTINCT COUNT(?uri) where { ?x dbp:architect <http://dbpedia.org/resource/Stanford_White>. ?x dbp:architect ?uri }"
This query is executed against the DBpedia SPARQL endpoint, and the answer is sent back from the web app to Dialogflow, where it is displayed in the bot.
Answer - 34
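For illustration, here is a minimal sketch of how such a resolved query can be run against the public DBpedia SPARQL endpoint using SPARQLWrapper (the fetch_answer helper name is my own, not the project's code):

from SPARQLWrapper import SPARQLWrapper, JSON

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def fetch_answer(query):
    # Run a resolved SPARQL query against DBpedia and return the bound values.
    sparql = SPARQLWrapper(DBPEDIA_ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [
        value["value"]
        for row in results["results"]["bindings"]
        for value in row.values()
    ]

print(fetch_answer(
    "SELECT DISTINCT COUNT(?uri) where { ?x dbp:architect "
    "<http://dbpedia.org/resource/Stanford_White> . ?x dbp:architect ?uri }"
))  # e.g. ['34']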
The project was carried out in two major phases: Phase 1 for the Dialogflow and Flask web server setup, and Phase 2 for entity linking.
Phase 1:📌
Dialogflow Setup and Flask server setup.
In this phase, we started by exploring the DBNQA dataset and thinking about how to upload it to Dialogflow, so we wrote a function that converts the DBNQA templates into the CSV format required by the Dialogflow knowledge base.
After conversion, we uploaded the CSV files and our intents were ready.
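A minimal sketch of such a conversion is below, assuming the templates are already available as (question, query) pairs; the output file name and the two-column question/answer layout expected by the Dialogflow FAQ knowledge base are assumptions here, so check the current Dialogflow documentation for the exact format.

import csv

def templates_to_csv(pairs, out_path):
    # Write (question template, SPARQL template) pairs as a two-column CSV,
    # where the SPARQL template plays the role of the knowledge-base "answer".
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for question, query in pairs:
            writer.writerow([question, query])

templates_to_csv(
    [("Give me the total number of architects of the buildings whose one of the architects was <A>?",
      "SELECT DISTINCT COUNT(?uri) where { ?x dbp:architect <A> . ?x dbp:architect ?uri }")],
    "dbnqa_knowledge_base.csv",  # hypothetical output file
)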
After this, we started setting up our webhook and responses. We built a Flask server with a POST endpoint, so that after an intent is triggered and the webhook call is made, the Flask server performs entity linking, resolves the query, and returns the fetched answer as its response.
In the Flask server, we take the input question from the incoming request and pass it through our entity-linking function, which returns the entities; we then substitute them into the query, which is also part of the incoming Dialogflow request.
After resolving the query, we fetch its result from the DBpedia endpoint and return it to Dialogflow as the Flask server's response.
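The sketch below shows the rough shape of such a webhook. It reuses the fetch_answer helper sketched earlier; the payload field names and the entity_link stub are assumptions, not the project's exact code.

from flask import Flask, request, jsonify

app = Flask(__name__)

def entity_link(question):
    # Placeholder stub: the real entity linking (Phase 2) extracts tokens from
    # the question and disambiguates them against DBpedia. It returns a mapping
    # from placeholder tokens in the template query to resource URIs.
    return {"“tok A”": "http://dbpedia.org/resource/Stanford_White"}

@app.route("/webhook", methods=["POST"])
def webhook():
    req = request.get_json(force=True)
    # Field names are an assumption; inspect the real Dialogflow payload.
    question = req["queryResult"]["queryText"]
    template_query = req["queryResult"]["fulfillmentText"]

    # Replace each placeholder with its linked DBpedia resource.
    for placeholder, uri in entity_link(question).items():
        template_query = template_query.replace(placeholder, f"<{uri}>")

    # fetch_answer is the SPARQL helper from the earlier sketch.
    answer = fetch_answer(template_query)
    return jsonify({"fulfillmentText": ", ".join(answer) or "No answer found."})

if __name__ == "__main__":
    app.run(port=5000)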
All the setup instructions and code can be found HERE.
Phase 2:📌
Entity Linking Experimentation and Research
We devoted most of our time to the entity-linking part. For this, we started with approaches that are already available, i.e. DBpedia Spotlight and Falcon.
We started testing them and reading their research papers to understand how they work and what their scores are on the LC-QuAD dataset.
After analysis, we observed that both work on a similar concept: first spot the entities in the sentence, then use them to get the final entity link from the DBpedia database.
The scores for DBpedia Spotlight on the LC-QuAD benchmark dataset were -
precision score: 0.5463
recall score: 0.5711
F-Measure score: 0.5584
The scores for Falcon on the LC-QuAD benchmark dataset were -
precision score: 0.7976
recall score: 0.8628
F-Measure score: 0.8289
So we divided our entity linking into two parts: first, token or entity extraction from the question; second, getting candidates from DBpedia Lookup and the DBpedia endpoint based on that token, followed by entity disambiguation to select the final entity.
We decided to focus on the second part first. So instead of doing proper token extraction on unknown questions and entities, we built templates by comparing the entities in the benchmark with the entities in the question and replacing the entities in the question with <> placeholders. This gave us a template: the question with its entities replaced by placeholders.
Then we simply compared the template and the question to extract the tokens from the sentence.
Eg-
Question-"Is Pulau Ubin the largest city of Singapore"Entities in Benchmark-
['http://dbpedia.org/resource/Pulau_Ubin','http://dbpedia.org/resource/Singapore']Template created - "Is <A> the largest city of <B>"Token extracted by comparing question and template-
['Pulau Ubin', 'Singapore']
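As an illustration, here is a minimal sketch of this template-versus-question comparison; the extract_tokens function is my own naming, not the project's code:

import re

def extract_tokens(question, template):
    # Turn "Is <A> the largest city of <B>" into a regex with capture groups,
    # escaping the literal parts of the template.
    parts = [re.escape(p) for p in re.split(r"<[A-Z]>", template)]
    pattern = "^" + "(.+?)".join(parts) + "$"
    match = re.match(pattern, question, flags=re.IGNORECASE)
    return [g.strip() for g in match.groups()] if match else []

print(extract_tokens(
    "Is Pulau Ubin the largest city of Singapore",
    "Is <A> the largest city of <B>",
))  # ['Pulau Ubin', 'Singapore']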
Then, using the token, we called the DBpedia Lookup API and the DBpedia endpoint to get candidate entities. The query we used on the endpoint to get the candidates was -
SELECT ?uri ?label WHERE { ?uri rdfs:label ?label . ?label bif:contains "'TOKEN'" } limit 100
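A small sketch of candidate retrieval with this query is below (only the SPARQL path is shown; DBpedia Lookup is called over HTTP in a similar way, and the get_candidates name is my own):

from SPARQLWrapper import SPARQLWrapper, JSON

def get_candidates(token, limit=100):
    # Return (uri, label) pairs whose rdfs:label contains the token,
    # using the bif:contains query shown above.
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT ?uri ?label WHERE {{
            ?uri rdfs:label ?label .
            ?label bif:contains "'{token}'"
        }} limit {limit}
    """)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [(r["uri"]["value"], r["label"]["value"]) for r in rows]

print(get_candidates("Pulau Ubin")[:5])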
After getting the candidates, for disambiguation we counted the intersection of words between each candidate entity and the question and divided it by the total number of words in the candidate entity. Based on that score, we disambiguated the entity.
Disambiguation formula -
score = (number of intersecting words) / (number of words in the entity)
If the score is the same for more than one entity, we select the one with the maximum number of intersecting words with the question.
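A minimal sketch of this scoring, which could consume the (uri, label) pairs from the candidate sketch above (the candidate list shown here is hypothetical):

def disambiguate(question, candidates):
    # candidates: list of (uri, label) pairs.
    # score = |words(label) ∩ words(question)| / |words(label)|,
    # with ties broken by the raw intersection count.
    q_words = set(question.lower().split())
    best = None
    for uri, label in candidates:
        l_words = set(label.lower().split())
        overlap = len(l_words & q_words)
        score = overlap / len(l_words) if l_words else 0.0
        if best is None or (score, overlap) > best[0]:
            best = ((score, overlap), uri)
    return best[1] if best else None

print(disambiguate(
    "Is Pulau Ubin the largest city of Singapore",
    [("http://dbpedia.org/resource/Pulau_Ubin", "Pulau Ubin"),
     ("http://dbpedia.org/resource/Pulau_Ubin_Jetty", "Pulau Ubin Jetty")],
))  # http://dbpedia.org/resource/Pulau_Ubin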
This was not our first approach but one of our improved approaches.
Improving the scores was a constant cycle of analysing the problem, improving, and testing; all the problems we encountered and their solutions are documented in the Google Doc with proper examples.
For disambiguation alone, with token extraction done via the templates, the scores we achieved were -
precision score: 0.8647
recall score: 0.8647
F-Measure score: 0.8647
Then we started focusing on the first part: entity spotting and extraction without any prior knowledge of what the probable entities might be.
We started with spaCy. Using a spaCy large model, we extracted named entities and noun chunks and combined them in a superset manner to get the tokens. While extracting noun chunks, we filtered them and kept only those chunks whose noun concentration was greater than or equal to 0.5.
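A minimal sketch of this spotting step, assuming the en_core_web_lg model is installed; the noun-concentration filter below is my reading of the description above, not the project's exact code:

import spacy

nlp = spacy.load("en_core_web_lg")

def spot_tokens(question, min_noun_ratio=0.5):
    doc = nlp(question)
    tokens = {ent.text for ent in doc.ents}           # named entities
    for chunk in doc.noun_chunks:                     # noun chunks
        nouns = sum(t.pos_ in ("NOUN", "PROPN") for t in chunk)
        if nouns / len(chunk) >= min_noun_ratio:      # "noun concentration" filter
            tokens.add(chunk.text)
    return list(tokens)

print(spot_tokens("Is Pulau Ubin the largest city of Singapore"))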
The scores for the spaCy approach combined with disambiguation were -
precision score: 0.7908
recall score: 0.8238
F-Measure score: 0.8070
Then we moved on to our second method of token extraction, which became our main method: Stanford CoreNLP parser + Stanford Stanza NER for token extraction + disambiguation - GitHub
In this approach, we used the Stanford CoreNLP parser to get the parse tree and extracted noun phrases from it. Then we used Stanza for named entity recognition and merged its entities with the noun phrases in a superset manner to get the final tokens - click for the functions Token extractor, Phrase extractor.
One extra feature we had with Stanford was that we could control noun-phrase extraction using the parse tree.
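For illustration, the sketch below approximates this step using Stanza's own constituency parser and NER instead of a separate CoreNLP server, so it is an approximation of the pipeline rather than the project's exact code (the English models must be downloaded first with stanza.download("en")):

import stanza

nlp = stanza.Pipeline("en", processors="tokenize,pos,constituency,ner")

def leaves(node):
    # Collect the words under a constituency-tree node.
    if not node.children:
        return [str(node.label)]
    words = []
    for child in node.children:
        words.extend(leaves(child))
    return words

def noun_phrases(node):
    # Collect every NP subtree as a phrase string.
    phrases = []
    if node.label == "NP":
        phrases.append(" ".join(leaves(node)))
    for child in node.children:
        phrases.extend(noun_phrases(child))
    return phrases

def spot_tokens(question):
    doc = nlp(question)
    tokens = {ent.text for ent in doc.ents}                  # Stanza NER
    for sentence in doc.sentences:
        tokens.update(noun_phrases(sentence.constituency))   # NP subtrees
    return list(tokens)

print(spot_tokens("Is Pulau Ubin the largest city of Singapore"))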
The scores for this approach with disambiguation were -
precision score: 0.8038
recall score: 0.8515
F-Measure score: 0.8269
Future enhancements:🚀
- Primarily, we will analyse and improve our most recent approach and try to match the score of 0.86 that we achieved in the disambiguation-only testing.
- We will also use expansion and relation-based search to improve our scores.
Links of work done: 📝
🔷Documentation Links-
- Setup and Complete Documentation of Dialogflow bot.
- Entity Linking all experiments and scores.
- Problems and improvement encountered while Entity linking.
🔷Code Links-
🔷Main Repository- GitHub
🔷Previous Blogs-
- Introduction-and-good-news
- Community-bonding-17th-May-to-6th-June
- The-coding-period-begins
- Week-3-and-4
- Week-5-and-6
- Week-6-and-7
- Week-8-and-9
Learnings:📚
- Learnt how to plan and do research work step by step.
- Learnt how to make and run pipelines.
- Got to know about benchmarking and scores.
- Sharpened my skills in Git, Flask, and NLP.
- Learnt the importance of time management as well as perfect deliverables.
- Improved my documentation skills.
- Improved my collaboration and communication skills.
Overall experience:😄
It was an amazing experience for me to work for DBpedia. I am very thankful to all the mentors for giving me exposure to real-world problems and being so helpful and supportive at every step. Thanks to the whole GSoC community for providing such an amazing environment. This is not the end but a mere beginning of my journey with DBpedia and my mentors.
References :📍
@article{hartmann-marx-soru-2018,
author = {Hartmann, Ann-Kathrin and Marx, Edgard and Soru, Tommaso},
abstract = {The role of Question Answering is central to the fulfillment of the Semantic Web. Recently, several approaches relying on artificial neural networks have been proposed to tackle the problem of question answering over knowledge graphs. },
booktitle = {Workshop on Linked Data Management, co-located with the W3C WEBBR 2018},
title = {Generating a Large Dataset for Neural Question Answering over the {DB}pedia Knowledge Base},
url = {https://www.researchgate.net/publication/324482598_Generating_a_Large_Dataset_for_Neural_Question_Answering_over_the_DBpedia_Knowledge_Base},
year = 2018
}
— — — — — -
@inproceedings{conf/naacl/SakorMSSV0A19,
author = {Sakor, Ahmad and Mulang, Isaiah Onando and Singh, Kuldeep and Shekarpour, Saeedeh and Vidal, Maria-Esther and Lehmann, Jens and Auer, Sören},
booktitle = {NAACL-HLT (1)},
editor = {Burstein, Jill and Doran, Christy and Solorio, Thamar},
publisher = {Association for Computational Linguistics},
title = {Old is Gold: Linguistic Driven Approach for Entity and Relation Linking of Short Text.},
url = {https://aclweb.org/anthology/papers/N/N19/N19-1243/},
year = 2019
}
— — — — — -
@inproceedings{mendes2011spotlight,
title = {DBpedia Spotlight: Shedding Light on the Web of Documents},
author = {Pablo N. Mendes and Max Jakob and Andres Garcia-Silva and Christian Bizer},
year = {2011},
booktitle = {Proceedings of the 7th International Conference on Semantic Systems (I-Semantics)},
abstract = {Interlinking text documents with Linked Open Data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing.}
}