Week 8 and 9
(2nd Aug – 16th Aug)
Hi again!! 😄
Now we're in the last working weeks of the GSoC 2021 period.
In the previous blog, I mentioned that we were focusing on the spotting algorithm. The approaches in this blog use our own spotting functions, which makes them fully independent. That was not the case with the previous ones, where we extracted tokens using templates created from already-known benchmark entities.
Using the spaCy large model, we extracted named entities and noun chunks and combined them in a superset manner to get the tokens. While collecting noun chunks, we filter and keep only those chunks whose noun concentration is greater than or equal to 0.5.
You can check the function here.
Eg -
Question: "What is the birth place of the cast of Lagnacha Dhumdhadaka"
NER: ['Lagnacha Dhumdhadaka']
Noun chunks: ['the birth place', 'the cast', 'Lagnacha Dhumdhadaka']
Final extracted tokens: ['Lagnacha Dhumdhadaka']
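The noun-concentration filter and the combination step can be sketched roughly as below. The spaCy pipeline itself is omitted (the chunks arrive as plain strings with their POS tags), and `merge_spans` is only my guess at the "superset" combination — the real rule lives in the linked function:

```python
def noun_concentration(pos_tags):
    """Fraction of a chunk's tokens tagged as nouns or proper nouns
    (spaCy-style coarse POS tags assumed)."""
    if not pos_tags:
        return 0.0
    return sum(p in ("NOUN", "PROPN") for p in pos_tags) / len(pos_tags)

def merge_spans(entities, chunks):
    """Hypothetical 'superset' merge: pool NER spans and filtered noun
    chunks, then drop any phrase contained in a longer one."""
    phrases = set(entities) | set(chunks)
    return sorted(p for p in phrases
                  if not any(p != q and p in q for q in phrases))

# 'the birth place' -> DET NOUN NOUN -> concentration 2/3, so it passes the 0.5 cut
print(noun_concentration(["DET", "NOUN", "NOUN"]))  # ≈ 0.67
```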
After token extraction, the other functions remained almost the same as before: getting candidates using DBpedia Lookup and the SPARQL endpoint, plus the disambiguation approach.
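For the candidate-generation step, a request to the public DBpedia Lookup service looks roughly like this. Only the URL building is shown; the parameter names follow the current Lookup API, and the actual fetching and JSON parsing are left out:

```python
from urllib.parse import urlencode

LOOKUP_ENDPOINT = "https://lookup.dbpedia.org/api/search"

def candidate_lookup_url(token, max_results=10):
    """Build a DBpedia Lookup URL for fetching candidate entities
    for one spotted token."""
    params = {"query": token, "maxResults": max_results, "format": "JSON"}
    return LOOKUP_ENDPOINT + "?" + urlencode(params)

print(candidate_lookup_url("Lagnacha Dhumdhadaka"))
```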
We tested it combined with the complete process, and the scores we got were -
precision score: 0.7907653910149761
recall score: 0.8238352745424293
F-Measure score: 0.8069616678630216
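As a quick sanity check, the F-measure above is just the harmonic mean of the reported precision and recall:

```python
p = 0.7907653910149761  # precision
r = 0.8238352745424293  # recall
f1 = 2 * p * r / (p + r)
print(round(f1, 6))  # 0.806962
```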
We also had a question: during disambiguation, what if we compare the candidate entities only with the extracted tokens instead of the complete question? We tested this as well, and the F1 score we got was -
0.76
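The idea of that experiment — ranking candidates against the token alone versus the whole question — can be sketched with a simple string-similarity ranking. `difflib` here is only a stand-in for whatever similarity measure the real disambiguation uses, and the candidate labels are made up:

```python
from difflib import SequenceMatcher

def rank_candidates(context, candidate_labels):
    """Order candidate entity labels by similarity to the context,
    which can be the full question or just the extracted token."""
    def score(label):
        return SequenceMatcher(None, context.lower(), label.lower()).ratio()
    return sorted(candidate_labels, key=score, reverse=True)

# Comparing against the extracted token alone:
print(rank_candidates("Lagnacha Dhumdhadaka",
                      ["Lagnacha Dhumdhadaka (film)", "Lagna"])[0])
```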
Then we moved on to our second method of token extraction, which will be our main one.
2. Stanford CoreNLP parser + Stanford Stanza NER (for token extraction) + disambiguation - Github
In this approach, we used the Stanford CoreNLP parser to get the parse tree and extracted noun phrases from it. Then we used Stanza for named entity recognition, and its output was merged with the noun phrases in a superset manner to get the final tokens - click for the functions: Token extractor, Phrase extractor.
One extra feature we had with Stanford was that we could control noun-phrase extraction using the parse tree.
Eg-
Question- "Name the university whose athletic department is called National Collegiate Athletic Association and has a chancellor named Nicholas S. Zeppos"
The parse tree we get -
(ROOT
(S
(VP (VB Name)
(NP
(NP (DT the) (NN university))
(SBAR
(WHNP (WP$ whose)
(NML (JJ athletic) (NN department)))
(S
(VP
(VP (VBZ is)
(VP (VBN called)
(NP
(NML (NNP National) (NNP Collegiate))
(JJ Athletic) (NN Association))))
(CC and)
(VP (VBZ has)
(NP
(NP (DT a) (NN chancellor))
(VP (VBN named)
(NP (NNP Nicholas) (NNP S.) (NNP Zeppos))))))))))))
From this we get -
as noun phrases: ['National Collegiate Athletic Association', 'Nicholas S. Zeppos']
We get the same spans from the Stanza NER, so our final tokens remain the same.
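Extracting those phrases from the bracketed tree can be reproduced with a small stand-alone parser. The "all words capitalised" filter below is only a rough stand-in for the blog's actual noun-phrase selection rule, but it recovers the same two phrases on this example:

```python
def tokenize(tree_str):
    # Split the bracketed tree into "(", ")" and atom tokens
    return tree_str.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    # Recursive-descent parse into (label, children) tuples; leaves are words
    tok = tokens.pop(0)
    if tok == "(":
        label = tokens.pop(0)
        children = []
        while tokens[0] != ")":
            children.append(parse(tokens))
        tokens.pop(0)  # consume ")"
        return (label, children)
    return tok

def leaves(node):
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

def np_phrases(node, out=None):
    # Collect NP subtrees whose every word starts with a capital letter
    if out is None:
        out = []
    if isinstance(node, tuple):
        label, children = node
        if label == "NP":
            words = leaves(node)
            if all(w[0].isupper() for w in words):
                out.append(" ".join(words))
        for child in children:
            np_phrases(child, out)
    return out

tree = """(ROOT (S (VP (VB Name) (NP (NP (DT the) (NN university))
  (SBAR (WHNP (WP$ whose) (NML (JJ athletic) (NN department)))
    (S (VP (VP (VBZ is) (VP (VBN called)
        (NP (NML (NNP National) (NNP Collegiate)) (JJ Athletic) (NN Association))))
      (CC and) (VP (VBZ has) (NP (NP (DT a) (NN chancellor))
        (VP (VBN named) (NP (NNP Nicholas) (NNP S.) (NNP Zeppos))))))))))))"""

print(np_phrases(parse(tokenize(tree))))
# ['National Collegiate Athletic Association', 'Nicholas S. Zeppos']
```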
Then, using the tokens, we get the final entities after disambiguation.
The scores for this approach on the LC-QuAD dataset are -
precision score: 0.8037643207855966
recall score: 0.8514729950900164
F-Measure score: 0.8269311077049624
We had to run both approaches multiple times to build the correct function for token extraction.
All of these are properly documented with the dataset, code, and observations here — GitHub
Is it the end?
No, we are continuing the research work to improve on these scores, and we will add relation linking later as well.
Do remember to check out my final blog on GSoC, which will cover my entire journey and learnings under GSoC 2021 with DBpedia.
Thank you for your interest.
Take care.🙏😀🙌