Finding the answer to “which of these” questions

Posted 2019-08-26 07:48

Question:

I am writing a Python program for a quiz answer bot (for educational purposes only) using Tesseract OCR and the Google Custom Search API. The program seems to be very accurate on direct questions ("who did what", "what is this") but has problems with questions that include the answers as part of the question itself ("which of these").


import pytesseract
from PIL import Image
from googleapiclient.discovery import build
import json
import unicodedata
import time
import os

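# Remove non-ASCII characters from the OCR output
def strip_accents(text):
    # NFD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with 'ignore' then drops everything non-ASCII.
    text = unicodedata.normalize('NFD', text)\
           .encode('ascii', 'ignore')\
           .decode('ascii')

    return text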

# Send the search term to the Google Custom Search API and return the raw response
def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res

# Run Tesseract OCR on the question and answer images.
# .strip() removes the trailing whitespace/newlines that image_to_string typically appends,
# which would otherwise end up inside the quoted search phrase.
question = strip_accents(pytesseract.image_to_string(Image.open('/Users/lorenzo/Desktop/live_quiz/question.png'), lang='eng')).strip()
answer1 = strip_accents(pytesseract.image_to_string(Image.open('/Users/lorenzo/Desktop/live_quiz/answer1.png'), lang='eng')).strip()
answer2 = strip_accents(pytesseract.image_to_string(Image.open('/Users/lorenzo/Desktop/live_quiz/answer2.png'), lang='eng')).strip()
answer3 = strip_accents(pytesseract.image_to_string(Image.open('/Users/lorenzo/Desktop/live_quiz/answer.png'), lang='eng')).strip()


#creating three new questions by taking the original question and each of the answers
edited_question_1 = question + '? ' + '"' + answer1 + '"'
edited_question_2 = question + '? ' + '"' + answer2 + '"'
edited_question_3 = question + '? ' + '"' + answer3 + '"'


# Search each new question separately.
# my_api_key and my_cse_id (API key and Custom Search Engine ID) must be defined before this point.
result1 = google_search(edited_question_1, my_api_key, my_cse_id, num = 1)
result2 = google_search(edited_question_2, my_api_key, my_cse_id, num = 1)
result3 = google_search(edited_question_3, my_api_key, my_cse_id, num = 1)


#counting the search results for each google search
num_results_1=int(result1['searchInformation']['totalResults']) 
num_results_2=int(result2['searchInformation']['totalResults'])
num_results_3=int(result3['searchInformation']['totalResults'])
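
The code above stops at the three counts and does not show how they are compared. A minimal sketch, assuming the answer with the highest `totalResults` is taken as the guess (the helper name `pick_by_hits` is hypothetical and not part of the original program):

```python
def pick_by_hits(answers, hit_counts):
    """Return the answer whose search produced the most hits.

    answers and hit_counts are parallel sequences, e.g.
    [answer1, answer2, answer3] and
    [num_results_1, num_results_2, num_results_3].
    Ties resolve to the earliest answer in the list.
    """
    if not answers or len(answers) != len(hit_counts):
        raise ValueError("answers and hit_counts must be non-empty and of equal length")
    # max() over indices returns the first index with the largest count
    best_index = max(range(len(answers)), key=lambda i: hit_counts[i])
    return answers[best_index]
```

With the variables from the program this would be called as `pick_by_hits([answer1, answer2, answer3], [num_results_1, num_results_2, num_results_3])`.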

For now, this approach of googling three new questions, each built from the original question plus one of the answers, is very inaccurate, since the number of results can be influenced by many factors unrelated to the actual question (the popularity of one of the answers, for instance).
I was wondering if any of you knew of a better way of approaching this problem in order to improve precision.
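
To make the popularity effect concrete, here is a minimal sketch of one possible normalization: divide the hit count for the question combined with an answer by the hit count for that answer searched on its own, so that an answer that is simply common on the web does not win by default. This is an assumption for illustration, not something the program above does and not a known fix. The function name `normalized_scores` is hypothetical, and the standalone counts would require one extra search per answer.

```python
def normalized_scores(joint_hits, answer_hits):
    """Divide each question+answer hit count by the answer-only hit count.

    joint_hits[i]  : totalResults for the question combined with answer i
    answer_hits[i] : totalResults for answer i searched on its own
    An answer with zero standalone hits gets a score of 0.0,
    which also avoids a division by zero.
    """
    if len(joint_hits) != len(answer_hits):
        raise ValueError("joint_hits and answer_hits must have the same length")
    return [j / a if a > 0 else 0.0 for j, a in zip(joint_hits, answer_hits)]


# Made-up counts: the first answer is far more popular overall,
# so it leads on raw hits but not after normalization.
scores = normalized_scores([900, 300], [1_000_000, 1_000])
```

In this made-up case the raw counts favor the first answer (900 versus 300), while the normalized scores favor the second. Whether this improves real quiz results is untested.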