# Containment

You can calculate n-gram counts using count vectorization, and then follow the formula for containment:

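$$ \text{containment} = \frac{\sum \text{count}(\text{ngram}_{A}) \cap \text{count}(\text{ngram}_{S})}{\sum \text{count}(\text{ngram}_{A})} $$

where A = answer text and S = source text.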

If the two texts have no n-grams in common, the containment will be 0; if every n-gram in the answer also appears in the source, the containment will be 1. Intuitively, having longer n-grams in common might be an indication of cut-and-paste plagiarism.
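The two boundary cases can be checked with a minimal sketch. It swaps in `collections.Counter` and plain whitespace tokenization for a library count vectorizer, and the helper names `ngram_counts` and `containment_from_text`, as well as the sample sentences, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the word n-grams in a text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def containment_from_text(answer, source, n):
    """Share of the answer's n-grams that also occur in the source."""
    counts_a = ngram_counts(answer, n)
    counts_s = ngram_counts(source, n)
    # Counter & Counter keeps the minimum count of each shared n-gram.
    intersection = sum((counts_a & counts_s).values())
    total_a = sum(counts_a.values())
    # An answer with no n-grams is treated as containment 0 (an assumption).
    return intersection / total_a if total_a else 0.0


answer = "the quick brown fox jumps"
source = "a quick brown fox sleeps"

print(containment_from_text(answer, source, 1))   # 3 of 5 unigrams shared
print(containment_from_text(answer, source, 2))   # 2 of 4 bigrams shared
print(containment_from_text(answer, answer, 2))   # identical texts
print(containment_from_text("a b c", "x y z", 1)) # nothing in common
```

Note that the value is normalized by the answer's n-grams only, so swapping the two arguments can give a different result.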

[Figure: containment example]

import numpy as np

def containment(ngram_array):
    ''' Containment is a measure of text similarity. It is the normalized
       intersection of n-gram word counts in two texts.
       :param ngram_array: an array of n-gram counts for an answer (row 0)
           and a source text (row 1).
       :return: a normalized containment value.'''

    # Intersection: for each n-gram, the count shared by both texts
    intersection = np.minimum(ngram_array[0], ngram_array[1])
    # Normalize by the total number of n-grams in the answer
    return intersection.sum() / ngram_array[0].sum()


# row_0 = text 1 (answer)
# row_1 = text 2 (source)

ngram_array = np.array([[1, 1, 1, 0, 1, 1],
                        [0, 0, 1, 1, 1, 1]], dtype=np.int64)

print(containment(ngram_array))
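
As a worked check on the example array, assuming row 0 is the answer, row 1 is the source, and the intersection is read as the element-wise minimum of the two rows of counts, the numerator sums the shared counts and the denominator sums the answer's counts:

$$
\text{containment} = \frac{0 + 0 + 1 + 0 + 1 + 1}{1 + 1 + 1 + 0 + 1 + 1} = \frac{3}{5} = 0.6
$$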
