Monday, October 14, 2013

Python N-gram Map

I have developed a data structure in Python for storing and querying n-grams, which is released as open source here. This blog post gives examples of how to use it.

To add n-grams


The following adds the n-grams (a,a,a), (a,a,b), (a,b,a) and (b,a,a), each mapped to None.

x = NGramMap()
x[('a','a','a')] = None
x[('a','a','b')] = None
x[('a','b','a')] = None
x[('b','a','a')] = None

To count n-gram frequency


The following adds the n-grams (a,a,a), (a,a,b), (a,b,a) and (b,a,a) several times, mapping each one to the number of times it was encountered, and then prints every n-gram with its count.

ngrams = [ ('a','a','a'), ('a','a','b'), ('a','a','a'), ('a','a','b'), ('a','b','a'), ('b','a','a'), ('a','b','a'), ('b','a','a'), ('a','b','a'), ('a','b','a') ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for ngram in x.ngrams():
    print(ngram, x[ngram])
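
In practice the n-grams usually come from a sequence of tokens rather than a hand-written list. Here is a minimal sketch of that step using only the standard library. The helper name extract_ngrams is hypothetical and not part of NGramMap, and a collections.Counter stands in for the map so that the snippet runs on its own.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['a', 'a', 'a', 'b', 'a', 'a']

# Counter is only a stand-in here; with NGramMap you would use the loop shown above.
counts = Counter(extract_ngrams(tokens, 3))

for ngram, count in counts.items():
    print(ngram, count)
```

Tuples are used because they are hashable, which is also why the examples in this post index the map with tuples.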

To find n-grams which contain particular elements


The following adds the n-grams (a,x,y), (x,a,y), (b,x,y) and (c,x,z), mapping each one to the number of times it was encountered, and then looks up the n-grams which contain the elements 'a' and 'y'.

ngrams = [ ('a','x','y'), ('x','a','y'), ('a','x','y'), ('b','x','y'), ('c','x','z')  ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for ngram in x.ngrams_with_eles({ 'a', 'y' }):
    print(ngram, x[ngram])
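
To make the query concrete, here is a rough sketch of the same lookup over a plain Counter. It assumes the query means "contains every one of the given elements", which is my reading of the method name rather than something checked against the library. The function with_elements is hypothetical, and this linear scan says nothing about how NGramMap does the lookup internally.

```python
from collections import Counter

ngrams = [('a', 'x', 'y'), ('x', 'a', 'y'), ('a', 'x', 'y'), ('b', 'x', 'y'), ('c', 'x', 'z')]
counts = Counter(ngrams)

def with_elements(counts, elements):
    """Return the n-grams that contain every element in `elements` (assumed semantics)."""
    wanted = set(elements)
    return [ngram for ngram in counts if wanted <= set(ngram)]

for ngram in with_elements(counts, {'a', 'y'}):
    print(ngram, counts[ngram])
```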

To find n-grams which follow a particular pattern


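The following adds the n-grams (a,x,y), (x,a,y), (b,x,y) and (c,x,z), mapping each one to the number of times it was encountered, and then looks up the n-grams which match the pattern (None, x, y). The second argument lists the positions in the pattern which are left free to match any element, in this case position 0.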

ngrams = [ ('a','x','y'), ('x','a','y'), ('a','x','y'), ('b','x','y'), ('c','x','z')  ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for ngram in x.ngrams_by_pattern(( None, 'x', 'y' ), { 0 }):
    print(ngram, x[ngram])
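
The matching rule can be sketched over a plain Counter as follows. This assumes that the second argument is a set of positions which may hold any element and that every other position must equal the pattern exactly. The function matches_pattern is hypothetical and only illustrates the rule, not the library's implementation.

```python
from collections import Counter

ngrams = [('a', 'x', 'y'), ('x', 'a', 'y'), ('a', 'x', 'y'), ('b', 'x', 'y'), ('c', 'x', 'z')]
counts = Counter(ngrams)

def matches_pattern(ngram, pattern, free_positions):
    """True if ngram equals pattern everywhere except at the free positions."""
    if len(ngram) != len(pattern):
        return False
    return all(i in free_positions or ngram[i] == pattern[i]
               for i in range(len(pattern)))

for ngram in counts:
    if matches_pattern(ngram, (None, 'x', 'y'), {0}):
        print(ngram, counts[ngram])
```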

To find elements which share similar contexts


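You can find elements which occur in the same context in their n-grams. For example, 'a' and 'b' share a context in the n-grams (a, x, y) and (b, x, y), as do 'p' and 'q' in the n-grams (x, p, y) and (x, q, y). The code below takes every n-gram which contains a given element, leaves that element's position free, and prints the other elements which appear in that position.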

ngrams = [ ('a','x','y'), ('x','a','y'), ('a','x','y'), ('b','x','y'), ('c','x','z')  ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for element in x.elements():
    print(element)
    for ngram in x.ngrams_with_eles({ element }):
        ele_index = ngram.index(element)
        for ngram_ in x.ngrams_by_pattern(ngram, { ele_index }):
            if ngram_[ele_index] != element:
                print("\t", ngram_[ele_index])
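
The same idea can be sketched without the library by grouping elements under the context they appear in. This is a different technique from the nested queries above: a single pass which builds a dictionary keyed by (position, remaining elements). The variable names are my own and nothing here is part of NGramMap.

```python
from collections import defaultdict

ngrams = [('a', 'x', 'y'), ('x', 'a', 'y'), ('a', 'x', 'y'), ('b', 'x', 'y'), ('c', 'x', 'z')]

# Map each context (position, n-gram with that position removed) to the
# set of elements seen in that position.
contexts = defaultdict(set)
for ngram in set(ngrams):
    for i, element in enumerate(ngram):
        context = (i, ngram[:i] + ngram[i + 1:])
        contexts[context].add(element)

# Keep only the contexts which are shared by more than one element.
shared = {context: sorted(elements)
          for context, elements in contexts.items()
          if len(elements) > 1}

for (position, rest), elements in shared.items():
    print(position, rest, elements)
```

For the sample data the only shared context is position 0 followed by (x, y), which holds 'a' and 'b'.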