Identifying the identifier

The most important Job for us to get right to be a helpful price comparison platform for fashion is to find the same product in as many shops as possible. At first sight this sounds like an easy enough task to perform but structured meta data is scarce while metadata embedded in flowing text is relatively abundant. In this post I will explore our attempt to find IDs in product names and descriptions. We based our solution on a basic machine learning technique, the identification tree; which we also made dynamic to be able to retrain it at runtime.

How it works

There is a set of rules that make a binary decision whether a word is an ID based on a preferably atomic observation. When put into words rules read like this:
This word is an ID because it

  • consists of only uppercase letters and numbers
  • contains more than 40% numbers
  • contains at least one of these characters:.-_/
  • is at least 6 characters long

Each of these bullet points is implemented as one rule.

Then we need a training set that contains example words and the
information whether we would consider this word an ID.

word is ID?
dog false
12345 true
AI8NG5 true
GH-12 true
abc-9 false

Given these sets of code and data we create an algorithm that uses the
training set to arrange the rules into a tree that is ultimately able
to identify all examples correctly as either an ID or not an ID.

Building the tree (i.e. training)

The algorithm executes each rule on all examples and for each rule
records into which of the four possible result categories each example
falls.

example correct positive incorrect positive correct negative incorrect negative
dog x
12345 x
AI8NG5 x
GH-12 x
abc-9 x

With this information we can calculate the diversity of each identification attempt – meaning how big of a mistake each rule made – using this formula:

diversity formula
diversity distribution

We conveniently define anything divided by 0 to be 0 and the 2s logarithm of 0 to be -1. This results in a distribution like the one to the right where the “all correct positives” score lies on the far left, the “all correct negatives” score on the far right and “getting everything wrong” dead center. The beauty of this score is that the difference between making no mistakes and one mistake has a big influence while getting 50% wrong and failing everything scores equally bad. Lower diversity scores consequently identify better rules.

If the best rule is chosen we can use this rule to divide the example set into two subsets and then recursively apply all rules again to further divide the example set into smaller and smaller subsets. The recursion stops when a subset only contains correct positives or correct negatives.

example identification tree

Traversing the tree to identify unknown words

If an unknown word has to be identified the root rule is executed on the rule. Depending on the outcome we follow either the left branch or the right branch all the way down to a leaf node that then tells us whether the word is considered an ID or not an ID.

Dynamically extending the training set and rebuilding the tree

Usually IT does not find all required examples to build a tree that correctly deals with all possible situations. Therefore it might make sense to put the training set into a spreadsheet that the identifier periodically imports to rebuild the tree at runtime.

Caveats

The identification tree is an almost misleadingly simple concept to implement because the algorithm is written in a few hours and it’s easy to judge when the code itself works correctly. But as with all machine learning techniques it takes a rather long time to compile a comprehensive and diverse training set to build an efficient tree. Also coming up with rules and implementing them correctly takes careful thought and vigorous automated testing. As a point of reference: our current implementation uses about 500 examples, 32 rules, and generates a tree with 104 nodes. It took a few weeks to gather the examples and implement the rules and we still discover new examples on a regular basis.

Things to consider when developing examples and rules

It’s generally possibly for a rule to appear multiple times on a path in the tree but it’s also a hint that the rule might not work that well

The tree building algorithm can run into a situation where it cannot find a rule that can further divide a subset of examples. This can lead to hard to detect infinite loops that ultimately crash with a stack overflow error.

There is always the risk of conflicting examples in the training set. These also lead to infinite loops while building a tree.

The example subset that leads to an infinite loop provides insight into which rules might be missing. Include the currently processed subset in error messages.

There are basically 3 kinds of rules:

  • maybe positive / maybe negative
  • definite positive / unsure
  • unsure / definite negative

The first kind is very helpful for dividing a large example set into smaller more manageable subsets. The other two facilitate making a final decision on whether a word is considered an ID or not an ID. Use all three. Intentionally.