Improving Agency Matching with Machine Learning Published on October 25, 2022 by
Oliver
in Documentation, New Features, Stock Performer

One of Stock Performer’s most useful feature is the ability to combine the same image across multiple agencies. Our customers want to know how much a file earned in total, regardless of where it was uploaded. We introduced this feature in 2015 and ever since, we’ve distinguished between an upload, that is, one image uploaded to one agency, and a file, that same image uploaded to multiple agencies: If you upload an image to five different agencies, that’s one file and five uploads.

In our announcement from 2015, we already hinted at some of the difficulties of matching images across agencies: Thumbnails contain watermarks and are often processed slightly. Agencies like to sharpen thumbnails and apply colour correction. The binary files are never identical so we needed to use a kind of visual analysis to determine which files are the same. And it had to work for the 100+ million files we currently manage.

The Problem

We saw that it worked quite well for a large majority of matches:

But then there was a small percentage of images where our algorithm failed:

The colour composition in these two images is very similar, an area with skin tones in the middle with some greys, surrounded by plain white. Our algorithm has no understanding of the image’s content so it does not know that one is a hand and the other is a woman. These two images were simply combined into one file. Photographers producing a lot of “on white” pictures were predominantly affected by this.

The Consequences

While only a low single-digit percentage of images were wrongly matched, it resulted in problems that were irritating for some of our customers. For example, a collection for a recent photo shoot would suddenly also contain a photo uploaded many years ago when the algorithm (wrongly) determined that one of the new files was the same as an old one.

It also required our customers to review and fix agency matches. To a certain degree, this will always be required as no algorithm will ever be perfect. But our goal, of course, is to require as little manual intervention as possible.

The Improvement

We just launched a major improvement to the algorithm which we’ve been working on for a couple of months now. As you undoubtably have heard, Machine Learning tools are shaking up the industry at the moment, with models such as Dall-E 2 or Stable Diffusion allowing users to create any type of picture just by entering a text prompt. These models are built upon other models which work the other way around: They identify the semantic content of a picture. Now we can use these models to determine if two pictures are the same, not just by comparing the colours but by comparing their deeper structural content as the Machine Learning model sees them.

We’ve run lots of analyses using this new metric and found it to improve agency matching significantly. We expect it to produce far fewer errors and require much less manual intervention than our previous algorithm.

The Result

Of course, Machine Learning is also not perfect and there may still be a small number false positives (matching the wrong two images) and false negatives (no match where there should be one). Minor differences predominantly found in vector files still pose a problem:

Videos are also difficult to match as every agency picks a different frame to use as the thumbnail and the differences in these frames are often enough to trigger a “no match”:

There is, however, something you can do about this: For a number of years now, we’ve also been looking at an upload’s original filename, that is, the name you gave the file before uploading it to the agency. If you pick a unique filename for your files and make sure they have that filename when you upload them to the agency, our algorithm will match them automatically. A good filename may include a date (e.g. the date of a photo shoot), a brief description, and a number, for example:

2022-10-18-businessmeetings-0042.jpg

This is helping a good number of our customers to ensure perfect matches, especially when producing videos. (The flip side is that sometimes, common filenames such as IMG_2022.JPG are also matched despite being different photos. But this is exceedingly rare.)

Please note that this filename-based matching currently only works for some agencies as not all of them provide “original filename” information to us.

The Advice

Over the years, a number of questions have come up which we would like to address here:

What to do about wrong matches in collections?

Let’s say you have a collection which contains the following file which was matched incorrectly:

One of them belongs in the collection, the other one does not. Stock Performer doesn’t know which one is the correct one so when you break them apart, both will remain in the collection. If you simply remove the whole file, all of the uploads will be removed from the collection. What you need to do instead is break them apart first, then remove the wrong one. There is a handy little button that appears after you’ve changed the agency matching of a file which allows you do do that:

An upload was accepted multiple times by an agency.

Unfortunately, this happens more often than one would expect. Agencies don’t seem to have tools in place that detect that a file has already been accepted. In any case, Stock Performer can only match one file for each agency. So a duplicate will end up not matched. Most of the time, if there is a duplicate at an agency, only one of them will have sales and the other one won’t. You may want to put the upload that has sales in the matched file if you can.

I still have many wrong matches in my account.

It is very likely that these were matched in the old version of the algorithm. We did not want to re-match everything as that would lead to total chaos in your data and especially in your collections. However, if you believe that re-matching all of your files will not lead to any problems for you, you can get in touch with us so we can do that for you.

Generally speaking, if you have never (or only rarely) manually fixed wrong matches and you find that there are a lot of them in your account, it could be beneficial to redo it all now. But if you’ve spent a lot of time reviewing and correcting wrong matches in the past, it’s probably not a good idea to discard all of that hard work as you may have to review all matches again.

The Conclusion

As always, we strive to provide accurate data to you and improving agency matching is one of the ways to increase quality of our service. We are confident that this will get rid of many issues that you’ve reported to us in the past.

Please feel free to get in touch with us if you have any questions.

Improving Agency Matching with Machine Learning Published on October 25, 2022 by Oliver in Documentation, New Features, Stock Performer