Hacker News

We've just open-sourced our library imagededup, a Python package that simplifies the task of finding exact and near duplicates in an image collection.

It includes several hashing algorithms (PHash, DHash, etc.) and convolutional neural networks; an evaluation framework to judge the quality of deduplication; and easy plotting of duplicates, all behind a simple API.

We're really excited about this library because finding duplicate images is an important task in computer vision and machine learning. For example, heavy duplication can badly bias the evaluation of your ML model (see the well-known duplicate problem in CIFAR-10). Please try out our library, star it on GitHub, and spread the word! We'd love to get feedback.



This looks really interesting. Can you give us an idea of the performance? E.g. roughly how long would it take to process 1 million 1920×1080 JPEGs, without GPU and with GPU?

What is the scaling like? E.g. what if it was 10 million?


Scale is unfortunately not the focus of the current implementation; we plan to address it in future releases. Regarding speed and memory, the current state is roughly:

1. Hashing methods: generating the hashes is quick (a couple of seconds for about 10k images). The tricky part is the retrieval of duplicates, which on the same 10k dataset takes a few minutes. (I'd refrain from giving exact numbers, since this was measured on a local machine, which isn't a good benchmarking environment.)

2. CNN: here the time-consuming part is generating the encodings, which without a GPU takes much longer (a couple of minutes for 10k images). The retrieval part is quite quick, but requires memory.

So, at this point, using this package locally on much more than a couple of thousand images is not a good idea. We will, however, address the scaling aspect of the problem in future releases. Thanks for your question.
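A minimal pure-Python sketch of why retrieval, not hashing, is the expensive step: brute-force retrieval compares every pair of hashes. This is not imagededup's actual code; the file names, the 64-bit random hashes, and the distance threshold of 10 are illustrative assumptions.

```python
# Brute-force near-duplicate retrieval over 64-bit perceptual hashes.
# Hashing each image is O(n); retrieval compares O(n^2) pairs.
import random

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")

def find_duplicates(hashes: dict, max_distance: int = 10) -> dict:
    """Compare every hash against every other hash: O(n^2) pairs."""
    names = list(hashes)
    dups = {name: [] for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(hashes[a], hashes[b]) <= max_distance:
                dups[a].append(b)
                dups[b].append(a)
    return dups

random.seed(0)
hashes = {f"img{i}.jpg": random.getrandbits(64) for i in range(200)}
hashes["img_copy.jpg"] = hashes["img0.jpg"] ^ 0b111  # near duplicate: 3 bits flipped
dups = find_duplicates(hashes)
print(dups["img0.jpg"])  # includes "img_copy.jpg"
```

Random 64-bit hashes have an expected pairwise distance of 32 bits, so unrelated images essentially never fall under the threshold; only the deliberately planted near-duplicate is found.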


Doesn't sound too bad. But can you elaborate on the last part, about retrieval requiring memory and not scaling beyond a couple of thousand images?

How much memory would you need for ~2000 images, how slow does it get, etc.

Thx


Retrieval using CNNs requires computing a cosine similarity matrix: for n images, an n x n matrix has to be held in memory, so storage blows up quadratically as n grows. We have already made some optimizations to reduce the memory footprint, but there are clear upper limits to it (we haven't experimented to find them).

As for numbers: the CIFAR-10 example in the repo was run in a Google Colab notebook. That dataset has 60k images and ended up consuming about 6 GB of RAM using the CNN method. However, since there hasn't yet been any principled benchmarking of the package, I would treat these numbers as only marginally indicative of its capabilities. A better way to figure out whether it works for you is to try it on your own dataset, starting at a small scale and increasing the dataset size gradually.
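A back-of-envelope sketch of the quadratic memory cost described above, assuming float32 matrix entries; the encoding dimension of 576 below is an arbitrary illustration, not imagededup's actual value.

```python
# Memory needed to hold a full n x n float32 similarity matrix.
import numpy as np

def sim_matrix_bytes(n: int, itemsize: int = 4) -> int:
    return n * n * itemsize

for n in (2_000, 60_000, 1_000_000):
    print(f"{n:>9} images -> {sim_matrix_bytes(n) / 1e9:.1f} GB")

# Tiny worked example: cosine similarity of n L2-normalized encodings.
rng = np.random.default_rng(0)
enc = rng.normal(size=(2000, 576)).astype(np.float32)
enc /= np.linalg.norm(enc, axis=1, keepdims=True)
sim = enc @ enc.T          # (2000, 2000) cosine similarity matrix
print(sim.nbytes)          # 16,000,000 bytes for 2k images
```

At 60k images the raw float32 matrix alone is 14.4 GB, which is consistent with the ~6 GB observed after the memory optimizations mentioned above.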


Use approximate nearest-neighbor search. The FAISS library is good.


Thanks for the pointer, will check it out.


Couldn't you add images above the threshold to a dict/map as you iterate, rather than building the complete matrix and then iterating through it?
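A sketch of the incremental approach the parent suggests: keep a running index and query each new image against it, instead of materializing the full n x n matrix. All names and thresholds here are made up; this trades the quadratic memory for quadratic pure-Python comparison time, which matches the "way too slow" reply below.

```python
# Incremental dedup: index images one at a time, reporting near
# duplicates among everything seen so far.
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class StreamingDeduper:
    def __init__(self, max_distance: int = 10):
        self.max_distance = max_distance
        self.seen = {}                      # name -> 64-bit hash

    def add(self, name: str, h: int) -> list:
        """Return names of already-seen near duplicates, then index h."""
        dups = [n for n, other in self.seen.items()
                if hamming(h, other) <= self.max_distance]
        self.seen[name] = h
        return dups

d = StreamingDeduper()
print(d.add("a.jpg", 0b1010))   # [] - nothing seen yet
print(d.add("b.jpg", 0b1011))   # ['a.jpg'] - distance 1
```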


We tried that approach, but it was way too slow.


"The tricky part is the retrieval of duplicates, which on the same 10K dataset takes a few minutes"

"Doesn't sound too bad."

It doesn't? As a word of caution: if a 10k dataset takes a few minutes, I would be careful about how long 10MM pictures take. I predict it is not (10MM/10k) times a few minutes.
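The parent's caution, quantified: with brute-force pairwise retrieval the number of comparisons grows with n squared, so 1000x more images means 1,000,000x more work. The "3 minutes for 10k images" figure is an assumption standing in for "a few minutes".

```python
# O(n^2) scaling of pairwise retrieval, extrapolated naively.
n_small, n_large = 10_000, 10_000_000
scale = (n_large / n_small) ** 2            # ratio of pairwise comparisons
minutes_small = 3                           # assumed "a few minutes" for 10k
years_large = minutes_small * scale / 60 / 24 / 365
print(f"{scale:,.0f}x the pairs -> roughly {years_large:.1f} years")
```

So a naive extrapolation gives years, not hours, which is exactly why the sub-quadratic approaches (ANN indexes, BK-trees) discussed elsewhere in this thread matter.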


How is the hash represented? Have you looked at NN libraries like faiss [0] and NGT [1]? Those can quite easily handle a nearest-neighbor search over 10 million vectors and, from my understanding, they turn their vectors into some kind of hash that is then searched.

[0] - https://github.com/facebookresearch/faiss

[1] - https://github.com/yahoojapan/NGT


The hashes are 16-character hexadecimal strings. I had a quick look at the faiss package and it looks promising; we'd consider it for the next versions.
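A small illustration of that representation: 16 hex characters encode exactly 64 bits, and converting the strings to integers once up front makes each Hamming comparison a single XOR plus popcount. The hash values below are made up.

```python
# 16-char hex strings <-> 64-bit integers for cheap Hamming comparisons.
h1 = "9f8fa1b2c3d4e5f6"
h2 = "9f8fa1b2c3d4e5f7"          # differs in the last hex digit only
a, b = int(h1, 16), int(h2, 16)
print(a.bit_length() <= 64)       # True: fits in 64 bits
print(bin(a ^ b).count("1"))      # 1: the hashes differ in a single bit
```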


If you're interested in collaboration I'd be happy to help with a prod-focused version. My work has a need for a shardable daemon for dedup tasks. My personal email is in my description and I'm also available via josh@xix.ai.

We also have an image heavy production use case that would be able to yield some nice metrics from this tool.


I use Cython (CPU) to brute-force 400,000 pHashes of images. It takes somewhere between 100 and 200 ms to search.


I only realized now that the 100-200 ms time you refer to is for a single search, not for 400,000 searches. The package already achieves this brute-force speed. In fact, it also implements a BK-tree, which, depending on the distance threshold passed, can drastically reduce the search time. Moreover, the BK-tree search is parallelized in the package (each image's hash is searched through the tree independently once the tree is constructed). On one of the example datasets containing 10k images, with a distance threshold of 10 (for 64-bit hashes), the retrieval time per image was under 50 ms.
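For readers unfamiliar with the structure mentioned above, here is a minimal BK-tree over Hamming distance. This is a from-scratch sketch, not imagededup's implementation: the triangle inequality lets a query with a small threshold prune whole subtrees instead of scanning every hash.

```python
# Minimal BK-tree for threshold search under Hamming distance.
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self):
        self.root = None                     # node = (hash, {distance: child})

    def add(self, h: int) -> None:
        if self.root is None:
            self.root = (h, {})
            return
        node = self.root
        while True:
            d = hamming(h, node[0])
            if d in node[1]:
                node = node[1][d]            # descend along existing edge
            else:
                node[1][d] = (h, {})
                return

    def query(self, h: int, max_distance: int) -> list:
        """All stored hashes within max_distance of h."""
        results, stack = [], [self.root]
        while stack:
            value, children = stack.pop()
            d = hamming(h, value)
            if d <= max_distance:
                results.append(value)
            # Triangle inequality: only edges in [d - t, d + t] can match.
            for cd, child in children.items():
                if d - max_distance <= cd <= d + max_distance:
                    stack.append(child)
        return results

tree = BKTree()
for h in (0b0000, 0b0001, 0b1111, 0b1110):
    tree.add(h)
print(sorted(tree.query(0b0000, 1)))   # [0, 1]
```

With a tight threshold most branches fail the triangle-inequality band and are never visited, which is where the speedup over brute force comes from.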


The 100-200 ms time I referred to was indeed a single search. The difference is, it's on a single core. Cython definitely makes the Hamming distance function faster.


That sounds good! Would be great if you can share the code, or even better, make a PR to the repo.


The implementation is trivial; for speed we use:

  cdef extern int __builtin_popcountll(unsigned long long) nogil

  dist = __builtin_popcountll(key ^ phash)

It would only take a couple of minutes to fill out the rest.
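For anyone without Cython handy, a pure-Python equivalent of that popcount-based distance: XOR the two 64-bit hashes and count the set bits. (On Python 3.10+, `(key ^ phash).bit_count()` does the same thing and is faster.)

```python
# Pure-Python analogue of the Cython popcount snippet above.
def hamming(key: int, phash: int) -> int:
    return bin(key ^ phash).count("1")

print(hamming(0xF0F0F0F0F0F0F0F0, 0x0F0F0F0F0F0F0F0F))  # 64: every bit differs
```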


How long did the generating take?


We store pHashes in a database, but quickly checking on my laptop, it takes 1-2 ms to generate a single image's pHash. IO could be significant for many images.


Thanks! This is a very important problem space for us: we do document processing, and double/triple uploads are hard to detect and create unnecessary effort.

Can you elaborate a little on the performance you've observed? Can it work iteratively, basically asking one image at a time "have we already seen this?"

Would it work for text documents of varying quality as well, or is this unfeasible?


Do you use OCR on the documents?


Did you consider using the PhotoDNA hash algorithm for finding duplicates? If you've heard of it and ruled it out, I'd love to know why. While designed for a very different (and dark) purpose, it seems like it might do well for this task.


Just had a look, thanks for the pointer. And yes, it was designed for a dark purpose. I'll try to find comparisons with the currently implemented hashing methods and see if there's merit in implementing it.


Thanks Dat for open-sourcing it. Once again I'm shamelessly linking my Kaggle kernel with `imagededup` installed: https://www.kaggle.com/nulldata/image-duplicates-without-dee.... It uses `PHash()` and runs in under 88 seconds.

Anyone can just simply fork this kernel and add their dataset to it and perform the task!


Very nice. If I may ask for clarification on the CNN method (which seems to work quite well): you're taking a CNN built for classification but removing the last layer (which I believe is fully connected?) so that you only use the internal features, is that correct? I would have expected some kind of autoencoder to fit here, so it's very interesting that this works.


This is the same idea that underlies style transfers and metrics like the FID (which is used to judge generative networks' outputs on their similarity to the test set).

The idea is that the activations within an image recognition network are similar for images that are similar and so you can measure the distance between two images in a space that has some semantic meaning.


This is the concept behind "transfer learning", taking a fully-trained CNN (ResNet, Inception, MobileNet, etc) and removing the last or last two layers. What I don't understand yet is when to use which trained network, i.e. if they have different features that make them suited for different application domains.


I don't think there's a lot to say about this: you want pretraining on a large dataset and on a task that's similar to yours, with largeness probably somewhat more important. These things aren't an exact science at this point.


Any idea if there is something that supports animated gifs?



