themaLeecher
http://leecher.themasoftware.com/forum/

Filter Duplicates By URL.
http://leecher.themasoftware.com/forum/viewtopic.php?f=3&t=6635

Author:  Pablo01 [ December 23rd, 2021, 6:48 am ]
Post subject:  Filter Duplicates By URL.

Hi,

I am looking for a way or option to prevent TL from adding messages that have identical URLs.

I am scraping some websites for posts and I have thousands of duplicate posts because the subjects have been slightly changed.

I would like a way to block messages from being added if an identical URL is already in TL.

Alternatively, I would have to figure out a way to export all messages, filter out the duplicates, and then re-add them into TL, but I don't really know how to do this.
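
The export-and-filter workaround described above could be sketched roughly as follows. This is only an illustration: the thread does not describe an export format, so the list of dicts with "subject" and "url" keys and the function name dedupe_by_url are assumptions, not part of themaLeecher.

```python
def dedupe_by_url(messages):
    """Keep the first message seen for each URL and drop later repeats."""
    seen = set()
    unique = []
    for msg in messages:
        url = msg["url"].strip()  # ignore stray whitespace around the URL
        if url in seen:
            continue  # same URL already kept, so this one is a duplicate
        seen.add(url)
        unique.append(msg)
    return unique


# Hypothetical exported messages: the first two differ only by subject.
messages = [
    {"subject": "Clip A - $10", "url": "http://example.com/post/1"},
    {"subject": "Clip A - $12", "url": "http://example.com/post/1"},
    {"subject": "Clip B - $8", "url": "http://example.com/post/2"},
]
unique = dedupe_by_url(messages)
```

Keeping the first occurrence is a design choice; keeping the last one instead would preserve the most recent subject.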

Author:  Freddy [ December 23rd, 2021, 7:43 am ]
Post subject:  Re: Filter Duplicates By URL

Hi,

Subjects definitely don't change multiple times in the same day. No one does that. Checking the latest posts by subject is enough.

When leeching pages, it only checks the latest posts. If someone changes the subject of a very old post, it won't matter: themaLeecher won't leech it anyway, since it won't be on "Page 1" anymore.

Will add the duplicate URL filter.

Author:  Pablo01 [ December 23rd, 2021, 8:01 am ]
Post subject:  Re: Filter Duplicates By URL

Thanks Freddy,

I agree nobody changes the subject that much or that often.

The problem is more how I leech these messages: I have set up a subject labelling change, and therefore the messages are in TL multiple times for me.

So basically it's because of the way I get the messages, not the way websites label subjects. Since I may be changing the subject multiple times, a URL filter would be very nice as duplicate protection.

Author:  Freddy [ December 23rd, 2021, 8:52 am ]
Post subject:  Re: Filter Duplicates By URL.

You can change the subject as many times as needed inside the program. That does not affect duplicate checking by subject. The program saves the original subject internally and uses it for comparison (not the subject that you see in the program).
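
As a rough illustration of comparing by a stored original subject rather than the displayed one, here is a minimal sketch. It is not themaLeecher's actual code: the Message class, its field names, and is_duplicate_by_subject are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    original_subject: str  # subject as first leeched; never edited (assumed field)
    display_subject: str   # subject shown to the user; may be relabelled (assumed field)
    url: str


def is_duplicate_by_subject(incoming_subject, stored):
    """Compare an incoming subject against the stored original subjects only."""
    return any(m.original_subject == incoming_subject for m in stored)


stored = [Message("Clip A", "[HD] Clip A (renamed)", "http://example.com/post/1")]

# Renaming inside the program does not matter: the original subject still matches.
same = is_duplicate_by_subject("Clip A", stored)

# If the site itself changes the subject, a subject-only check no longer matches.
changed = is_duplicate_by_subject("Clip A - $12", stored)
```

Under these assumptions, a subject edited in the program is still caught, while a subject changed at the source is not, which is the case a URL-based check would cover.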

Will add the duplicate URL filter.

Edit:
4.50:
* Added "Duplicate URLs" filter in messages "Search and filter" window.

Author:  Pablo01 [ January 2nd, 2022, 8:19 am ]
Post subject:  Re: Filter Duplicates By URL.

Thanks Freddy,

Now I can filter them out after they are leeched, but as I understand it, this does not prevent TL from leeching messages with the same URL as messages already existing in TL.

The problem for me is that there is a video price in the subject. These sellers sometimes change the video price, and then TL picks up the message again and it's a duplicate, just not by subject but by URL.
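
A URL check at leech time, as opposed to filtering afterwards, might look something like the sketch below. The UrlGuard class, try_add, and normalize_url are hypothetical names for illustration, and the normalization rules (lowercasing the scheme and host part, dropping the fragment) are an assumption, not something the thread specifies.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the scheme and netloc and drop the fragment before comparing."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


class UrlGuard:
    """Remembers URLs already stored and rejects messages that repeat one."""

    def __init__(self):
        self._seen = set()

    def try_add(self, subject, url):
        # The subject is ignored on purpose: only the URL decides.
        key = normalize_url(url)
        if key in self._seen:
            return False  # URL already known, skip even if the subject changed
        self._seen.add(key)
        return True


guard = UrlGuard()
first = guard.try_add("Video - $10", "http://example.com/post/1")
repeat = guard.try_add("Video - $15", "http://EXAMPLE.com/post/1#top")  # price changed
other = guard.try_add("Video - $10", "http://example.com/post/2")
```

Here the second call is rejected because its URL matches the first after normalization, even though the price in the subject differs.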

Author:  Freddy [ January 3rd, 2022, 8:22 am ]
Post subject:  Re: Filter Duplicates By URL.

Will take a look.

Edit:

4.51:
* Improved duplicates checking by URL.

All times are UTC