 Post subject: Filter Duplicates By URL.
Posted: December 23rd, 2021, 6:48 am

Joined: November 23rd, 2014, 6:31 am
Posts: 276
Hi,

I am looking for a way/option to prevent TL from adding messages that have identical URLs.

I am scraping some websites for posts and I have thousands of duplicate posts because the subjects have been slightly changed.

I would like a way to block messages from being added if an identical URL is already in TL.

Alternatively, I have to figure out a way to export all messages, filter out the duplicates, and then re-add them into TL - but I don't really know how to do this.
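For reference, the export-and-filter route could look roughly like the sketch below. It assumes the messages can be exported as CSV with `subject` and `url` columns; that export format, the column names, and the `drop_duplicate_urls` helper are all hypothetical and not something TL is confirmed to provide.

```python
import csv
import io


def drop_duplicate_urls(rows, url_field="url"):
    """Keep the first row seen for each URL, preserving the input order."""
    seen = set()
    unique = []
    for row in rows:
        key = row[url_field].strip()
        if key in seen:
            continue  # same URL already kept, so skip this row
        seen.add(key)
        unique.append(row)
    return unique


# Stand-in for an exported file (hypothetical format and data).
exported = io.StringIO(
    "subject,url\n"
    "Video A - $10,http://example.com/t/1\n"
    "Video A - $12,http://example.com/t/1\n"
    "Video B - $5,http://example.com/t/2\n"
)

rows = list(csv.DictReader(exported))
unique = drop_duplicate_urls(rows)

for row in unique:
    print(row["subject"], row["url"])
```

The first row for each URL wins, so a later copy with a changed subject is the one that gets dropped.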


 Post subject: Re: Filter Duplicates By URL
Posted: December 23rd, 2021, 7:43 am
Site Admin

Joined: March 10th, 2011, 11:14 pm
Posts: 12630
Location: Earth
Hi,

Subjects definitely don't change multiple times in the same day. No one does that. Checking the latest posts by subject is enough.

When leeching pages, it only checks the latest posts. If someone changes the subject of a very old post it won't matter; themaLeecher won't leech that post anyway, since it won't be on "Page 1" anymore.

Will add the duplicate URL filter.

_________________
themaPoster | themaCreator | themaManager | themaLeecher | themaRegister


 Post subject: Re: Filter Duplicates By URL
Posted: December 23rd, 2021, 8:01 am

Joined: November 23rd, 2014, 6:31 am
Posts: 276
Thanks Freddy,

I agree nobody changes the subject that much or that often.

The problem is more how I leech these messages - I have set up the subject labelling to change, and therefore the same messages end up in TL multiple times for me.

So basically it's because of the way I get the msgs - not the way websites label subjects - and since I may be changing the subject multiple times, a URL filter would be very nice as duplicate protection.


 Post subject: Re: Filter Duplicates By URL.
Posted: December 23rd, 2021, 8:52 am
Site Admin

Joined: March 10th, 2011, 11:14 pm
Posts: 12630
Location: Earth
You can change the subject as many times as needed inside the program. That does not affect duplicate checking by subject. The program saves the original subject internally and uses that for comparison (not the subject you see in the program).
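As an illustration only, the behaviour described above can be modelled like this. It is not themaLeecher's actual code; the `Message` class, its field names, and `is_duplicate_by_subject` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    original_subject: str  # subject as leeched; never edited afterwards
    subject: str           # subject shown in the program; the user may rename it
    url: str


def is_duplicate_by_subject(candidate_subject, stored):
    """Compare against the saved original subjects, not the displayed ones."""
    return any(m.original_subject == candidate_subject for m in stored)


stored = [Message("Video A", "Video A", "http://example.com/t/1")]

# Renaming the displayed subject leaves the saved original untouched.
stored[0].subject = "My renamed label"

print(is_duplicate_by_subject("Video A", stored))
print(is_duplicate_by_subject("My renamed label", stored))
```

In this model, a re-leeched post still matches after a rename because only `original_subject` takes part in the comparison.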

Will add the duplicate URL filter.

Edit:
4.50:
* Added "Duplicate URLs" filter in messages "Search and filter" window.



 Post subject: Re: Filter Duplicates By URL.
Posted: January 2nd, 2022, 8:19 am

Joined: November 23rd, 2014, 6:31 am
Posts: 276
Thanks Freddy,

Now I can filter them out after they are leeched - but as I understand it, this does not prevent TL from leeching messages with the same URL as messages already existing in TL.

The problem for me is that there is a Video Price in the subject - and these sellers sometimes change the video price, and then TL picks up the message again and it's a duplicate - just not by subject, but by URL.
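To make the request concrete: the idea is a check at the moment a message is added, rather than a filter afterwards. A minimal sketch, assuming messages are keyed by their exact URL string; the `MessageStore` class is hypothetical and says nothing about how TL stores messages.

```python
class MessageStore:
    """Toy message list that refuses a message whose URL is already stored."""

    def __init__(self):
        self._by_url = {}

    def add(self, subject, url):
        """Store the message and return True, or return False for a known URL."""
        if url in self._by_url:
            return False  # same URL already present, even if the subject differs
        self._by_url[url] = subject
        return True

    def __len__(self):
        return len(self._by_url)


store = MessageStore()
first = store.add("Video A - $10", "http://example.com/t/1")
# The seller changed the price, so the subject differs but the URL does not.
second = store.add("Video A - $12", "http://example.com/t/1")

print(first, second, len(store))
```

With a check like this, a price change in the subject would not create a second entry, because the subject is never compared.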


 Post subject: Re: Filter Duplicates By URL.
Posted: January 3rd, 2022, 8:22 am
Site Admin

Joined: March 10th, 2011, 11:14 pm
Posts: 12630
Location: Earth
Will take a look.

Edit:

4.51:
* Improved duplicates checking by URL.



All times are UTC