Plagiarism and content theft have intrigued me of late. I got to thinking and came up with an idea for a very simple digital fingerprint for online content. Since content scrapers simply scrape, paste, and profit, they (ideally) wouldn’t notice a small ‘fingerprint’ (a keyword, phrase, or personal googlewhack) sandwiched in with the regular text of your content. Because your fingerprint is uniquely yours, only those pages and posts where you placed it should actually contain it. A simple search via Google or another search engine could confirm this on a regular basis. If your fingerprint turns up anywhere else on the web, it is either coincidence or plagiarism.

Taking this idea, I have turned it into a conceptual WordPress plugin that would be easy for the average blogger to use. Here is what I see as the major components of that plugin:
1. Add an invisible class using CSS;
2. Add a back-end menu to allow the user to create a personal digital fingerprint;
3. Add a button to the WordPress post editor to insert the fingerprint where needed in posts;
4. Create an RSS feed for the fingerprint search (via Google);
5. Subscribe to the search feed (and possibly display the feed in the WordPress back end).
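Components 1 to 3 could be sketched roughly as follows. This is only an illustration in Python rather than actual plugin code: the class name `fp-hidden`, the `[fingerprint]` placeholder, and both function names are assumptions made up for the example, and the CSS rule that hides the class is left to the theme.

```python
import html
import secrets

# Hypothetical CSS class name; the stylesheet rule that hides it is not shown.
FINGERPRINT_CLASS = "fp-hidden"


def make_fingerprint(prefix: str = "fp") -> str:
    """Return a random token that is very unlikely to appear anywhere else."""
    return f"{prefix}{secrets.token_hex(8)}"


def insert_fingerprint(post_html: str, fingerprint: str,
                       marker: str = "[fingerprint]") -> str:
    """Replace an editor placeholder with the fingerprint wrapped in a span."""
    span = f'<span class="{FINGERPRINT_CLASS}">{html.escape(fingerprint)}</span>'
    return post_html.replace(marker, span)


if __name__ == "__main__":
    fp = make_fingerprint()
    print(insert_fingerprint("<p>First paragraph. [fingerprint]</p>", fp))
```

A random token is the simplest choice here; a made-up word or phrase would blend into the text better but is harder to guarantee unique.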
The questions:
Is this a good idea?
Would theory and practice be close enough to make this type of application useful?
What limitations would there be?
RSS feeds would need to have the fingerprint in them, so all RSS subscribers would also see the fingerprint. How would you address this?
Your ideas, thoughts, rants, musings would be appreciated.
01 May 06
7:12 pm
I think something like that would be good. While I have used content from other sites, I only quote the article, link to the original and rarely do I steal an entire post from another blogger.
01 May 06
8:23 pm
I, too, am a huge believer in digital fingerprints! Numly allows authors to fingerprint their uploaded works such as images, music, documents, etc. We use this fingerprint to associate the work with the author. This way, an orphaned work can be reassociated with its author, and anyone wishing to use an image that they found on the web can upload it to Numly and determine whether it has an associated Numly Number, and thus an author or artist who can be contacted.
I like the idea of digitally fingerprinting content in blogs and RSS feeds as well. This is a similar idea to the micro id concept. We would be willing to help in any way that we could. Feel free to contact me at chris at numly.com.
01 May 06
9:22 pm
OK, I’ll take a few cracks at this.
First, the idea seems fairly sound. I’ve seen similar ideas discussed elsewhere and generally fallen behind it. However, there are a few things to consider before launching into it.
For starters, you don’t want to reinvent the wheel. Numly Numbers already provide digital fingerprinting. Each ESN will be unique to that entry and, since you already use the Numly plugin, it’s trivial to add the numbers to your feed.
Second, there are already feed copyright plugins that append unique information to each entry in the feed. You can customize this to your liking, adding any kind of strange term you want.
Third, I have to wonder how much is gained versus just using statistically improbable phrases from the work itself. While having a fingerprint might simplify the process, if the fingerprint is omitted, the simplicity is for naught.
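To make the "statistically improbable phrase" idea concrete, here is a rough sketch in Python. It is an assumption-laden toy, not how any existing service picks its phrases: it ranks every n-word phrase in a post by how rarely its words occur in some background text that you supply, and the function names and scoring rule are invented for the example.

```python
import re
from collections import Counter


def words(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def improbable_phrases(post: str, background: str,
                       n: int = 3, top: int = 3) -> list[str]:
    """Rank n-word phrases from `post` by how rare their words are in `background`."""
    bg = Counter(words(background))
    tokens = words(post)
    scored = []
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        # Lower score means rarer: sum of background counts for each word.
        score = sum(bg[w] for w in gram)
        scored.append((score, i, " ".join(gram)))
    scored.sort()
    return [phrase for _, _, phrase in scored[:top]]
```

A real background corpus would need to be far larger than a single blog for the ranking to mean much, and rare words are not the same thing as a phrase that appears nowhere else on the web.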
Fourth, using an invisible CSS class could hurt you in several ways. First, it could enable scrapers to skip the fingerprint, as many now look at the structure of the page they’re scraping. Second, it wouldn’t inhibit copy-and-paste plagiarism. Finally, search engines tend to penalize sites that hide text, any text. How much of a penalty I can’t say, but I’ve seen it happen before.
On that note, the idea has a lot of merit. I do have to warn though that, if you’re automatically generating Google searches, you might want to make sure you’re not violating the TOS. Several have gotten in trouble for that.
Personally, I like the idea and would be interested in playing around with this plugin.
Hope that you are well!
01 May 06
11:46 pm
Great comments and thoughts thus far. Firstly, I don’t want to reinvent the wheel. If something exists out there that can work, I’d choose that. While I really like Numly Numbers, the current implementation (at least as I have it configured) doesn’t have the number as part of the content, and the numbers aren’t part of the RSS feed. Again, this could be the way I have it set up.
Regardless, a Numly Number looks out of place. Any human content scraper will easily be able to tell that they probably shouldn’t copy it too.
Feed copyright plugins that add custom text sound like a good idea. My only concern is that if your fingerprint is only in your feed, that doesn’t help you find the source-code scraper or the cut-and-paster.
Using a statistically improbable phrase is a good idea. However, as Jonathen points out, it requires a SIP for each piece of content you make. Managing your SIP database and searching for each one on any kind of regular basis will be a challenging endeavour.
Invisible text is generally considered bad form (and for good reason). However, there is precedent for having invisible text. Look at the source code for therapistfinder.net (that’s ‘therapist finder’, not ‘the rapist finder’ as I first read it). Invisible text is used for speech readers and accessibility (see seologic for an explanation).
There would be no violation of the Google TOS if the end users sign up for an API key. See Google Search to RSS using SOAP API for an example script.
Also, Google Blog Search already has RSS feeds for search. Here is a Google Blog Search RSS for ‘digital fingerprint’. This could be good for finding bloggers at least.
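Once a search feed for the fingerprint exists, checking it could look something like this Python sketch. It assumes a plain RSS 2.0 document with `item`, `title`, and `link` elements; the function name `foreign_matches` is made up for the example, and fetching the feed is deliberately left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def foreign_matches(rss_xml: str, own_domain: str) -> list[tuple[str, str]]:
    """Return (title, link) for feed items that are not hosted on own_domain."""
    root = ET.fromstring(rss_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        host = urlparse(link).hostname or ""
        # Skip our own site and its subdomains; everything else is worth a look.
        if host != own_domain and not host.endswith("." + own_domain):
            hits.append((title, link))
    return hits
```

Anything this returns is only a candidate: a match may be a legitimate quote with a link back, so a human still has to look at each result.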
You all raise valid issues and so far, it seems you guys like the idea in principle. Going from idea to the ‘right’ idea to code to practice is a tricky undertaking. Many hands make light work.
Another idea just occurred to me: how about an additional textarea in the WordPress post editing screen where you can enter a SIP or phrase from your content? On publish, a link to the Google search for the SIP is created (visible only to the admin, and similar to the ‘edit’ links used on many themes). Now, anytime the author wants to check whether the post may have been plagiarised, they just click the SIP link. Hrmmmm…
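Building that SIP link is the easy part. A minimal sketch in Python, assuming the ordinary Google web search URL with a quoted `q` parameter (the helper name `sip_search_url` is made up here):

```python
from urllib.parse import urlencode


def sip_search_url(phrase: str) -> str:
    """Build a Google exact-phrase search URL for a phrase taken from a post."""
    # Collapse whitespace and drop stray double quotes inside the phrase.
    cleaned = " ".join(phrase.replace('"', "").split())
    # Wrapping the phrase in double quotes asks for an exact-phrase match.
    return "https://www.google.com/search?" + urlencode({"q": f'"{cleaned}"'})
```

Because the author clicks the link by hand, no searches are generated automatically.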
02 May 06
9:35 am
First off, it’s pretty trivial to put ESNs into your RSS feed. Editing the WordPress RSS feed takes a little extra effort, since you have to change permissions on the RSS template file and may have to edit it by hand, but it can be done with little knowledge.
All you have to do is take the exact same Numly template tag you used for your regular template and place it within the content section of the post. You will probably have to edit the plugin to remove the DIV element (so it will validate), and you can then modify it so that it blends in better. You could, theoretically, change the number itself to text and just link to the validation page, for example.
The Numly plugin is pretty easy to do these kinds of edits on, well made in that regard especially. I’m the worst PHP programmer out there but it only took me a few minutes to switch to the new Numly server when the change happened.
Regardless, the problems you list aren’t unique to any one digital fingerprinting system; they apply to all digital fingerprints. Pretty much every system out there is in danger of being hacked off, either by machines or by humans. Numly Numbers, SIPs, even your fingerprints can all be hacked off. Right now some sploggers are stealing only the first few sentences of an entry, even if the full feed is available; others ignore everything inside special CSS structures; and still others remove ALL code from the post.
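As a small illustration of the "remove ALL code" case, the Python sketch below mimics a scraper that keeps only text nodes. It is an assumption about how such a scraper might behave, not a description of any real one, and the names `TextOnly`, `strip_markup`, and `survives` are invented for the example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes only, roughly what a markup-stripping scraper keeps."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_markup(html_text: str) -> str:
    """Return the text content of html_text with tags and comments dropped."""
    parser = TextOnly()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


def survives(html_text: str, fingerprint: str) -> bool:
    """Report whether the fingerprint is still present after stripping."""
    return fingerprint in strip_markup(html_text)
```

Under this model a fingerprint inside a hidden span survives stripping (and becomes visible in the copy), while one placed in an HTML comment disappears entirely.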
There’s no way to completely secure any fingerprint against being hacked off, either by accident or by intent.
Also, all methods suffer from the same problem of maintaining a large database. Whether you use SIPs, ESNs or tags, your database grows quickly and creates problems. One post a day for a year creates 365 entries in a database that have to be tracked. No small feat, no matter the format.
Without some way to cull old and unneeded entries, even with RSS feeds, the process could get very taxing.
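One simple culling rule is to stop tracking entries past a certain age. The Python sketch below assumes a hypothetical `Entry` record and an arbitrary 180-day cutoff; neither comes from any existing plugin.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Entry:
    """One tracked fingerprint: which post it belongs to and when it went up."""
    post_id: int
    phrase: str
    published: date


def cull(entries: list[Entry], today: date, max_age_days: int = 180) -> list[Entry]:
    """Keep only entries published within the last max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [e for e in entries if e.published >= cutoff]
```

With one post a day for a year and the 180-day cutoff, this leaves roughly half the entries to search for. Age is a crude measure, though, since an old post that still draws traffic may be exactly the one scrapers want.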
Finally, we might want to look into working with Feedburner. Not only do they track uses of a feed, including scrapers, but their FeedFlare service has an interesting API that might be useful.
Personally though, I don’t think that there is any one right method for handling this. All methods have advantages and disadvantages but doubling up risks doubling the burden.
I am excited about some companies I’m hearing about that are using new methods to protect RSS feeds and posts. But their services are probably weeks, if not months, away.
In the meantime, there’s no easy route to protection and, though I still love the ideas you present, especially the one you mentioned about adding an SIP box, we have to realize that total protection isn’t possible and all systems will have flaws.
10 Jun 06
3:17 pm
[...] There’s also a goldmine of content about digital fingerprinting. [...]
20 Jan 08
8:21 am
To not confuse my readers, I tried ‘commenting-out’ the fingerprint, so it’s not visible, making it small (if commenting out didn’t work), and adding line breaks so the fingerprint is in a stand-alone paragraph, not stuck up against the tail end of the first paragraph.
In Feeds, none of my code is treated as code - it is all displayed.
Is there a way to get feeds to treat code as code?
I hope the code will display below
`[[this post from example.com]]`
20 Jan 08
8:24 am
the code didn’t display, despite using backticks and
tags, so here it is again, this time with extra spaces around the :[[this post from example.com]]
20 Jan 08
8:27 am
Despite adding spaces around the less than and greater than signs (is there a coding-related term for those?), the code still did not display.
So now I’ve removed the greater than and less than signs:
br/ br/ small [[this post from example.com]] /small !–test_-_content_taken_from_example.com–
27 Jul 09
11:56 pm
[...] Website Maxpower Digital fingerprints for content — would this help against plagiarism? Found this interesting – Saved link here for later reference. [...]