Hash Or Watermark?

Slip · November 20, 2008

Heya lads n lasses.

I'm looking for some help and wonder if any techies can offer advice.

What I want to do is identify files that come from a particular computer (or, even better, a particular user). I'm teaching and the wee boogers keep copying each other's files and just changing the name. Some programs have good properties which allow you to tell the author; others do not. What I'm after is stamping each file automatically with an identifier that cannot be changed, regardless of the program that created it.

Most of the identifier searching I've been able to turn up centres around the image/music industries (watermarking) or hash identifiers (integrity checking), which don't strike me as what I want really. Hashing seems to do just the opposite of what I want- if the file is changed at all the hash will alter, showing the change. I want something that says, despite any changes that have been made, this is originally file xyz produced by computer 10 (or user Ploy). Similarly, visual watermarking isn't going to be any use really for databases, for example.
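A minimal Python sketch of the hashing behaviour described above. The sample contents and the helper name `file_digest` are made up for illustration: a renamed copy keeps the same hash, while any edit at all produces a different one, so a hash can only confirm that two files are byte-for-byte identical.

```python
import hashlib

def file_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of some file contents."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical file contents, for illustration only.
original = b"My homework answer, written by Ploy.\n"
renamed_copy = original                              # renaming does not change the bytes
edited_copy = original.replace(b"Ploy", b"Somchai")  # one small edit

print(file_digest(original) == file_digest(renamed_copy))  # same content, same hash
print(file_digest(original) == file_digest(edited_copy))   # edited content, different hash
```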

Does anyone know of anything freely available or has anyone got any ideas? (I realise of course that this won't stop them using cut and paste where-ever they can or finding other loopholes- sneaky little blighters).

Ta

And also thanks from all the students in the future who might actually learn something because they have to think, rather than just copying their mate because they're too idle to be arsed.

BangkokBP · November 20, 2008

Hi Slip,

Steganography might be a solution, I have never tried it for this kind of application but it might work.

More info here

http://en.wikipedia.org/wiki/Steganography

good luck,

bbp

Slip · November 20, 2008

Hi BangkokBP-

Thanks for the reply. I've bumped into steganography a couple of times in my search but not pursued it. Strange really, as I had a bit of a thing about it when I was younger and more revolutionary and didn't like the fact that governments can read my mail. (Since then, of course, I've realised that it's boring and no-one would be interested.)

Thanks for the link- I'll check it out

BangkokBP · November 20, 2008

I am not sure it would work at all, depending on the type of files, how they are edited and submitted etc.

From your description of the problem you want to prevent plagiarism between the students (http://en.wikipedia.org/wiki/Plagiarism_detection), but if the work is submitted to, and reviewed by, you, don't you notice when two works are similar enough to be copied? Or do they edit them enough to escape detection?

Just curious,

bbp

Slip · November 20, 2008

Yes, thanks bbp, having looked at it again I'm not sure it's helpful except by some very obscure and hard to execute application.

But, yeah; it's anti-plagiarism. I have to mark, say, 25 pieces of work, so the 'notice by eye' technique means I then have to go back and review the work (sometimes 5-6 times (lol 23-25 times)) to round up the guilty, manually looking for the differences. That is time consuming and rarely ultimately justified, so I just do a round-up twice a term to catch the kids who think they can get away with it. Programs like the Office suite, for example, store info like the original author, but most do not; then you may have to look for extra spaces or different capital letters, for example, and that's just hassle. If they are clever enough to edit the files they're uncatchable- you can only really prove it for files that are identical.

I've encouraged my students to realise that I know all and they can't get one over on me, which seems to do wonders for their motivation (to learn bad stuff-lol), but my defences are becoming stretched and I really would like a quick, easy way to stamp a file so I know where it's come from.

Thanks for that inspiration about 'anti-plagiarism' tho- I've been thinking too much about hash codes and watermarks and searching on all the wrong things.

Veazer · November 20, 2008

For a free solution comparing 2 documents you might try WinMerge. It works for text based docs like programming code, database dumps, etc. It will highlight portions that are different, and thus if you see unhighlighted portions then you know they are the same. I think it might be too precise for you though, it is looking for lines of code that are character for character the same.

It also depends on whether you only want to compare your students to each other or also check if they are using essays available online, etc.

You might try some of the software specifically aimed at plagiarism. This looks well suited to your needs:

http://www.anticutandpaste.com/antiplagiarist/

Slip · November 21, 2008

Thanks Veazer -I'll give them a whirl.

autonomous_unit · November 21, 2008

It would help to know what kind of files are in question. But regardless, the problem is quite difficult. You are on the right track to think about comparing things for similarities instead of trying to mark the file creation in some way.

Someone mentioned earlier that you should recognize the duplication when grading. This, unfortunately, is the best answer. The problem of detecting pairwise similarities in a set of many students becomes very expensive, so having your brain do the work is the most practical. Use the file-comparison program to help convince yourself how similar two files are after you've noticed them.

Imagine you didn't even read the files first. For a class of 25 students, each submitting one file, you would have to consider 300 possible pairings of different files to brute-force check all of them! Are you going to launch a comparison program 300 times for one batch of student work? That would be 10 hours of work if it takes 2 minutes for each comparison. In university, we faced this issue in lectures with 200 students, which leads to about 20 thousand possible pairings... (those up on their maths call this particular calculation "N choose 2", where N is the number of students).
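For anyone who wants to check the arithmetic, here is a small Python sketch of the "N choose 2" count. The function names and the 2-minutes-per-comparison default are just the figures used for illustration, not anything standard.

```python
from math import comb

def pairings(n_students: int) -> int:
    """Number of distinct pairs of submissions: N choose 2 = N * (N - 1) / 2."""
    return comb(n_students, 2)

def hours_needed(n_students: int, minutes_per_comparison: float = 2.0) -> float:
    """Total time to compare every pair by hand, in hours."""
    return pairings(n_students) * minutes_per_comparison / 60

print(pairings(25))      # 300
print(hours_needed(25))  # 10.0
print(pairings(200))     # 19900
```

The count grows with the square of the class size, which is why checking every pair by hand stops being practical so quickly.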

It's an interesting problem that brings out the obsessive compulsiveness of some technical people. But in practice, once you detect similarities you still have a human problem of distinguishing who (if anybody) actually cheated. Sometimes a poorly designed assignment leads many students to independently write almost the same answer, leading to false accusations if you are not careful. Good luck.

Plus · November 21, 2008

What about taking better control of the production process rather than comparing results?

Let's say students can submit their assignments only through an online portal that doesn't allow cut and paste of any kind, something like Flash that doesn't take right clicks, just build a simple submit form and that's it.

On the server side you can run a script that compares each submission to previous entries, in case someone types up a copy from scratch.

The script doesn't have to be very complex - just take a few random sample strings, maybe 100 characters long, and see if they pop up anywhere else.
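A rough Python sketch of that sampling idea, under a few assumptions: submissions are plain text already loaded into a dictionary keyed by student name, and the function names, the sample count of 5, and the example texts are all made up for illustration.

```python
import random

def sample_snippets(text, count=5, length=100, rng=None):
    """Pick `count` random substrings of `length` characters from `text`."""
    rng = rng or random.Random()
    if len(text) <= length:
        # Too short to sample from: use the whole text (or nothing if empty).
        return [text] if text else []
    starts = [rng.randrange(len(text) - length + 1) for _ in range(count)]
    return [text[s:s + length] for s in starts]

def find_matches(new_text, previous, count=5, length=100, rng=None):
    """Return names of earlier submissions containing any sampled snippet."""
    snippets = sample_snippets(new_text, count, length, rng)
    return sorted(
        name for name, old_text in previous.items()
        if any(snippet in old_text for snippet in snippets)
    )

if __name__ == "__main__":
    essay = "".join(f"Sentence {i} about turtles and squares. " for i in range(20))
    earlier = {
        "ploy": "My work. " + essay,                 # contains the essay verbatim
        "somchai": "Something else entirely. " * 20,
    }
    print(find_matches(essay, earlier))
```

Exact substring matching like this only catches verbatim copies; changed spacing or capitals inside a sampled snippet would slip past unless the texts are normalised first.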

Slip · November 21, 2008

Across the grades I use many different programs -the visual ones are no problem as you would immediately see. At the moment I'm using logo with some of my kids. The office family provides data about the author so make it easy (my students don't seem to have sussed you can change this yet)- as much as anything it's useful to be able to point out the evidence when you get the inevitable 'but I didn't copy teacher'.

Pretty well at the moment I guess I do as you are saying, in that I will spot jarring similarities when I mark, so then have to go back and review, which is sometimes time consuming and a bit irritating to be honest. That's why the information already provided by some programs is very useful- I can check the authors very quickly once I notice similarities. Sadly not all programs provide good properties data. As you suggest, the comparison progs should speed up that process, especially for the Logo work I'm doing now, as I expect the code to be very similar but any little common mistakes should turn up rather easily. But yes, I'm only looking for files that are identical, and then when I notice mistakes or anomalies common to both I can guess that this is a save-out version of someone else's file. I don't want to give the impression I'm obsessing about this, lol, I'm not really hunting for copying but sometimes it's obvious; it just doesn't seem very scientific for computers to have to 'do it by hand'.

Anyway thanks for all the replies everyone.

Plus · November 21, 2008

I don't know your actual situation, your abilities or facilities, or even what this "logo thing" is.

At this point you are trying to solve the problem, which is fine, but how about preventing the problem in the first place?

Can you control how students access their files - why is it so easy to open someone else's file and "save as" it? That would leave a "watermark", sure. Maybe it's enough for now - you can just scan the files for the same "watermark", some sort of meta information about the file not easily visible to the user, or whatever it is.

It will be more difficult with copy-paste 'cos that doesn't pick up any meta data, and plain text can be completely reformatted.

In that case you'd need a program to read files line by line and compare them. Great if you find one that can do it automatically, but the custom made script to do it should be really simple if you find a programmer.
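As a sketch of what such a custom script might look like in Python: it assumes the submissions are plain text held in a dictionary keyed by student name, and the function names and the 0.8 threshold are arbitrary choices for illustration. Lines are lower-cased and have their whitespace collapsed before comparison, so extra spaces and changed capitals don't hide a copy.

```python
from itertools import combinations

def normalised_lines(text):
    """Set of non-blank lines, lower-cased, with runs of whitespace collapsed."""
    return {" ".join(line.lower().split())
            for line in text.splitlines() if line.strip()}

def shared_fraction(a, b):
    """Fraction of the shorter file's distinct lines that also appear in the other."""
    lines_a, lines_b = normalised_lines(a), normalised_lines(b)
    if not lines_a or not lines_b:
        return 0.0
    return len(lines_a & lines_b) / min(len(lines_a), len(lines_b))

def suspicious_pairs(submissions, threshold=0.8):
    """Compare every pair of submissions; return those at or above the threshold."""
    results = []
    for (name1, text1), (name2, text2) in combinations(sorted(submissions.items()), 2):
        score = shared_fraction(text1, text2)
        if score >= threshold:
            results.append((name1, name2, score))
    return results

if __name__ == "__main__":
    work = {
        "a": "fd 100\nrt 90\nfd 50",
        "b": "FD  100\nRT 90\nfd 50\n",   # same lines, different spacing and capitals
        "c": "bk 10\nlt 45",
    }
    print(suspicious_pairs(work))
```

For short assignments where everyone is expected to write nearly the same thing, a high score on its own proves little; shared mistakes are the more telling sign.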

Actually your own scripts would pick up your watermarks, too. That would be faster.

Slip · November 21, 2008

Ha ha, thanks Plus- your post got me thinking.

I'm in favour of writing up a flash based assessment module, it's on my rather long to do list. I've made some flash based interactive tutorials (5 grades' worth) but developing the testing uses has so far eluded me- I can see it's mostly action-scripting. Unfortunately once I get into scripting I'm pretty clueless. I can script buttons or events in flash but nothing fancy. But you say on the server side so what would you use, not flash I'm presuming?

The students all save their files to a common folder on the LAN, I could make them save to a separately passworded folder of their own but it seems clumsy and when it came to harvest the files I'd be rooting around in 20 odd sub-folders. Again you might tell me I could do this with some sort of script- this is an area I badly need to research. Do you know of any handy links? Please PM me if you can't post them here.
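One possible way to do the harvesting, sketched in Python under some assumptions: each student has their own folder directly under a shared root, the destination folder sits outside that root, and the `student__filename` naming scheme is invented here for illustration. Because each copy is prefixed with the folder it came from, the file name itself records whose folder it was saved in.

```python
import shutil
from pathlib import Path

def harvest(root, dest):
    """Copy every file in root/<student>/ into dest as <student>__<filename>.

    `dest` is assumed to be outside `root`. Returns the new file names.
    """
    root, dest = Path(root), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for student_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for source in sorted(p for p in student_dir.iterdir() if p.is_file()):
            target = dest / f"{student_dir.name}__{source.name}"
            # copy2 also tries to preserve timestamps such as last-modified time.
            shutil.copy2(source, target)
            copied.append(target.name)
    return copied

# Example call (hypothetical paths):
# harvest("//server/classwork", "E:/marking")
```

This only says which folder a file was collected from, not who originally wrote it, so it relies on students being unable to write into each other's folders.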

Funnily enough Logo is a programming language and interface with a bias towards children. (Actually it's pretty cool- if there are any parents out there it's worth checking out- just google 'MSW Logo' and there's tons of good guidance materials out there too). Theoretically I imagine I should be able to write something in that to do the job, but the smarts just ain't there.

But as I was chewing over your post it made me remember something which I was stupid to forget. I copy the students' files from a central server to my memory stick so I can lounge in my chair at home to do my marking- when I do that all useful metadata is often wiped (well that's my experience of other programs anyway). Having checked logo it looks like its meta-data records are reasonably good. Although on my home pc it's not actually recorded anything- just lots of blank boxes.

I'm sure a programmer could help me but I suffer from this irritating need to re-invent the wheel. I think I understand what you're suggesting with a script that watermarks but I don't think I have anything like the skill to do it (yet) sadly! Any resource pointers would be extremely gratefully received.

Thanks again everyone for interesting, useful replies. I haven't solved the problem entirely to my liking yet but I think I have a work-around and some avenues to explore at my leisure.

dave_boo · November 25, 2008

A quick and dirty way to do it is to run both files through 'diff' if you're on a *nix. If there's quite a bit of similarity, you'll see the result spit out an extremely small file. On the other hand, if there is little similarity, the output will be the original file size, more or less. This is easily defeated, though, simply by changing up the paragraphing if the students know that you're doing it.
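If you're not on a *nix, roughly the same check can be sketched with Python's standard `difflib` module. The function name and sample texts are made up; the idea is just to count how many lines differ, which plays the role of the size of the diff output.

```python
import difflib

def changed_line_count(a: str, b: str) -> int:
    """Count lines that appear in only one of the two texts, line by line."""
    diff = difflib.ndiff(a.splitlines(), b.splitlines())
    # ndiff marks lines unique to the first text with "- " and to the second with "+ ".
    return sum(1 for line in diff if line.startswith(("- ", "+ ")))

if __name__ == "__main__":
    first = "one\ntwo\nthree"
    second = "one\n2\nthree"
    print(changed_line_count(first, first))   # identical texts
    print(changed_line_count(first, second))  # one line replaced
```

A small count relative to the file length suggests a near copy; as noted above, re-breaking the paragraphs changes every line and defeats it.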
