Hash Or Watermark?


Slip

Heya lads n lasses.

I'm looking for some help and wonder if any techies can offer advice.

What I want to do is identify files that come from a particular computer (or, even better, a particular user). I'm teaching and the wee boogers keep copying each other's files and just changing the name. Some programs store good properties that let you tell the author; others do not. What I'm after is stamping each file automatically with an identifier that cannot be changed, regardless of the program that created it.

Most of the identifier searching I've been able to turn up centres around the image/music industries (watermarking) or hash identifiers (integrity checking), which don't strike me as what I want really. Hashing seems to do just the opposite of what I want: if the file is changed at all, the hash will alter, showing the change. I want something that says that, despite any changes that have been made, this is originally file xyz produced by computer 10 (or user Ploy). Similarly, visual watermarking isn't going to be any use for databases, for example.
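
To illustrate the point about hashing, here is a minimal Python sketch. The file contents are made-up sample bytes and `sha256_hex` is just a helper name chosen for this example. It shows that a renamed copy keeps the same hash, while a one-character edit changes it, which is why a hash checks integrity rather than origin.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some file contents as hex text."""
    return hashlib.sha256(data).hexdigest()

# Made-up sample contents standing in for three student files.
original = b"REPEAT 4 [FD 100 RT 90]\n"
renamed_copy = original                      # same bytes, saved under a new name
edited_copy = b"REPEAT 4 [FD 100 RT 91]\n"   # one character changed

# The hash depends only on the bytes, not on the file name.
same_after_rename = sha256_hex(original) == sha256_hex(renamed_copy)
same_after_edit = sha256_hex(original) == sha256_hex(edited_copy)

print(same_after_rename, same_after_edit)
```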

Does anyone know of anything freely available, or has anyone got any ideas? (I realise of course that this won't stop them using cut and paste wherever they can or finding other loopholes, sneaky little blighters.)

Ta

And also thanks from all the students in the future who might actually learn something because they have to think, rather than just copying their mate because they're too idle to be arsed.

Link to comment
Share on other sites

Hi BangkokBP-

Thanks for the reply. I've bumped into steganography a couple of times in my search but not pursued it. Strange really, as I had a bit of a thing about it when I was younger and more revolutionary and didn't like the fact that governments can read my mail. (Since then, of course, I've realised that it's boring and no-one would be interested.)

Thanks for the link- I'll check it out

Link to comment
Share on other sites

I am not sure it would work at all, depending on the type of files, how they are edited and submitted etc.

From your description of the problem, you want to prevent plagiarism between the students (http://en.wikipedia.org/wiki/Plagiarism_detection). But if the work is submitted to, and reviewed by, you, don't you notice when two works are similar enough to have been copied? Or do they edit them enough to escape detection?

Just curious,

bbp

Link to comment
Share on other sites

Yes, thanks bbp. Having looked at it again, I'm not sure it's helpful except by some very obscure and hard-to-execute application.

But, yeah, it's anti-plagiarism. I have to mark, say, 25 pieces of work. With the 'notice by eye' technique I then have to go back and review the work (sometimes 5-6 times (lol 23-25 times)) to round up the guilty, manually looking for the differences. That is time consuming and rarely ultimately justified, so I just do a round-up twice a term to catch the kids who think they can get away with it. Programs like the Office suite, for example, store info like the original author, but most do not, so you may have to look for extra spaces or different capital letters, for example, and that's just hassle. If they are clever enough to edit the files they're uncatchable; you can only really prove files that are identical.
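
For the "identical file, different name" case, a short script can do the round-up. This is only a sketch under assumptions: `find_identical` is a hypothetical helper, and it assumes all submissions sit directly in one folder. It groups files whose bytes match exactly, so any edit at all (even an extra space) will escape it.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_identical(folder):
    """Group files in `folder` whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path.name)
    # Only digests shared by two or more files point at a renamed copy.
    return [names for names in by_digest.values() if len(names) > 1]
```

Calling `find_identical` on the shared folder path would return a list of groups of file names, one group per set of identical files.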

I've encouraged my students to realise that I know all and they can't get one over on me, which seems to do wonders for their motivation (to learn bad stuff, lol), but my defences are becoming stretched and I really would like a quick, easy way to stamp a file so I know where it's come from.

Thanks for that inspiration about 'anti-plagiarism' tho. I've been thinking too much about hash codes and watermarks and searching on all the wrong things.

:o

Link to comment
Share on other sites

For a free solution for comparing 2 documents you might try WinMerge. It works for text-based docs like programming code, database dumps, etc. It will highlight portions that are different, so any unhighlighted portions are the same in both. I think it might be too precise for you though, as it looks for lines that are character-for-character the same.

It also depends on whether you only want to compare your students to each other or also check if they are using essays available online, etc.

You might try some of the software specifically aimed at plagiarism. This looks well suited to your needs:

http://www.anticutandpaste.com/antiplagiarist/

Link to comment
Share on other sites

It would help to know what kind of files are in question. But regardless, the problem is quite difficult. You are on the right track to think about comparing things for similarities instead of trying to mark the file creation in some way.

Someone mentioned earlier that you should recognize the duplication when grading. This, unfortunately, is the best answer. :D The problem of detecting pairwise similarities in a set of many students becomes very expensive, so having your brain do the work is the most practical. Use the file-comparison program to help convince yourself how similar two files are after you've noticed them.

Imagine you didn't even read the files first. For a class of 25 students, each submitting one file, you would have to consider 300 possible pairings of different files to brute-force check all of them! Are you going to launch a comparison program 300 times for one batch of student work? That would be 10 hours of work if it takes 2 minutes for each comparison. In university, we faced this issue in lectures with 200 students, which leads to about 20 thousand possible pairings... (those up on their maths call this particular calculation "N choose 2", where N is the number of students).
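
The "N choose 2" arithmetic can be checked with a few lines of Python. `pairing_cost` is a name made up for this sketch; the 2 minutes per comparison is the figure from the post.

```python
from math import comb

def pairing_cost(students, minutes_per_comparison):
    """Return (number of distinct pairs, hours needed to check them all)."""
    pairs = comb(students, 2)          # N choose 2 = N * (N - 1) / 2
    hours = pairs * minutes_per_comparison / 60
    return pairs, hours

class_pairs, class_hours = pairing_cost(25, 2)      # 300 pairs, 10.0 hours
lecture_pairs, lecture_hours = pairing_cost(200, 2)  # 19900 pairs

print(class_pairs, class_hours, lecture_pairs)
```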

It's an interesting problem that brings out the obsessive compulsiveness of some technical people. But in practice, once you detect similarities you still have a human problem of distinguishing who (if anybody) actually cheated. Sometimes a poorly designed assignment leads many students to independently write almost the same answer, leading to false accusations if you are not careful. Good luck. :o

Link to comment
Share on other sites

What about taking better control of the production process rather than comparing results?

Let's say students can submit their assignments only through an online portal that doesn't allow cut and paste of any kind, something like Flash that doesn't take right clicks, just build a simple submit form and that's it.

On the server side you can run a script that compares each submission to previous entries, in case someone types up a copy from scratch.

The script doesn't have to be very complex. Just take a few random sample strings, maybe 100 characters long, and see if they pop up anywhere else.
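
A rough sketch of that sampling idea in Python follows. The function names, the default of 5 samples, and the dictionary of earlier submissions are all assumptions made for illustration. Note that it only finds exact matches, so retyped text with changed spacing or wording would slip through.

```python
import random

def sample_snippets(text, count=5, length=100, seed=None):
    """Pick `count` random substrings of `length` characters from `text`."""
    if not text:
        return []
    if len(text) <= length:
        return [text]
    rng = random.Random(seed)
    starts = [rng.randrange(len(text) - length + 1) for _ in range(count)]
    return [text[s:s + length] for s in starts]

def find_matches(new_text, earlier, count=5, length=100, seed=None):
    """Return names of earlier submissions containing any sampled snippet.

    `earlier` maps a student name to the text of that student's submission.
    """
    snippets = sample_snippets(new_text, count, length, seed)
    return sorted(
        name for name, old_text in earlier.items()
        if any(snippet in old_text for snippet in snippets)
    )
```

Because the samples are random, a copy that shares only part of the text can be missed on one run; taking more samples makes that less likely.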

Link to comment
Share on other sites

Across the grades I use many different programs. The visual ones are no problem, as you would immediately see. At the moment I'm using Logo with some of my kids. The Office family provides data about the author, which makes it easy (my students don't seem to have sussed that you can change this yet). As much as anything, it's useful to be able to point out the evidence when you get the inevitable 'but I didn't copy, teacher'.

Pretty well at the moment I guess I do as you are saying, in that I will spot jarring similarities when I mark and then have to go back and review, which is sometimes time consuming and a bit irritating to be honest. That's why the information already provided by some programs is very useful: I can check the authors very quickly once I notice similarities. Sadly, not all programs provide good properties data. As you suggest, the comparison programs should speed up that process, especially for the Logo work I'm doing now, as I expect the code to be very similar but any little common mistakes should turn up rather easily. But yes, I'm only looking for files that are identical, and then when I notice mistakes or anomalies common to both I can guess that this is a save-out version of someone else's file. I don't want to give the impression I'm obsessing about this, lol. I'm not really hunting for copying, but sometimes it's obvious; it just doesn't seem very scientific for computers to have to 'do it by hand'. :o

Anyway thanks for all the replies everyone.

Link to comment
Share on other sites

I don't know your actual situation, your abilities or facilities, or even what this "logo thing" is.

At this point you are trying to solve the problem, which is fine, but how about preventing the problem in the first place?

Can you control how students access their files? Why is it so easy to open someone else's file and "save as" it? That would leave a "watermark", sure. Maybe it's enough for now: you can just scan the files for the same "watermark", some sort of meta information about the file not easily visible to the user, or whatever it is.

It will be more difficult with copy-paste 'cos that doesn't pick up any metadata, and plain text can be completely reformatted.

In that case you'd need a program to read files line by line and compare them. Great if you find one that can do it automatically, but a custom-made script to do it should be really simple if you find a programmer.
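
Such a script might look something like the Python sketch below. Everything named here (`normalise`, `similarity`, `suspicious_pairs`, the 0.8 threshold) is an assumption for illustration, not an existing tool. It ignores case and extra spaces, compares the files line by line, and runs over every pair so nobody has to launch a comparison by hand.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalise(text):
    """Lower-case each line, collapse runs of whitespace, drop blank lines."""
    lines = (" ".join(line.split()).lower() for line in text.splitlines())
    return [line for line in lines if line]

def similarity(text_a, text_b):
    """Score from 0.0 to 1.0 for how many normalised lines two texts share."""
    return SequenceMatcher(None, normalise(text_a), normalise(text_b)).ratio()

def suspicious_pairs(submissions, threshold=0.8):
    """Compare every pair of submissions; return those at or above threshold.

    `submissions` maps a student name to the text of that student's file.
    """
    hits = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(submissions.items()), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            hits.append((name_a, name_b, round(score, 2)))
    return hits
```

The threshold is a guess and would need tuning: for short assignments where everyone is expected to write nearly the same code, a high score on its own proves nothing.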

Actually your own scripts would pick up your watermarks, too. That would be faster.

Link to comment
Share on other sites

Ha ha, thanks Plus- your post got me thinking.

I'm in favour of writing up a Flash-based assessment module; it's on my rather long to-do list. I've made some Flash-based interactive tutorials (5 grades' worth) but developing the testing uses has so far eluded me. I can see it's mostly ActionScript. Unfortunately, once I get into scripting I'm pretty clueless. I can script buttons or events in Flash but nothing fancy. But you say on the server side, so what would you use? Not Flash, I'm presuming?

The students all save their files to a common folder on the LAN. I could make them save to a separately passworded folder of their own, but it seems clumsy, and when it came to harvesting the files I'd be rooting around in 20-odd sub-folders. Again, you might tell me I could do this with some sort of script; this is an area I badly need to research. Do you know of any handy links? Please PM me if you can't post them here.
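
Harvesting from per-student sub-folders is the kind of job a short script handles well. Here is a minimal Python sketch, assuming one sub-folder per student under a shared root and a destination folder outside that root; `harvest` and both paths are hypothetical. It prefixes each copied file with the sub-folder name, so the owner stays visible even if the file's own metadata is lost.

```python
import shutil
from pathlib import Path

def harvest(shared_root, destination):
    """Copy every file from each student's sub-folder into one flat folder.

    The sub-folder name is prefixed to the file name to record the owner.
    `destination` is assumed to lie outside `shared_root`.
    """
    dest = Path(destination)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for student_dir in sorted(Path(shared_root).iterdir()):
        if not student_dir.is_dir():
            continue
        for file in sorted(student_dir.rglob("*")):
            if file.is_file():
                target = dest / f"{student_dir.name}_{file.name}"
                shutil.copy2(file, target)  # copy2 tries to keep timestamps too
                copied.append(target.name)
    return copied
```

Two files with the same name in different sub-folders of one student would overwrite each other in this sketch, so it suits the simple case of one flat folder per student.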

Funnily enough Logo is a programming language and interface with a bias towards children. (Actually it's pretty cool- if there are any parents out there it's worth checking out- just google 'MSW Logo' and there's tons of good guidance materials out there too). Theoretically I imagine I should be able to write something in that to do the job, but the smarts just ain't there.

But as I was chewing over your post it made me remember something which I was stupid to forget. I copy the students' files from a central server to my memory stick so I can lounge in my chair at home to do my marking. When I do that, all useful metadata is often wiped (well, that's my experience of other programs anyway). Having checked Logo, it looks like its metadata records are reasonably good, although on my home PC it's not actually recorded anything, just lots of blank boxes.

I'm sure a programmer could help me but I suffer from this irritating need to re-invent the wheel. I think I understand what you're suggesting with a script that watermarks but I don't think I have anything like the skill to do it (yet) sadly! Any resource pointers would be extremely gratefully received.

Thanks again everyone for interesting, useful replies. I haven't solved the problem entirely to my liking yet but I think I have a work-around and some avenues to explore at my leisure.

Link to comment
Share on other sites

A quick and dirty way to do it is to run both through 'diff' if you're on a *nix. If there's quite a bit of similarity, you'll see the result spit out an extremely small file. On the other hand, if there's little similarity, it will be the original file size more or less. This is easily defeated, though, simply by changing up the paragraphing if the students know that you're doing it.
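
The same "how big is the diff" heuristic can be sketched in Python with the standard `difflib` module for anyone not on a *nix. `changed_line_count` is a made-up name for this example. A small count relative to the file length suggests the files are close to each other, and, as noted above, reflowing the paragraphs would inflate the count and defeat it.

```python
from difflib import unified_diff
from itertools import islice

def changed_line_count(text_a, text_b):
    """Count lines that a diff marks as added or removed between two texts."""
    diff = unified_diff(text_a.splitlines(), text_b.splitlines(), lineterm="", n=0)
    body = islice(diff, 2, None)  # skip the two "---" / "+++" header lines
    # With n=0 there are no context lines, only "@@" hunk markers and changes.
    return sum(1 for line in body if line.startswith(("+", "-")))
```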

Link to comment
Share on other sites
