Die-Hard Linux Bits-and-Bytes

Tuesday, August 28, 2007

Fighting E-mail Spam

Intro

Everyone knows that e-mail spam is a major problem. It affects individuals, governments, and corporations alike.

I am no exception to this rule, but in the past I took a very passive position on the issue: I had all my e-mail come into my GMail account and let the awesome Google spam filter deal with it. However, I was getting over 1200 spam messages per month, and I finally said, "Enough is enough! I am tired of these ****ing spam messages in my ****ing spam folder."

This is an account of my efforts to cut down on the spam, or at least deal with it in a more organized way. I don't know yet how effective these actions have been. Perhaps others can use this as a guide.

Action

The first thing I did in my crusade was to figure out where the spam is coming from. Spammers need to find out what my e-mail address is before they can spam it, so it must be exposed somewhere on a website.

I deactivated forwarding to my Gmail to separate out spam to different aliases. I also did a search in my Gmail for "in:spam to:name@domain" to find spam messages that were sent to "name@domain" (once for each alias). This allowed me to see how spam differed between the different aliases.
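If you have the spam saved somewhere other than Gmail, the same per-alias breakdown can be roughed out in a few lines of Python. This is only a sketch: the function name count_by_alias is my own invention, and it assumes you have already pulled the raw "To:" header values out of the messages yourself.

```python
from collections import Counter
from email.utils import getaddresses


def count_by_alias(to_headers):
    """Count how many times each recipient address appears.

    to_headers is an iterable of raw "To:" header strings
    (hypothetical input; extracting them is left to the reader).
    """
    counts = Counter()
    for _name, address in getaddresses(list(to_headers)):
        if address:
            # Addresses are compared case-insensitively here for simplicity.
            counts[address.lower()] += 1
    return counts


if __name__ == "__main__":
    sample = ["Me <name@example.org>", "list@example.org, NAME@example.org"]
    for address, total in count_by_alias(sample).most_common():
        print(address, total)
```

Sorting by count makes the noisiest alias float to the top, which is the one worth investigating first.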

Some interesting discoveries:

  • All the spam messages for the catch-all address of a domain I own were addressed to an unused alias. I just started dropping all messages to that alias.
  • A large portion of spam to my university e-mail account is coming through e-mail aliases associated with positions I hold in the Mathematics Society. Now I can address the larger problem of those aliases being exposed.

The remainder of the spam is from random sources to my main GMail address. A good way to see where your address is in plain text is to search for "name@domain" in Google. I found that my e-mail was displayed in the archives of mailing lists I used to participate in. In the future, I will be using a special e-mail account for mailing lists that requires the sender to confirm their identity (an auto-responder).

Other sources of trouble

Another place where my e-mail used to be exposed is the Whois information for some domains I own personally. I have since bought Hidden Whois service for those domains. I think it's worth the extra $5 USD / year.

Fighting Back

Sometimes I want to do more than hide from spammers; I want to take the fight to them! Fortunately, many people feel the same way and have organized SpamHelp.com as a front for the fight.

One interesting tool they offer is HarvesterKiller, which generates an infinite cycle of pages with random e-mail addresses. They ask that people link to it to confuse e-mail harvesters.

The problem is that spammers can just blacklist this site and keep on harvesting. We need to create a simple CGI script (in Perl, PHP, etc.) that can be deployed on a website to easily generate such a spam bot trapper. Let's see them try to blacklist all the websites!
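Here is a rough sketch of what such a script could look like, written in Python rather than Perl or PHP. It is not how HarvesterKiller works (I have not seen their code); the function names, the "page" query parameter, and the use of the reserved ".invalid" domain (so that no real mailbox ever gets hit) are all my own assumptions.

```python
import os
import random
import string


def random_word(rng, min_len=5, max_len=10):
    """Return a random lowercase word."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def trap_page(seed, n_addresses=20, n_links=5):
    """Build an HTML page of fake addresses plus links to more trap pages.

    The same seed always gives the same page, so every link leads to a
    stable page that in turn links to further pages.
    """
    rng = random.Random(seed)
    addresses = [
        f"{random_word(rng)}@{random_word(rng)}.invalid"
        for _ in range(n_addresses)
    ]
    links = [f"?page={rng.randrange(10**9)}" for _ in range(n_links)]
    lines = ["<html><body>"]
    lines += [f'<a href="mailto:{a}">{a}</a><br>' for a in addresses]
    lines += [f'<a href="{link}">more</a><br>' for link in links]
    lines.append("</body></html>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Run as a CGI script: the query string picks which page to show.
    print("Content-Type: text/html\n")
    print(trap_page(os.environ.get("QUERY_STRING", "")))
```

Seeding the generator from the query string means the script needs no storage at all, which keeps it easy to drop onto any site.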

Resources

Sunday, July 08, 2007

Richard Stallman on Copyright and more

Intro

I've been planning to start blogging again for a few weeks, and the recent talk by Richard Stallman (R.M.S) given at the University of Waterloo (UW) provided the perfect excuse.

Background

Richard Stallman is the founder of the Free Software Foundation (FSF) and the GNU project. He is an outspoken advocate of copyright reform and a slew of other issues.

On R.M.S.

At first glance, Stallman appears like a less-impressive version of Lenin. If you read his personal site, he starts to sound like Lenin too. <rant>Lenin was a f***ing bastard, and as a result my expectations of Stallman were not too high</rant>. Dmitry even suggested that an argument between Stallman and me would be amusing.

Stallman has a very powerful presentation style. He commands history and facts to make a good argument. My only problem is his use of exaggeration like "evil corporations" and "horrible punishments"; this style makes him sound even more like Lenin.

The talk

My doubts about Stallman slowly evaporated throughout the duration of the talk. As promised, this talk was about copyright rather than free software. I didn't want to listen to 90 minutes of "Vista sucks, Free Software rules." However, Stallman took 5 minutes to express just how disappointed he was that people don't give "GNU" in "GNU/Linux" enough credit. His sentiment: if Linus is not with the FSF movement, then he is against it! I would have to disagree: just because you can't write a proper kernel, don't bash others who do!

R.M.S. went on to talk about the history of copyright: it started off as censorship, and was transformed into a law to protect authors from publishers who wanted to make money distributing their work. This would benefit the consumer by encouraging authors to publish more works. However, in the digital age anyone can be a "publisher" in the sense of being able to share information with many people. Therefore, copyright is starting to restrict the rights of the very consumers it was supposed to benefit.

Richard Stallman proposes a new system of copyright laws where works are separated into 3 categories: practical (used to increase value to society), opinion (including scientific works), and works of entertainment (they themselves provide value). Each category should be treated separately: works of practical value should be free as in "Free Software", opinion pieces must be free but protected from modifications to preserve their authenticity, and works of entertainment value would continue to have a (much reduced) period of copyright monopoly.

Finally, he touched on the topic of DRM-protected media and how it hurts the rights of the consumer. He also brought up a very disturbing point about how DRM is being applied to e-books, and how this affects the future of publishing.

My opinion

I can openly say I agree with Stallman's main message. Copyright gives record studios, movie studios, and publishers the unfair right to delay technological progress. If manufacturing physical CDs, DVDs, or books is not economical in our society, as these companies claim, then they should go find something else to do. R.M.S. made it very clear how little artists and authors actually get from the sales. As Larry Smith says, all government intervention has a high price (in terms of economics and human rights), and I am not willing to pay the price of copyright any more.

On the flip-side of the coin, I don't think the government should force all software to be free (in the FSF meaning of free), or all music to be DRM-free. The sale of proprietary software and the use of encryption are valid business practices. However, the DMCA should be abolished. We all know DRM is a big joke - every published song worth downloading is still available online. However, it makes big old-fashioned recording companies more comfortable with online music sales. Already we are seeing a shift to DRM-free music. Check out these articles:

Copyright in Canada

Richard Stallman said that he was glad to hear Canadians can "share songs freely online". However, his website links to a website describing a disturbing bill called "C60" which would introduce DMCA-like regulations to Canada. While this bill was dropped when the latest election was called, others are sure to follow. I am planning to sign the petition presented on the website; is anyone who lives locally interested in adding their signature? I am talking about physical signatures on paper.

Software as a Service

Someone beat me to the question about how all this applies to "software as a service" businesses like Google's web services and Microsoft's "Live" series. I was glad to hear Stallman is very clear on the subject: if you use software on someone else's computer then they own it, not you. We can't demand that Google share their source code or exercise any of the other software "freedoms". This finally convinced me that R.M.S. is more of a realist than people give him credit for.

Conclusion

I'll finish off with an R.M.S. quote: don't purchase DRM'ed media unless you can reliably break the DRM!

~ Anton


Monday, April 03, 2006

Some sanity in our DRM'ed world

Today Wired News ran an article titled Reasons to Love Open Source DRM. This article represents the first ray of hope in the land of Mordor... or rather Digital Rights Management. To summarize, Sun Microsystems is working on a new DRM system called DReaM that would have two major advantages over standard systems (from a user point of view):

  1. The rights would be associated with an individual person, not a particular device.
  2. The device licenses would be distributed by an independent standards body, and the software code itself would be open source.

The very fact that this proposal is coming from a big open-source advocate rather than a hardware music player company is good news. It just makes sense that if I buy a song, I should be able to play it when I visit a friend or when I drive in a rented car. I just hope that Apple/Microsoft don't find a way to shut the project down. This will certainly shake things up in the digital world.

Thursday, March 09, 2006

Article - "Math Will Rock Your World"

Here is a very interesting article I came across in Business Week Online. It is about how mathematics is being applied to economics, marketing, information retrieval, and other aspects of the macroscopic world. I think it's important that people realize just how relevant mathematics is in our lives. Math is not just the realm of ultra-nerds and university professors; it touches every part of our life. Here is the article: http://businessweek.com/magazine/content/06_04/b3968001.htm

Thursday, February 09, 2006

GoogleTalk: Google goes head-to-head with Microsoft?

This morning Google activated the GoogleTalk feature in GMail. Now one can send e-mails and chat with people all from the same page.

This sounds familiar somehow... oh ya, MSN's Web Messenger & Hotmail combo.

So how is this different? Well, for one, we know that GMail is already superior to Hotmail in many ways (please, I don't want to go into this one). Furthermore, GoogleTalk has some big advantages over MSN, mainly the fact that it uses the Jabber protocol, which means that it works with many different messaging clients. Google doesn't even need to make GoogleTalk cross-platform; it already is.

This brings up a fundamental question: is Google really going head-to-head with Microsoft? It is obvious that they have been competitors for a while, but is Google actually trying to match Microsoft or at least MSN service-for-service? With the rumors of a Google Browser or even a Google OS floating around, this doesn't seem so unlikely.

Is this a good thing? It depends on how Microsoft responds, and how long Google can maintain their open, users-come-first approach. If Microsoft responds by improving their products, great! But there is also the danger of a full-out war between the two giants, which could lead to some nasty unethical tactics (from both sides).

I guess we'll just wait and see.

Tuesday, February 07, 2006

How to build an AI using the net... today!

Here is an interesting thought that came to me today, and I have to write it down before I can go to sleep.

Disclaimer: The ideas presented here are mere hypothetical speculations; I will not be held responsible for any consequences that arise from the information contained in this article.


Introduction

You have probably seen the movie Terminator 3 where artificial intelligence arises from the complex interconnection of many "learning" or self-modifying systems. The idea of "neural networks" has also been in the news a lot recently; they are being used for everything from simulating the human brain to making very accurate "Related Text" matches on the Safari Bookshelf system.

The idea behind neural networks is that there are many independent nodes, performing relatively simple computations in parallel, and communicating constantly with other nodes. In such a system, learning can be accomplished by modifying the "connections" between the nodes (i.e. what set of outputs a certain set of inputs to a node will produce).

The internet seems to be the natural choice for the basis of very large neural networks. What I will try to show here is how existing technologies already possess all the essential components necessary to build a neural network which spans the entire internet.

The SMTP Protocol

One of the oldest uses for the internet is e-mail. Most e-mail servers use some sort of routing system; a daemon that will accept all incoming SMTP requests, process the message based on certain parameters, optionally modify it, and then either store it on the local system, or re-transmit to another destination.

Notice the two keywords, "modify" and "re-transmit", in the paragraph above. In theory, two mail servers could keep bouncing messages back and forth, modifying them on each re-transmission. A node could also forward an e-mail message back to itself. The rules supported by current mail servers are very complex (regular expressions, parameter substitution, invoking external processes, etc.). Thus, each node in a network of mail servers could perform computations based on the content (or maybe just the headers) of an e-mail message, and pass them on to the next node.

Now imagine that one of the rules is to call an external process, which will in turn modify the rules the server uses for the next message, based on the content of the current one. Thus, the system gains the ability to "learn" by altering its responses to specific inputs.
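To make the "learning" idea a bit more concrete, here is a toy simulation in Python. It does not touch a real mail server; the MailNode class, the "route KEYWORD HOP" instruction format, and the hop names are all made up for illustration. Each message is first routed using the current rules, and only then is its body allowed to rewrite the rules used for the next message.

```python
class MailNode:
    """Toy stand-in for a mail server whose routing rules can change."""

    def __init__(self, default_hop="local"):
        self.rules = {}  # subject keyword -> next hop
        self.default_hop = default_hop

    def process(self, subject, body):
        """Route one message, then update the rules from its body."""
        # Step 1: route with the rules as they stand right now.
        hop = self.default_hop
        for keyword, target in self.rules.items():
            if keyword in subject:
                hop = target
                break
        # Step 2: "learn". A body line of the (hypothetical) form
        # "route KEYWORD HOP" changes how later messages are routed.
        for line in body.splitlines():
            parts = line.split()
            if len(parts) == 3 and parts[0] == "route":
                self.rules[parts[1]] = parts[2]
        return hop


if __name__ == "__main__":
    node = MailNode()
    print(node.process("hello", "route hello node-b"))  # routed before learning
    print(node.process("hello again", ""))              # routed after learning
```

The important part is the ordering: the current message is handled with the old rules, and its content only influences what happens to the messages that come after it.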

It is not hard to imagine how to implement such a system, program it with a basic set of rules or "instincts", and have it learn and adapt based on the input. It may even be possible to add regular mail servers (i.e. without any special rules) to the system, by exploiting the default routing behaviour. The sheer number of such "dumb" servers would still make them computationally worth-while.

TCP/IP Packets

The same general idea applies to an even more basic and wide-spread system, the TCP/IP layer. Just like mail servers, most firewalls and routers now support an ever more complex set of rules. For example, the iptables module on Linux supports rules based on every possible field and even the content of the IP packet, as well as counters and other environmental parameters. Furthermore, the rules can easily be modified on-the-fly using a command-line utility.

Let's imagine a simple scenario: a packet of size n arrives on port p. Since p is a non-standard port, it invokes a special iptables rule, which retransmits the packet to a pre-determined list of IP addresses on port n*p (mod 65535) with the size equal to the total number of packets processed by the server (mod 1400). One can begin to imagine that you can build complex computational rules from such networks.
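The rule from this scenario can be simulated in a few lines of Python. This is a sketch of the arithmetic only, not an actual iptables configuration; the Node class and the addresses (taken from the 192.0.2.0/24 documentation range) are my own assumptions, and I assume the packet counter includes the packet currently being handled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Simulated server that re-transmits packets by a fixed rule."""

    neighbours: list        # pre-determined destination IP addresses
    processed: int = 0      # total packets seen so far

    def handle(self, size, port):
        """Return the (ip, port, size) packets sent in response."""
        self.processed += 1                  # count the current packet too
        out_port = (size * port) % 65535     # n*p (mod 65535)
        out_size = self.processed % 1400     # packets processed (mod 1400)
        return [(ip, out_port, out_size) for ip in self.neighbours]


if __name__ == "__main__":
    node = Node(["192.0.2.2", "192.0.2.3"])
    for packet in node.handle(size=100, port=5000):
        print(packet)
```

Because the output size depends on the node's packet counter, the same input produces different outputs over time, which is what gives each node a bit of state.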

Another good example of a useful rule would be using each server/node as an m-to-n logical gate. For example, "if all of the last n inputs had size > 500, transmit a packet of size 1000 to m predetermined IP addresses, else transmit a packet of size 10 to these same addresses"; this implements an AND gate. Again, the sheer number of nodes, and number of possible links between the nodes, creates staggering possibilities.

Once again, there is a possibility of exploiting un-modified systems as part of such a network, but it would be far more difficult, since retransmitting packets is not a very common rule in most firewall configurations. However, one could exploit the TCP/IP handshake for a 1-1 link with a "smart", pre-programmed node.
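The AND gate rule quoted above could be simulated like this. Again, it is only a sketch in Python: the AndGateNode class is hypothetical, and I assume the gate outputs "low" (size 10) until it has actually seen n inputs.

```python
from collections import deque


class AndGateNode:
    """Simulated node acting as an AND gate over the last n packet sizes."""

    def __init__(self, n, targets, threshold=500):
        self.recent = deque(maxlen=n)   # sizes of the last n packets
        self.targets = list(targets)    # the m predetermined addresses
        self.threshold = threshold

    def receive(self, size):
        """Record one input and return the (ip, size) packets to send."""
        self.recent.append(size)
        window_full = len(self.recent) == self.recent.maxlen
        all_high = window_full and all(s > self.threshold for s in self.recent)
        out_size = 1000 if all_high else 10
        return [(ip, out_size) for ip in self.targets]


if __name__ == "__main__":
    gate = AndGateNode(n=2, targets=["192.0.2.10"])
    for size in (600, 700, 100, 900, 800):
        print(size, gate.receive(size))
```

Swapping all() for any() would turn the same node into an OR gate, so a network of such nodes could in principle be wired into arbitrary logic.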

Conclusion

The main idea of this article is that we don't need dedicated super-computer clusters to build neural networks. The ideas presented here are independent of the operating systems, installed programs, hardware architectures, location, or computing power of the individual nodes. The computing power, communication infrastructure, and standards already exist to allow the creation of almost indefinitely complex neural networks. Furthermore, such a network does not have to interfere with the main purpose of the node systems; the network can use out-of-bounds values to transmit information, without affecting "real" traffic on the network. The main question is how we learn to use such systems, and whether they will be used for good or evil. -- Anton

Thursday, February 02, 2006

Bayes rules in human mental processes

Here is a truly fascinating article I found recently (it's up on Slashdot too). This one's to do with psychology, but also artificial intelligence, so I thought it was appropriate: http://economist.com/science/displayStory.cfm?story_id=5354696 On Slashdot: http://science.slashdot.org/article.pl?sid=06/02/02/2343232 Funny quote from the article:
A frequentist way of doing things would reduce the risk of that happening. But by the time the frequentist had enough data to draw a conclusion, he might already be dead.

Saturday, January 21, 2006

Choosing a Linux Filesystem

With the purchase of a new hard drive comes the exciting activity of partitioning it for use, and with that, choosing the filesystem to use.

In the past, I have always used ext3 for my root partition, and played around with xfs and reiserfs for the other ones. Here is my subjective opinion:

  • Ext3 has always been rock-solid, easy to recover (even after creating a reiserfs filesystem *on top* of it), and has good backwards-compatibility.
  • XFS was used on my DV editing partition, and had very good performance; it was designed with multimedia work in mind. My only complaint is that you can't shrink an XFS filesystem, and since I use LVM, that is annoying.
  • ReiserFS has been really fast when deleting many small files. However, I've had a reiserfs filesystem become corrupted *twice*, and I had no luck recovering the data. It probably has to do with the fact that ReiserFS does only metadata rather than full data journaling (see below).

So now I decided to do some research before formatting the new drive. Here are some useful sites I found:

So what did I come up with? The two most interesting ones appear to be Reiser4 and XFS. I've had my eye on Reiser4 for a long time. It has a very radical design - something that's very interesting from a CS point of view. Visit the NAMESYS website to read all about it. Some benchmarks claim it is the fastest filesystem currently available. Namesys now has patches available for the latest non-mm kernels, so I will definitely try this one out, but I am not going to use it for important data.

XFS has most of the new advanced features described in the Linux Gazette article, but with the added bonus of almost 10 years of good track record. I think I will be using it for most of my files, but ext3 is still my filesystem of choice for my root partition.

I hope this will help someone who needs to make a similar decision. Choosing a filesystem is like choosing a major - you have to do a lot of extra work to change it. Thus choose wisely. I would also very strongly recommend using LVM, even if you have only one disk; it eliminates some of this extra work if you ever have to switch a filesystem.

SmoothSlideSaver for KDE

I just found this great screensaver for KDE; it uses OpenGL to create a very beautiful slideshow of images. Even on my crappy ATI Radeon 7000, it runs perfectly at 1400x1050.

http://www.kde-apps.org/content/show.php?content=33197

This is actually the first screensaver I have used in over 6 months. I am really bored of all those flying lines, matrix simulations, and other fancy graphics. This just seems like a very simple, clean, and professionally done screensaver.

First Post - Hello World!

Hello everyone,

Welcome to my blog, "Die-Hard Linux Bits-and-Bytes". This is a place where I get to post all the latest and exciting linux- and tech-related news, debate the latest controversies, and generally indulge in tech-talk. I hope you will enjoy your stay.

The next two articles you will find above were originally posted on my personal blog a few weeks ago, but I copied them here for consistency. For those of you who are curious, you can find my personal blog in my blogger.com profile.

On a side note, I am always looking for interesting co-op employment opportunities in the areas of software development and computer science research. You will find my resume on the right sidebar.