Jim said:
Maybe I haven't been clear enough.....I will try again.
I want to have a VB.Net 2005 coded proxy that is multi-threaded to accept
more than one client at a time.
I want to be able to scan html pages for objectionable content (be it adult
subject matter or advertisements) and remove them from the HTML shown in the
browser before the HTML is given to the browser.
I am only interested in scanning HTTP/HTTPS. I do not want to intercept any
other data streams.
This proxy may run as a server on the web for people to point to as they so
desire to remove unwanted elements from their surfing, or it may be freely
distributed to run locally on a person's PC or network.
Is that a better explanation?
Jim
It's not that you haven't been clear (apart from a few minor details).
If any explanation is needed, it's why you want to do this at all. For
yourself (instead of using an existing proxy)? As a commercial product
(good luck with that)? Is it supposed to have some special feature?
Right now it looks like you're reinventing the wheel without any real
reason, and perhaps without the best language/tools for the job either;
such projects are usually written in C++/MFC or similar...
Honestly, I don't think you're seeing the full extent of such a project.
Running a TcpListener as a service doesn't seem hard at first, but then
you have to consider other things:
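(The listener part really is the quick bit. Here's a sketch, in Python
for brevity since the shape matches VB.Net's TcpListener with one
thread per accepted client; the actual filtering/forwarding is stubbed
out as an echo:)

```python
# Python sketch of a one-thread-per-client accept loop (the project
# itself is VB.Net 2005, where TcpListener plus a worker thread per
# client has the same shape). Filtering/forwarding is stubbed out.
import socket
import threading

def handle_client(conn: socket.socket, addr) -> None:
    # A real proxy would parse the HTTP request, filter it, and forward
    # it upstream; here we just echo one read back and close.
    with conn:
        data = conn.recv(4096)
        conn.sendall(data)

def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    while True:
        conn, addr = listener.accept()
        # One thread per client; production code would cap this with a pool.
        threading.Thread(target=handle_client, args=(conn, addr),
                         daemon=True).start()
```

(That much is an afternoon's work. The list that follows is the rest.)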
-Filtering; you want to do two types of filtering: by address and by
content. For addresses you will need block lists (by IP and by domain
name). Parsing these can be costly (in terms of time spent), especially
as the lists grow (they're usually very long). You will have to
benchmark various ways to do this, then profile and optimize the code a
lot. Content filtering will only be harder (and perhaps slower). There
may also be challenges you didn't expect, like people percent-encoding
characters in URLs (s%65xsite.com and the like) to slip past naive
matching.
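(To illustrate the escaping trick: a minimal Python sketch of
normalizing a URL before matching it against a block list. The listed
domains are made-up examples, not a real block list:)

```python
# Normalize a URL before block-list matching so percent-encoding tricks
# ("%65" is just 'e') don't slip past a naive string comparison.
from urllib.parse import unquote, urlsplit

BLOCKED_DOMAINS = {"sexsite.com", "ads.example.net"}  # hypothetical entries

def is_blocked(url: str) -> bool:
    # Decode %XX escapes and lowercase first, then extract the hostname.
    # (Simplified; real code has to handle many more evasions.)
    host = urlsplit(unquote(url.lower())).hostname or ""
    # Match the host itself and every parent domain against the list.
    parts = host.split(".")
    return any(".".join(parts[i:]) in BLOCKED_DOMAINS
               for i in range(len(parts)))
```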
-You will have to do everything you can to keep latency to a minimum
(on top of the added network latency you can't get rid of, there's all
your normal processing and filtering). On the other hand, you will want
to keep CPU load on the proxy machine to a minimum as traffic increases
(not an easy thing here, considering the language/platform choice too).
Keeping memory usage down may be a significant problem as well (lots of
simultaneous requests, lots of data to check against, etc.)
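(A small illustration of why you have to benchmark: with a synthetic
100,000-entry list, the choice of block-list data structure alone
changes per-lookup cost by orders of magnitude. The entries here are
fabricated; only the shape of the comparison matters:)

```python
# Synthetic benchmark: membership test against a 100,000-entry Python
# list versus the same entries in a hash set. The list scan is O(n)
# per request and grows with the block list; the set lookup stays flat.
import timeit

entries = ["blocked%d.example" % i for i in range(100_000)]
as_list = entries
as_set = set(entries)

probe = "blocked99999.example"   # worst case: last element of the list
slow = timeit.timeit(lambda: probe in as_list, number=100)
fast = timeit.timeit(lambda: probe in as_set, number=100)
```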
-Updating/maintaining block lists. That alone is more than a full-time
job. I don't know how you expect to build and update these... Many
companies employ several people just to maintain such lists for their
own products.
-Distributing updates. Yes, you'll need some update mechanism, at least
for the block lists but preferably also for the program itself (without
requiring reboots or downtime). Don't expect to never have to update
your program; it WILL happen, so you have to plan accordingly.
-Maintenance should be very minimal, and that includes updating. IT
departments nowadays don't want extra overhead or more things to
babysit. Which also brings me to the next point:
-Stability. This thing should be rock solid - never EVER crash. Same
story for memory leaks. This is *critical*, but it won't be exactly
easy. When it crashes (or starts consuming a lot of resources), people
have to go fix the problem (see the previous point), users get
frustrated with your product (and understandably start looking at
competitors' offerings), and so on. This will require you to write
excellent code with great error handling, write unit tests (with code
coverage) to ensure your code works 100% as intended, and do a LOT of
load testing. Ideally you'd want code reviews too. Tracking down memory
leaks, or finding out why your app consumes so much memory, isn't
always easy, quick, or fun... And your reputation is on the line (you
don't want people saying "Jim [or whoever] writes buggy junk! it always
crashes!", do you?) If you plan to sell it, you will also need to
support it...
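(What "unit tests" means concretely here, with a stand-in filter
function; the `class="ad"` marker and `strip_ads` are both hypothetical,
and real filtering would parse the HTML rather than match substrings:)

```python
# A stand-in content filter plus the kind of regression tests every
# filtering rule needs, so a later update can't silently break it.
import unittest

def strip_ads(html: str) -> str:
    # Drop any line carrying the (made-up) ad marker.
    return "\n".join(line for line in html.splitlines()
                     if 'class="ad"' not in line)

class StripAdsTest(unittest.TestCase):
    def test_removes_marked_lines(self):
        html = '<p>keep</p>\n<div class="ad">junk</div>'
        self.assertEqual(strip_ads(html), "<p>keep</p>")

    def test_leaves_clean_html_alone(self):
        self.assertEqual(strip_ads("<p>ok</p>"), "<p>ok</p>")
```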
-Features: people will want more features (especially if you expect
anyone to pay for it). After all, tons of very good free proxies ship
with loads of extras: support for other protocols (FTP, WebDAV, etc.),
reporting features, caching (DNS and pages, to speed things up and
reduce bandwidth usage), and tons of options/configuration (perhaps
things like allowing/blocking specific HTTP verbs), etc.
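(Even that last "simple" option implies request parsing plus the
configuration plumbing to drive it. A hypothetical sketch, in Python
for illustration:)

```python
# Hypothetical HTTP verb filter: the allowed set would come from user
# configuration, which is exactly the plumbing that multiplies the work.
ALLOWED_VERBS = {"GET", "HEAD", "POST"}  # example policy, not a recommendation

def verb_allowed(request_line: str) -> bool:
    # The verb is the first token of the request line, e.g. "GET / HTTP/1.1".
    verb = request_line.split(" ", 1)[0].upper()
    return verb in ALLOWED_VERBS
```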
-Coding time: doing all this properly (especially with all the testing,
debugging, and optimization) will take maybe not forever, but still a
very long time (especially if you're alone, and even more once you
count the other required overhead: design, extensive documentation,
planning, meetings, refactoring, etc.). All that coding time multiplied
by the average programmer's hourly wage is a *LOT* of money in any
case. Most likely more than any big commercial offering (like ISA
Server) costs, and definitely more than buying a cheap computer (a
basic Dell plus some extra RAM) and throwing something like Linux+Squid
on it (a good, free, time-tested, stable, full-featured, well-
documented, set-and-forget solution with support, basically), or an
appliance built from it, or a similar solution. Even as an internal IT
department project it would still be too costly...
-Also, you mention scanning HTTPS, which is SSL-encrypted, by the way.
Normal proxies don't filter or understand HTTPS traffic; they just
tunnel it through untouched. If you want to filter it too, you will
also need to build an SSL gateway that handles the SSL handshakes (and
everything else that comes with sitting in the middle of an encrypted
connection). More fun!
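(Concretely: for HTTPS the browser sends a CONNECT request, and without
an SSL gateway, parsing that one line and then blindly relaying the
encrypted bytes is all your proxy can do. A Python sketch of that
parse:)

```python
# What a non-intercepting proxy sees of HTTPS: a "CONNECT host:port"
# request line. Everything after the tunnel opens is opaque ciphertext
# unless you act as an SSL man-in-the-middle gateway.
def parse_connect(request_line: str):
    """Return (host, port) from "CONNECT host:port HTTP/x.y", else None."""
    parts = request_line.strip().split()
    if len(parts) != 3 or parts[0].upper() != "CONNECT":
        return None
    host, sep, port = parts[1].rpartition(":")
    if not sep or not port.isdigit():
        return None
    return (host, int(port))
```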
-We're assuming prior (excellent) knowledge of the
language/platform/framework, TCP/HTTP protocols, various RFCs, etc.
And this is just the very tip of the iceberg. Hopefully it helps you
realize what such a project encompasses. You make it sound like it
should be really trivial, but it isn't.