Iterating through IP addresses on a server when screen scraping

Adam

Hello, my server has five IP addresses associated with it. I scrape
some pages off the net but I would like to be able to use different IP
addresses when scraping.

How can I use a different IP when making a web request?

Say, I have 128.128.128.128 which is an IP I use to host one of my
websites, can I use this IP when screen scraping?

I tried using the WebProxy class and passing in one of my server's IPs,
but it did not work.

Thanks
Adam
 
Peter Duniho

Your question is very confusing. What are you scraping? Content from
your own server? Or content from some arbitrary other server? How are
you scraping? Are you really grabbing bits from the screen and using OCR
to convert back to text? Or are you just downloading the page and
extracting the text content? If the latter, why are you using the phrase
"screen scraping"?

When you say you want to use a different IP address, do you mean you want
to target a different IP address, or do you mean you want for the web
request to _come_ from a different IP address?

Those are the biggest ambiguities in your question that I see. It would
help a lot if you would rephrase the question in a clearer way.

Thanks,
Pete
 
Adam

Hello Pete, thank you for answering.

To be open, ever since Google did away with their API I have had to
resort to scraping the results of Google instead of using their API.

My issue is that I can only do so many queries before I am asked to
enter a captcha as I am assuming that it is logging my IP.

I therefore would like to be able to send ten requests from one IP and
then cycle through the other IPs designated to my server (I would
like the request to _come_ from a different IP address on my server).

I could use a third party proxy but I have nothing to hide so do not
mind the requests coming from my server.

Thank you again
Adam
 
Ben Voigt [C++ MVP]

> Hello Pete, thank you for answering.
> To be open, ever since Google did away with their API I have had to
> resort to scraping the results of Google instead of using their API.

Don't. What you are doing is illegal. You are accessing a privately owned
computer network and they make it accessible to the public only under
certain conditions, which you're violating by the use of a spider. Without
permission to access their site, you're guilty of electronic trespass (not
sure what the correct legal term is).

> My issue is that I can only do so many queries before I am asked to
> enter a captcha as I am assuming that it is logging my IP.
>
> I therefore would like to be able to send ten requests from one IP and
> then cycle through the other IPs designated to my server (I would
> like the request to _come_ from a different IP address on my server).

If you wanted to make a request to a network resource you own or have
permission to use, and want the request to come from a particular IP address
on your computer, you'd use bind. Bind is rarely called on client TCP
sockets, but it is possible and gives you control over client IP address,
port, or both.
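
Before binding, it helps to know which local addresses the machine actually has. The following is only a minimal sketch and was not posted in the thread: the names `LocalAddresses` and `ListIPv4` are made up for illustration, and it reports only IPv4 addresses on interfaces the OS reports as up (on some platforms the loopback interface may not be listed because its status is reported as unknown).

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.NetworkInformation;
using System.Net.Sockets;

public static class LocalAddresses
{
    // Returns the IPv4 unicast addresses of every interface reported as up.
    // One adapter can carry several addresses, so the list may be longer
    // than the number of physical adapters.
    public static List<IPAddress> ListIPv4()
    {
        var result = new List<IPAddress>();
        foreach (NetworkInterface nic in NetworkInterface.GetAllNetworkInterfaces())
        {
            if (nic.OperationalStatus != OperationalStatus.Up)
                continue; // skip interfaces that are down or of unknown status

            foreach (UnicastIPAddressInformation info in nic.GetIPProperties().UnicastAddresses)
            {
                if (info.Address.AddressFamily == AddressFamily.InterNetwork)
                    result.Add(info.Address);
            }
        }
        return result;
    }
}
```

Any address returned here is a candidate for the local endpoint passed to bind; an address that is not assigned to the machine cannot be bound.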
 
Peter Duniho

[...]
>> I therefore would like to be able to send ten requests from one IP and
>> then cycle through the other IPs designated to my server (I would
>> like the request to _come_ from a different IP address on my server).
>
> If you wanted to make a request to a network resource you own or have
> permission to use, and want the request to come from a particular IP
> address on your computer, you'd use bind. Bind is rarely called on client
> TCP sockets, but it is possible and gives you control over client IP
> address, port, or both.

Sort of. With Winsock, BSD sockets, .NET Socket, and I believe Java
sockets as well, binding a socket to a specific address affects _only_
inbound transmissions; that is, the address that has to be used in order
for traffic to reach the socket. The network stack decides based on
routing information what network adapter (and thus what IP address) is
used for outbound transmissions.

The "return address" will match what's been bound, of course. So
depending on what Google's actually looking at (return address or source
address), binding may or may not have the desired effect.

Of course, I wholeheartedly agree with your previous comments about
bypassing Google's policies. They offer a service under certain terms,
and users are required to either abide by those terms or forgo using the
service. Circumventing their restrictions is unethical and probably
illegal.

Pete
 
Adam

Thank you for the feedback; I am going to drop Google and just use the
Yahoo API.

Regards
Pete
 
Adam

Sorry, Regards Adam (Not Pete!!)
 
Ben Voigt [C++ MVP]

Peter said:
> [...]
>>> I therefore would like to be able to send ten requests from one IP
>>> and then cycle through the other IPs designated to my server (I
>>> would like the request to _come_ from a different IP address on my
>>> server).
>>
>> If you wanted to make a request to a network resource you own or have
>> permission to use, and want the request to come from a particular IP
>> address on your computer, you'd use bind. Bind is rarely called on
>> client TCP sockets, but it is possible and gives you control over
>> client IP address, port, or both.
>
> Sort of. With Winsock, BSD sockets, .NET Socket, and I believe Java
> sockets as well, binding a socket to a specific address affects _only_
> inbound transmissions; that is, the address that has to be used in
> order for traffic to reach the socket. The network stack decides
> based on routing information what network adapter (and thus what IP
> address) is used for outbound transmissions.

The IP address of the adapter used for transmission does not determine the
source address. The address used for bind is used to build the packet and
set the source address. The selection of what link to use to send the
packet is part of routing, and won't change the source IP address (unless
NAT is involved).

Also there's the possibility of virtual interfaces, where many IP addresses
are associated with a single network adapter.

> The "return address" will match what's been bound, of course. So
> depending on what Google's actually looking at (return address or
> source address), binding may or may not have the desired effect.

There's no such distinction between "source" and "return" address.
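
The claim that the bound address becomes the packet's source address can be checked without leaving the machine. What follows is a rough sketch under stated assumptions, not code from the thread: `BindDemo` and `ObservedSource` are illustrative names, and the listener is a throwaway one on loopback. It binds a client socket to a chosen local address, connects, and returns the client endpoint as the accepting side saw it.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class BindDemo
{
    // Connects to a temporary loopback listener from a client socket bound
    // to localAddress, and returns the client's endpoint as the listener saw it.
    public static IPEndPoint ObservedSource(IPAddress localAddress)
    {
        var listener = new TcpListener(IPAddress.Loopback, 0); // port 0: OS picks a free port
        listener.Start();
        try
        {
            using (var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
            {
                // Bind before Connect: this fixes the source address
                // (port 0 again leaves the source port to the OS).
                client.Bind(new IPEndPoint(localAddress, 0));
                client.Connect((IPEndPoint)listener.LocalEndpoint);

                using (Socket accepted = listener.AcceptSocket())
                {
                    return (IPEndPoint)accepted.RemoteEndPoint;
                }
            }
        }
        finally
        {
            listener.Stop();
        }
    }
}
```

Calling `BindDemo.ObservedSource(IPAddress.Loopback)` should return an endpoint whose address is 127.0.0.1, that is, the address that was bound. Passing an address that is not assigned to the machine should make `Bind` throw a `SocketException`.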
 
Peter Duniho

[...]
>> The network stack decides
>> based on routing information what network adapter (and thus what IP
>> address) is used for outbound transmissions.
>
> The IP address of the adapter used for transmission does not determine
> the source address. The address used for bind is used to build the
> packet and set the source address. The selection of what link to use to
> send the packet is part of routing, and won't change the source IP
> address (unless NAT is involved).

Hmmm...thanks. Well, that all makes sense, but I'm trying to remember
what it is I was thinking of. I've definitely run into some kind of
address mis-match issue before, but it was years ago. "The vision is
hazy". :) Maybe there's some MAC address thing I'm remembering.

(BTW, I'm assuming you're lumping proxying in with "NAT"...that would
change the source IP address as well. I tend to think of proxies and NAT
routers as different things -- proxies often being more explicit in
configuration -- but I suppose that could be considered somewhat of an
arbitrary distinction, depending on context).

Anyway, it sounds like the issue is moot now, which is how I was
hoping things would turn out. :) (The best problems are the ones that are
solved by just making them go away. :) )

Pete (or is it Adam? ;) )
 
Arne Vajhøj

Adam said:
> To be open, ever since Google did away with their API I have had to
> resort to scraping the results of Google instead of using their API.
>
> My issue is that I can only do so many queries before I am asked to
> enter a captcha as I am assuming that it is logging my IP.
>
> I therefore would like to be able to send ten requests from one IP and
> then cycle through the other IPs designated to my server (I would
> like the request to _come_ from a different IP address on my server).

Check:

http://www.google.com/accounts/TOS

item 5.3

Arne
 
Arne Vajhøj

Peter said:
> Sort of. With Winsock, BSD sockets, .NET Socket, and I believe Java
> sockets as well, binding a socket to a specific address affects _only_
> inbound transmissions; that is, the address that has to be used in order
> for traffic to reach the socket. The network stack decides based on
> routing information what network adapter (and thus what IP address) is
> used for outbound transmissions.

You can bind client addresses.

And it is rather simple in .NET:

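// Requires: using System.Net; using System.Net.Sockets;
Socket s = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
// Bind to the local address first (a local port of 0 lets the OS pick one).
s.Bind(new IPEndPoint(IPAddress.Parse(localip), localport));
s.Connect(new IPEndPoint(IPAddress.Parse(remoteip), remoteport));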

It is the same in C (in fact I tend to always call bind for client
sockets in C).

I believe you can also call bind in Java, but the Socket class
has some convenience constructors that are easier to use.

Arne
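
The original question was about web requests rather than raw sockets, so here is a hedged sketch of the same bind-then-connect idea one level up. It assumes .NET 5 or later, where `SocketsHttpHandler.ConnectCallback` lets the caller supply the connected stream; `BoundHttp` and `CreateClient` are hypothetical names chosen for this example. On the .NET Framework of this thread's era, the comparable hook was, as far as I know, `ServicePoint.BindIPEndPointDelegate` on the request's `ServicePoint`. As noted earlier in the thread, this is meant for resources you own or have permission to use.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class BoundHttp
{
    // Builds an HttpClient whose outgoing connections are bound to localAddress.
    // Assumption: .NET 5+ (SocketsHttpHandler.ConnectCallback is not available earlier).
    public static HttpClient CreateClient(IPAddress localAddress)
    {
        var handler = new SocketsHttpHandler
        {
            ConnectCallback = async (context, cancellationToken) =>
            {
                var socket = new Socket(localAddress.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
                try
                {
                    // Same two steps as the raw socket version: Bind, then Connect.
                    socket.Bind(new IPEndPoint(localAddress, 0));
                    await socket.ConnectAsync(context.DnsEndPoint, cancellationToken);
                    return new NetworkStream(socket, ownsSocket: true);
                }
                catch
                {
                    socket.Dispose(); // do not leak the socket if bind or connect fails
                    throw;
                }
            }
        };
        return new HttpClient(handler);
    }
}
```

A caller would create the client once per local address, for example `BoundHttp.CreateClient(IPAddress.Parse(localip))`, and reuse it for requests; `localip` here is the same placeholder used in the snippet above.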
 
Ben Voigt [C++ MVP]

Peter Duniho said:
> [...]
>>> The network stack decides
>>> based on routing information what network adapter (and thus what IP
>>> address) is used for outbound transmissions.
>>
>> The IP address of the adapter used for transmission does not determine
>> the source address. The address used for bind is used to build the
>> packet and set the source address. The selection of what link to use to
>> send the packet is part of routing, and won't change the source IP
>> address (unless NAT is involved).

> Hmmm...thanks. Well, that all makes sense, but I'm trying to remember
> what it is I was thinking of. I've definitely run into some kind of
> address mis-match issue before, but it was years ago. "The vision is
> hazy". :) Maybe there's some MAC address thing I'm remembering.

A layer 3 (IP) router would change the layer 2 (MAC) address, while a layer
2 router (more commonly called a switch) would not.

> (BTW, I'm assuming you're lumping proxying in with "NAT"...that would
> change the source IP address as well. I tend to think of proxies and NAT
> routers as different things -- proxies often being more explicit in
> configuration -- but I suppose that could be considered somewhat of an
> arbitrary distinction, depending on context).

To be more precise, a proxy creates a second connection alongside the first
(TCP options, sequence numbers, etc. are never exchanged between the
client and server). NAT does not; it rewrites the IP header of the
original packets. PAT (which is commonly called just NAT) rewrites both the
IP and layer 4 (TCP or UDP) headers. Transparent proxies are implicit in
configuration, like NAT, and involve both rewriting the packet (unless your
NAT and proxy server are integrated) and a second connection.


