Extract data from website code.

  • Thread starter mattwatson.mail

mattwatson.mail

There seem to be many ways to skin this cat. I'm most proficient with Excel, but I could pick up and run with almost anything.

My goal:

Take HTML source of website.
Search for item listings in HTML. (Panasonic TV - 42" - LCD)
Return 3 values for each listing. (Model, size, style)


In looking at the HTML, I'm not worried that the data will be too hard to identify; I have found all the desired info quite easily. What would be the best way to grab and separate this data? I feel like I'd need a combination of Word and Excel to get the raw code into something I could separate by column. I don't really have any macro experience, but if that's an option I could definitely give that a go.

Thanks!
 

Claus Busch

Hi,

On Fri, 8 Nov 2013 10:39:21 -0800 (PST), (e-mail address removed) wrote:
> Take HTML source of website.
> Search for item listings in HTML. (Panasonic TV - 42" - LCD)
> Return 3 values for each listing. (Model, size, style)

If you have Panasonic TV - 42" - LCD in A1, then:
Data => Text to Columns => Delimited => Other: "-" (hyphen) => Finish
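
For anyone open to a script instead of the Excel dialog, the same split can be sketched in a few lines of Python. This is only an illustration under the assumption that every listing has exactly three fields separated by " - "; the function name `split_listing` is made up for this example.

```python
def split_listing(listing):
    """Split 'Model - Size - Style' into three trimmed fields."""
    # Split on " - " (hyphen with surrounding spaces) so that a hyphen
    # inside a field, such as a model number, is less likely to break it.
    parts = [part.strip() for part in listing.split(" - ")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {listing!r}")
    model, size, style = parts
    return model, size, style


# Sample string taken from the question above.
print(split_listing('Panasonic TV - 42" - LCD'))
```

A listing with more or fewer than two separators raises an error rather than silently misaligning columns, which is the main difference from Text to Columns.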


Regards
Claus B.
 

mattwatson.mail

I apologize, I probably should have given specifics.

<h2><a class="thm-hglight-text_color" href="/auto/new-2014-nissan-altima-25_s/664277/">New 2014 Nissan Altima 2.5 S </a></h2>
I need the "New 2014 Nissan Altima 2.5 S"

<dd data-price="23247" class="vehicle_price price_tp-msrp price_strike">$23,247</dd>
I need the "$23247"

<dd data-price="19277" class="vehicle_price price_tp-selling ">$19,277</dd>
I need the "19277"

This is buried in about 25 lines of code per vehicle, with 80 vehicles per page. I think the FIRST obstacle would be to JUST grab those 3 lines I want, and I can't think of the best way to do that. Is there a way to select and delete lines that do or do not include specific words or characters?
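
One way to keep only the lines that contain specific words, sketched in Python rather than Excel. The marker strings are copied from the sample HTML above; the function name `keep_wanted_lines` and the extra `<div>`/`<p>` lines in the sample are invented for illustration, and the sketch assumes each wanted tag sits on its own line of the page source.

```python
# Marker substrings taken from the class attributes in the sample HTML.
MARKERS = ("thm-hglight-text_color", "price_tp-msrp", "price_tp-selling")


def keep_wanted_lines(html_text):
    """Return only the lines that contain at least one marker substring."""
    return [
        line.strip()
        for line in html_text.splitlines()
        if any(marker in line for marker in MARKERS)
    ]


# Hypothetical page fragment: the three real lines plus two filler lines.
SAMPLE = """<div class="vehicle">
<h2><a class="thm-hglight-text_color" href="/auto/new-2014-nissan-altima-25_s/664277/">New 2014 Nissan Altima 2.5 S </a></h2>
<p>Some other markup</p>
<dd data-price="23247" class="vehicle_price price_tp-msrp price_strike">$23,247</dd>
<dd data-price="19277" class="vehicle_price price_tp-selling ">$19,277</dd>
</div>"""

for line in keep_wanted_lines(SAMPLE):
    print(line)
```

The filtered lines stay in page order, so every group of three belongs to one vehicle as long as no vehicle is missing a price.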

If I only had the lines I needed, I could use TextToColumns to get everything after a ">". From there I could use =LEFT with a length of 6 minus 1 to get the price/vehicle I needed.
 

mattwatson.mail

Halfway there. I filtered by lines containing "vehicle_price price_tp-msrp price_strike" and got at least one line I needed, then used TextToColumns to get the number between > and <. Unfortunately, via this method I have to do this 3 times and then get the data sets paired up. I appreciate the idea of using TextToColumns.
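
To avoid three separate passes and the pairing step, a single pass with Python's standard-library `html.parser` could collect all three values per vehicle. This is a rough sketch, not a tested scraper for the real site: it assumes each vehicle title is an `<a class="thm-hglight-text_color">` inside an `<h2>`, that the title appears before that vehicle's prices, and that `data-price` always holds a plain number. The names `VehicleParser` and `extract_vehicles` are made up for this example.

```python
from html.parser import HTMLParser


class VehicleParser(HTMLParser):
    """Collect one record (name, MSRP, selling price) per vehicle title."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one dict per vehicle, in page order
        self._current = None    # record currently being filled
        self._in_h2 = False
        self._in_title = False  # True while inside the title <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2 and "thm-hglight-text_color" in classes:
            # A title link starts a new vehicle record.
            self._current = {"name": "", "msrp": None, "selling": None}
            self.rows.append(self._current)
            self._in_title = True
        elif tag == "dd" and "vehicle_price" in classes and self._current is not None:
            # Prices are read from data-price, so no "$" or "," cleanup is needed.
            if "price_tp-msrp" in classes:
                self._current["msrp"] = attrs.get("data-price")
            elif "price_tp-selling" in classes:
                self._current["selling"] = attrs.get("data-price")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False
        elif tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_title:
            self._current["name"] += data


def extract_vehicles(html_text):
    """Return a list of (name, msrp, selling) tuples; missing prices are None."""
    parser = VehicleParser()
    parser.feed(html_text)
    parser.close()

    def to_int(value):
        return int(value) if value is not None else None

    return [
        (row["name"].strip(), to_int(row["msrp"]), to_int(row["selling"]))
        for row in parser.rows
    ]


# The three sample lines quoted earlier in the thread.
SAMPLE = """<h2><a class="thm-hglight-text_color" href="/auto/new-2014-nissan-altima-25_s/664277/">New 2014 Nissan Altima 2.5 S </a></h2>
<dd data-price="23247" class="vehicle_price price_tp-msrp price_strike">$23,247</dd>
<dd data-price="19277" class="vehicle_price price_tp-selling ">$19,277</dd>"""

for row in extract_vehicles(SAMPLE):
    print(row)
```

Each tuple already has its three values paired, and the list could be written out with the standard `csv` module and opened in Excel.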
 
