I'm stumped, splitting up a clump of text into useable information

G

Guest

Here's the problem:

The products we sell have a product name, each product has one or more
grades of the product (which usually relate to the quality of it) for each
grade we have a package size and a unique number called a "SKU" that
identifies that grade and package size in our system.

I need to take the source code of a page with all of this information and
turn it into an excel sheet with columns titled "product" "grade" "packaging"
"sku" for each of the grades. The format of the html source code throws in
all sorts of characters that I need to get rid of. It also lumps it into one
block text with no line breaks. The below text file shows the format of two
products, the first product has 2 grades and the second product has 3 grades.
How can I use macros, formulas, anything to get the block of text into the
format I want. It would take me years if I go in by hand and remove the
unwanted characters and manually copy and paste them into cells. Please help
I would appreciate it so so so much. Is this even possible (or feasible).
There is no way I can get any other form of the data.
Here is an example of what I have:

http://www.savefile.com/files/7119995

Example of what I need:

http://www.savefile.com/files/9420878

Happy Holidays Everyone, and a huge thank you for all of the people who help
people on this site!
 
P

Pete

Well, you might be better taking up Chuck's offer ...

I've looked at your text file in Notepad. With word-wrap on and a few
hard-returns, you can see the structure in there quite easily and some
codes are fairly obvious - " for ", < for Left Tab etc. You
could do successive Find & Replace within Notepad to tidy up the file
quite a bit. However, there are other codes - presumably c2, c3 etc
relate to column 2, column 3 - so it's really down to how well you can
recognise patterns as to how long it will take you.

I think the general strategy would be to use Notepad (or Wordpad) to
convert this into a format which could ultimately be read by Excel. You
may still have a lot of tidying up to do once the data has been put
into Excel.

Pete
 
W

wjohnson

Where does the "original" information come from? If possible can yo
include a sample of it, i.e. HTML or whatever
 
G

Guest

Unfortuantely I don't know if I am able to release all of our packaging sizes
and grades, however the attached text file shows the unique characters for
each type of data. If I could just find a way to split up the text into a
new row everytime it sees the word "productmaster" that would cut down on the
time it takes me to do this considerably. The text file below shows the
exact format of the information without line breaks. The full version just
has over 500 products with over 2000 grades. Just incase the text file won't
download here is an example for two products (with a total of 5 product
grades):

ProductMaster><ProductMaster Name="Actafoam(R)"><Product Grade="F2,
Powder"><Grade SKU="0115791"><c2>66.138 lb drum</c2><c3>0115791</c3><c4
null="1"></c4><c5 null="1"></c5></Grade></Product><Product Grade="R-3"><Grade
SKU="0114873"><c2>449.7384 lb drum</c2><c3>0114873</c3><c4 null="1"></c4><c5
null="1"></c5></Grade></Product></ProductMaster><ProductMaster
Name="Acudyne"><Product Grade="180, 48 solid"><Grade SKU="0113541"><c2>473.99
lb drum</c2><c3>0113541</c3><c4 null="1"></c4><c5
null="1"></c5></Grade></Product><Product Grade="DHR, 48 solid"><Grade
SKU="0113550"><c2>473.99 lb drum</c2><c3>0113550</c3><c4 null="1"></c4><c5
null="1"></c5></Grade></Product><Product Grade="SCP 25% Solid"><Grade
SKU="0113548"><c2>473.99 lb drum</c2><c3>0113548</c3><c4 null="1"></c4><c5
null="1"></c5></Grade></Product></

As you can see the letters "productmaster" is an identifier of a new
product, "product grade" is an identifier of a new grade and so on. If I
could at least break it up a little bit into cells I could go through with
the asap utility and clean it up. I don't expect this to be quick, just
quicker.

Thanks!
 
B

Brian Handly

Dorn said:
Unfortuantely I don't know if I am able to release all of our packaging sizes
and grades, however the attached text file shows the unique characters for
each type of data. If I could just find a way to split up the text into a
new row everytime it sees the word "productmaster" that would cut down on the
time it takes me to do this considerably. The text file below shows the
exact format of the information without line breaks. The full version just
has over 500 products with over 2000 grades. Just incase the text file won't
download here is an example for two products (with a total of 5 product
grades):

ProductMaster><ProductMaster Name="Actafoam(R)"><Product Grade="F2,
Powder"><Grade SKU="0115791"><c2>66.138 lb drum</c2><c3>0115791</c3><c4
null="1"></c4><c5 null="1"></c5></Grade></Product><Product Grade="R-3"><Grade
SKU="0114873"><c2>449.7384 lb drum</c2><c3>0114873</c3><c4 null="1"></c4><c5
null="1"></c5></Grade></Product></ProductMaster><ProductMaster
Name="Acudyne"><Product Grade="180, 48 solid"><Grade SKU="0113541"><c2>473.99
lb drum</c2><c3>0113541</c3><c4 null="1"></c4><c5
null="1"></c5></Grade></Product><Product Grade="DHR, 48 solid"><Grade
SKU="0113550"><c2>473.99 lb drum</c2><c3>0113550</c3><c4 null="1"></c4><c5
null="1"></c5></Grade></Product><Product Grade="SCP 25% Solid"><Grade
SKU="0113548"><c2>473.99 lb drum</c2><c3>0113548</c3><c4 null="1"></c4><c5
null="1"></c5></Grade></Product></

As you can see the letters "productmaster" is an identifier of a new
product, "product grade" is an identifier of a new grade and so on. If I
could at least break it up a little bit into cells I could go through with
the asap utility and clean it up. I don't expect this to be quick, just
quicker.

Thanks!
Dorn

1. What software generated the file? Do you have control of that
software such that you could see if it has other export formats that
might result in 1 row per item?
2. You may want to investigate opening the file in WORD and using Find |
Replace to replace "<productmaster" with "Line Break + <productmaster"

Texas Handly
 
P

Pete

This file is slightly different (easier) than that pointed to in the
link in your original post - in that you had combinations like
&quot;|&quot and ;&gt;&lt; which made it more difficult to see the
pattern. Following on from Brian's suggestion, you want to
Find <ProductMaster Name=
Replace with Line Break
as you don't need the text. Similarly,
Find ><Product Grade=
Replace with , (comma), and
Find ><Grade SKU=
Replace with ,

This way you can build up a csv file quite quickly and then bring it
into Excel.

Pete
 
G

Guest

Thanks for all of your help and suggestions. I ended up figuring out how to
do it with a non excel solution. I used edit replace in notepad to change
all of the jibberish into html table tags then I saved it as a .html opened
it with internet explorer and copied and pasted the table into excel
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top