Oystein
Penultimate Amazing
- Joined
- Dec 9, 2009
- Messages
- 18,903
Hey y'all,
I need someone familiar with some easy web-crawling and XML-reading techniques who can program or customize a crawler for me to read out data records from a website.
Here is the site (may take a minute to load if you are on a weak computer or connection):
www.ae911truth.org/signatures/ae.html
This page has >2,300 links to personal profiles; each link has a local href like this (I am snipping out attributes):
Code:
<a data-link="xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt"></a>
And that links to full URLs such as
Code:
http://www.ae911truth.org/signatures/xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt
Those .txt files contain stuff like this:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<person>
<first_name><![CDATA[Ken]]></first_name>
<middle_name></middle_name>
<last_name><![CDATA[Gorski]]></last_name>
<title></title>
<degree><![CDATA[B Architecture Professional Degree, University of Kansas, 1972]]></degree>
<city><![CDATA[El Paso]]></city>
<state><![CDATA[TX]]></state>
<country><![CDATA[US]]></country>
<occupation_status>Degreed + Licensed</occupation_status>
<tech_biography><![CDATA[I'm a licensed architect and AIA member.]]></tech_biography>
<statement_911><![CDATA[I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.]]></statement_911>
<photo></photo>
<license_info><![CDATA[6477 TX]]></license_info>
</person>
Which I want to have translated into a simpler CSV/spreadsheet-like format such as
Code:
url|first_name|middle_name|last_name|title|degree|city|state|country|occupation_status|tech_biography|statement_911|photo|license_info
xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt|Ken||Gorski||B Architecture Professional Degree, University of Kansas, 1972|El Paso|TX|US|Degreed + Licensed|I'm a licensed architect and AIA member.|I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.||6477 TX
(Same info must go into the same column every time; I believe that all .txt files contain tags for every data item, so it would suffice to output just the data without headers, provided you return an empty field / "|" sign as field delimiter when a tag contains no CDATA)
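To make the wanted translation concrete, here is a rough Python sketch of turning one profile file into one "|"-delimited line. It is only an illustration, not a finished tool: the function name `profile_to_row` is made up, the column order is copied from the header line above, and replacing line breaks and stray "|" characters inside a field is my own assumption about how to keep one record on one line.

```python
import xml.etree.ElementTree as ET

# Column order copied from the header line in the post (after the url column).
FIELDS = [
    "first_name", "middle_name", "last_name", "title", "degree", "city",
    "state", "country", "occupation_status", "tech_biography",
    "statement_911", "photo", "license_info",
]


def profile_to_row(link, xml_bytes):
    """Turn the XML of one profile into a single '|'-delimited line.

    `link` is the relative link from the index page; it becomes column 1.
    """
    person = ET.fromstring(xml_bytes)
    values = [link]
    for field in FIELDS:
        # findtext gives None for a missing tag and "" for an empty tag;
        # both become an empty field so the columns always line up.
        text = person.findtext(field) or ""
        # Assumption: collapse line breaks and swap any "|" inside the data
        # for "/" so a record never spills into extra rows or columns.
        text = " ".join(text.split()).replace("|", "/")
        values.append(text)
    return "|".join(values)
```

Every line produced this way has exactly 14 fields, so an empty tag such as photo shows up as two delimiters in a row.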
Then the same for the same sort of linked profiles on
Code:
http://www.ae911truth.org/signatures/general.html
http://www.ae911truth.org/signatures/general.html#A
http://www.ae911truth.org/signatures/general.html#B
...
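A small note on that list, as a sketch: I am assuming the #A, #B, ... parts are only jump marks inside general.html, so a crawler would need to fetch the page once rather than once per letter. I have not checked whether the site loads each letter separately by script, in which case this does not apply. The helper name `unique_pages` is made up for illustration.

```python
from urllib.parse import urldefrag


def unique_pages(urls):
    """Strip '#fragment' parts and keep each underlying page once, in order."""
    seen = set()
    pages = []
    for url in urls:
        page, _fragment = urldefrag(url)  # splits off everything after '#'
        if page not in seen:
            seen.add(page)
            pages.append(page)
    return pages
```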
I would like to have a little tool that I can run on these URLs whenever I need to update my database: Input is the page with all the names and links, output a list with all profiles. Either something you program from scratch, or perhaps you can recommend a freeware tool that does just that sort of thing and can be configured by a half-witted fellow like myself.
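Roughly what I imagine the "input is the page with all the links" part looking like, sketched in Python with only the standard library. Assumptions: the `data-link` attributes are present in the page's plain HTML (if the page fills them in by script, this finds nothing), the page decodes as UTF-8, and the names `DataLinkCollector`, `extract_links` and `crawl` are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumption: index page address, written with the same scheme and host as
# the full profile URL quoted in the post.
INDEX_URL = "http://www.ae911truth.org/signatures/ae.html"


class DataLinkCollector(HTMLParser):
    """Collect the data-link attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "data-link" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all data-link values found in the given HTML, in page order."""
    collector = DataLinkCollector()
    collector.feed(html_text)
    return collector.links


def crawl(index_url=INDEX_URL):
    """Yield (relative_link, raw_xml_bytes) for each profile on the index page."""
    with urlopen(index_url) as response:
        html_text = response.read().decode("utf-8", errors="replace")
    for link in extract_links(html_text):
        # urljoin resolves the relative link against the index page's folder.
        with urlopen(urljoin(index_url, link)) as response:
            yield link, response.read()
```

With more than 2,300 profiles, a real run would probably want a short pause between requests and some error handling for profiles that fail to load; both are left out here to keep the sketch short.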
Thanks!!
