sajal Posted March 24, 2005
Hello, I was wondering if anybody knows how to do the following. I have many files (over 1000) like http://cbr.gov.pk/newcu/igm/kpt475.pdf. They are all in PDF format, and every week I get around 50 to 100 new files to add. The problem is that if I need to find any info, I have to search through the files, which is a very tiring process. Now I want to put all the data into some database (not manually, of course). I've tried LEADTOOLS, but it only exports to HTML, DOC, RTF, and TXT, all of which are useless to me. I need Excel or, even better, some nice database program. Exporting to text doesn't insert any sort of control character between fields; if it did, I could code the rest in VB or C myself. Any idea how to do it? If not, then at least how to export the data in the PDF to ASCII text with some special character inserted between fields? Right now I am not using those PDF files to my benefit because I can't make sense of them. If I can get some database running through those files, it will really help my business.
autonomous_unit Posted March 24, 2005
The problem is you need something (someone) intelligent to categorize the different pieces of information. The PDF format is not a structured data format like XML, but rather a procedural drawing format like PostScript. The apparent table format of the document is not because of an underlying table, but because of a list of commands to draw a bunch of outlines and then a bunch of commands to place text at certain coordinates on the page. We see a table because we are visually parsing the page in two dimensions.
Partial output of ps2ascii on your linked PDF (records 2-5 of first page):
2 1X40' LCL CONTAINER 16 PALLETS STC: 430 BAGS, 50 CARS, 4 CTNS AND 2 DRUM OF (FEED GRADE) SINGAPORE 12.6 16 PALLETS 001X40FT ASIA POULTRY FEEDS PVT LIMITED ADISEEO ASIA PACIFIC PTE LTD
3 1X20' LCL CONTAINER STC: 1 PALLET OF LABEL SINGAPORE 0.586 1 PALLETS 001X20FT AHMED LACE WORKS (PVT) LTD PRESTIGE LABEL PTE LTD
4 1X40' LCL CONTAINER STC: 9 PACKAGES (9 CASE) OF CANON PRODUCTS SINGAPORE 2.38 9 PACKAGES 001X40FT GEMCO CANON SINGAPORE PTE LTD
5 1X40' LCL CONTAINER 7 PACKAGES STC: 1 CARTON, 6 CASE OF CANON PRODUCTS SINGAPORE 0.927 7 PACKAGES 001X40FT GEMCO CANON SINGAPORE PTE LTD
To give an idea of what you're up against, here is the PostScript representation of the first row. Note that the text commands are not even whole fields, but rather smaller pieces having to do with layout on the page. You would really need some sort of visual parsing strategy to try to group together text items that live in one "cell outline" that is drawn on the page, attempting to reconstruct the text field by concatenating fragments left-to-right and top-to-bottom within that bounding box. Not fun at all...
(1) 4.71911 Tj
128.5 409.5 Td -0.2821 Tc 0.3191 Tw (1X20' LCL CONTAINER STC: 4 ) 122.841 Tj
128.5 399 Td -0.3554 Tc 0.3257 Tw (PALLETS OF FERROUS ) 97.6329 Tj
128.5 388.5 Td -0.3842 Tc 0.5212 Tw (FUMARATE USP) 66.4919 Tj
261.5 388.5 Td -0.4256 Tc 0 Tw (ROTTERDAM) 53.752 Tj
326.5 388.5 Td -0.1534 Tc (3.448) 21.236 Tj
451.5 388.5 Td -0.226 Tc (4) 4.71911 Tj
459.5 388.5 Td -0.3319 Tc (PALLETS) 37.2691 Tj
516 388.5 Td -0.2108 Tc (001X20FT) 39.6286 Tj
563 399 Td -0.2921 Tc 0.1791 Tw (BRISTOL-MYERS SQUIBB ) 106.12 Tj
563 388.5 Td -0.2682 Tc 0.4052 Tw (PAKISTAN \(PVT\) LTD) 84.893 Tj
675.5 399 Td -0.2526 Tc 0.1396 Tw (M/S. MALLINCKRODE ) 89.1368 Tj
675.5 388.5 Td -0.2089 Tc 0.3459 Tw (BAKER B.V.) 47.1741 Tj
95.5 343.5 Td -0.226 Tc
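The visual parsing strategy described in this post can be sketched roughly in Python. This is only an illustration under assumptions the thread does not confirm: that the two numbers before each Td are absolute page positions (as the excerpt suggests), that fragments starting at the same x belong to the same cell, and that the stream holds a single table row. The names STREAM, FRAGMENT, fragments, and row_fields are made up for the example, and only the simple backslash escapes seen in the excerpt are handled.

```python
import re
from collections import defaultdict

# Shortened copy of the text-drawing commands quoted in the post.
STREAM = r"""
128.5 409.5 Td -0.2821 Tc 0.3191 Tw (1X20' LCL CONTAINER STC: 4 ) 122.841 Tj
128.5 399 Td -0.3554 Tc 0.3257 Tw (PALLETS OF FERROUS ) 97.6329 Tj
128.5 388.5 Td -0.3842 Tc 0.5212 Tw (FUMARATE USP) 66.4919 Tj
261.5 388.5 Td -0.4256 Tc 0 Tw (ROTTERDAM) 53.752 Tj
326.5 388.5 Td -0.1534 Tc (3.448) 21.236 Tj
563 399 Td -0.2921 Tc 0.1791 Tw (BRISTOL-MYERS SQUIBB ) 106.12 Tj
563 388.5 Td -0.2682 Tc 0.4052 Tw (PAKISTAN \(PVT\) LTD) 84.893 Tj
"""

# Hypothetical pattern for one positioned fragment in the excerpt's format.
FRAGMENT = re.compile(
    r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td\b"  # x y Td
    r"[^()]*"                                        # Tc/Tw settings, ignored
    r"\(((?:\\.|[^\\()])*)\)"                        # (text), allowing \( and \)
    r"\s+-?\d+(?:\.\d+)?\s+Tj"                       # number, then Tj
)


def fragments(stream):
    """Yield (x, y, text) for each positioned text fragment."""
    for x, y, raw in FRAGMENT.findall(stream):
        text = re.sub(r"\\(.)", r"\1", raw).strip()  # undo \( \) \\ escapes
        yield float(x), float(y), text


def row_fields(stream):
    """Group fragments by starting x, then read each group top to bottom."""
    columns = defaultdict(list)
    for x, y, text in fragments(stream):
        columns[x].append((y, text))
    fields = []
    for x in sorted(columns):  # left to right
        # Assumption: a larger y is higher on the page.
        parts = sorted(columns[x], key=lambda p: -p[0])
        fields.append(" ".join(text for _, text in parts))
    return fields


# Join the reconstructed fields with a delimiter character.
print("|".join(row_fields(STREAM)))
```

Joining the fields with "|" gives the kind of delimited text the original question asked for. A real file would also need the rows separated first, for example by using the drawn cell outlines, which this sketch does not attempt.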
sajal Posted March 24, 2005
Thanks a ton, autonomous_unit, for some more insight. I tried the software LEADTOOLS ePrint IV EVAL. It seems to convert the PDF to Word 97 format, in which the data is inside a table (not an actual table, but the same as you described): the lines of the table are separate drawn lines, and each line of data within a cell is in a separate text box. Maybe something could be parsed out of Word more easily. There must be some tools or plugins for Word that allow conversion from these drawn lines to actual tables. Gotta Google a little harder, I guess...
Crushdepth Posted March 24, 2005
You can generate a searchable index of a bunch of PDF files using the 'catalog' function in Acrobat. It's just a keyword search, but it works quite well. Better than scraping through them by hand!
sniffdog Posted March 24, 2005
This was discussed recently: PDF->Excel. Just convert to Word with Lead Tools and export to Excel.
Thomas_Merton Posted March 24, 2005
Quoting Crushdepth: "You can generate a searchable index of a bunch of PDF files using the 'catalog' function in Acrobat. Its just a keyword search, but it works quite well. Better than scraping through them by hand !"
Even better: get hold of Copernic's Desktop Search Engine. It is FREE from www.copernic.com. This is a highly intelligent indexing tool that will index everything on your hard disk (including PDF, MP3, etc.). After indexing, it attaches to your toolbar as a small window. Insert a keyword, press the green arrow, and documents are found at lightning speed. I have over 300GB of files, documents, and PDF files. Copernic's Desktop Search Engine has improved my "search" and "find" speed by over 100%.
slimdog Posted March 24, 2005
Maybe of interest to you: http://www.thebeatlesforever.com/processtext/abcpdf.html
It says the program will convert to a multitude of different formats, including mdb (Access) and db (Paradox). There's a free 30-day trial version (it has some restrictions on the number of files, but it will let you know if it's worth considering), and the full version retails at just $12.95 - $39.95 (depending on the number of licences).
sajal Posted March 24, 2005
Crushdepth and Thomas: I already use the search function in the PDF file. It also has an option to search through multiple PDFs at the same time and is very handy, but the problem is that the output is too confusing. Assuming you have seen the sample file in the first post: say I want to see the activities of a particular product. I search for the keyword, then I have to go through the search results and look at them one by one. I can't get a list generated only for the rows where the particular keyword exists, nor can I automatically generate some statistical data out of it.
sniffdog: Since the tables in the file are not actual tables but simply lines drawn, all the conversion tools I've tried don't seem to give me the desired output. The data in the converted file is too jumbled up to use in any sort of database. If you read what autonomous_unit wrote, that made the most sense: the data for each field has to be visually parsed from the PDF. But the question is HOW? I'm downloading Adobe Acrobat (not the Reader), then I'll see if something can be done from there.
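For illustration, once the rows are available as separate fields, the two things asked for here (a list of only the matching rows, plus some statistics) could look roughly like the Python sketch below, using the standard sqlite3 module. The table name manifest, the column names, and the helper functions are all hypothetical. The sample rows are taken from the ps2ascii output quoted earlier in the thread, split into fields by eye, and the thread does not say what the numeric column means, so it is simply called measure.

```python
import sqlite3

# Sample rows from the ps2ascii output in the thread, split into fields by hand.
# Column meanings are assumptions; "measure" is the unlabelled numeric column.
ROWS = [
    ("1X40' LCL CONTAINER 16 PALLETS STC: 430 BAGS, 50 CARS, 4 CTNS AND 2 DRUM OF (FEED GRADE)",
     "SINGAPORE", 12.6, "ASIA POULTRY FEEDS PVT LIMITED"),
    ("1X20' LCL CONTAINER STC: 1 PALLET OF LABEL",
     "SINGAPORE", 0.586, "AHMED LACE WORKS (PVT) LTD"),
    ("1X40' LCL CONTAINER STC: 9 PACKAGES (9 CASE) OF CANON PRODUCTS",
     "SINGAPORE", 2.38, "GEMCO"),
    ("1X40' LCL CONTAINER 7 PACKAGES STC: 1 CARTON, 6 CASE OF CANON PRODUCTS",
     "SINGAPORE", 0.927, "GEMCO"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
conn.execute(
    "CREATE TABLE manifest (description TEXT, port TEXT, measure REAL, consignee TEXT)"
)
conn.executemany("INSERT INTO manifest VALUES (?, ?, ?, ?)", ROWS)


def matching(keyword):
    """Return only the rows whose description contains the keyword."""
    cur = conn.execute(
        "SELECT description, measure, consignee FROM manifest "
        "WHERE description LIKE ? ORDER BY measure",
        (f"%{keyword}%",),
    )
    return cur.fetchall()


def total(keyword):
    """Return (row count, summed measure) for rows matching the keyword."""
    cur = conn.execute(
        "SELECT COUNT(*), SUM(measure) FROM manifest WHERE description LIKE ?",
        (f"%{keyword}%",),
    )
    return cur.fetchone()


for row in matching("CANON"):
    print(row)
print(total("CANON"))
```

This only covers the querying side. Getting clean fields out of the PDFs in the first place is still the hard part discussed above.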
slimdog Posted March 25, 2005
If you have not already downloaded the iFilter (ver 6.0), then it may be worth a look: http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611