
Pdf To Some Sorta Database...


sajal


hello,

i was wonderin if anybody knows how to do the following.

i have many files (over 1000) like http://cbr.gov.pk/newcu/igm/kpt475.pdf

they are all in PDF format....and weekly i get around 50 to 100 more new files to add...

the problem is that with these files, if i need to find any info, i would have to search thru the files and its a very tiring process.

now i want to put all the data into some database (not manually of course).

ive tried LEADTOOLS, but it only exports to html, doc, rtf, txt....all are useless to me...

i need excel or, even better, some nice database program...

exporting to text doesnt insert any sort of control character between fields...if it did that i could code in VB or C myself... :o

any idea how to do it?

if not, then at least how to export the data in the PDF to ASCII text, inserting some special character between fields..?

right now i am not using those PDF files to my benefit as i cant make sense of them...if i can get some database running thru those files, it will really help my business....


The problem is you need something (someone) intelligent to categorize the different pieces of information. The PDF format is not a structured data format like XML, but rather a procedural drawing format like PostScript.

The apparent table format of the document is not because of an underlying table, but because of a list of commands to draw a bunch of outlines and then a bunch of commands to place text at certain coordinates on the page. We see a table because we are visually parsing the page in two dimensions.

Partial output of ps2ascii on your linked PDF (records 2-5 of first page):

2

1X40' LCL CONTAINER 16 PALLETS STC: 430 BAGS, 50 CARS, 4 CTNS AND 2 DRUM OF (FEED GRADE) SINGAPORE 12.6 16 PALLETS 001X40FT ASIA POULTRY FEEDS

PVT LIMITED ADISEEO ASIA PACIFIC PTE LTD

3 1X20' LCL CONTAINER STC: 1 PALLET OF LABEL SINGAPORE 0.586 1 PALLETS 001X20FT AHMED LACE WORKS (PVT) LTD PRESTIGE LABEL PTE LTD 4

1X40' LCL CONTAINER STC: 9 PACKAGES (9 CASE) OF CANON PRODUCTS SINGAPORE 2.38 9 PACKAGES 001X40FT GEMCO CANON SINGAPORE PTE LTD

5

1X40' LCL CONTAINER 7 PACKAGES STC: 1 CARTON, 6 CASE OF CANON PRODUCTS SINGAPORE 0.927 7 PACKAGES 001X40FT GEMCO CANON SINGAPORE PTE LTD

To give an idea of what you're up against, here is the PostScript representation of the first row. Note that the text commands are not even whole fields, but rather smaller pieces having to do with layout on the page. You would really need some sort of visual parsing strategy to group together the text items that live in one "cell outline" drawn on the page, and then reconstruct the text field by concatenating fragments left-to-right and top-to-bottom within that bounding box. Not fun at all...

(1) 4.71911 Tj
128.5 409.5 Td
-0.2821 Tc
0.3191 Tw
(1X20' LCL CONTAINER STC: 4 ) 122.841 Tj
128.5 399 Td
-0.3554 Tc
0.3257 Tw
(PALLETS OF FERROUS ) 97.6329 Tj
128.5 388.5 Td
-0.3842 Tc
0.5212 Tw
(FUMARATE USP) 66.4919 Tj
261.5 388.5 Td
-0.4256 Tc
0 Tw
(ROTTERDAM) 53.752 Tj
326.5 388.5 Td
-0.1534 Tc
(3.448) 21.236 Tj
451.5 388.5 Td
-0.226 Tc
(4) 4.71911 Tj
459.5 388.5 Td
-0.3319 Tc
(PALLETS) 37.2691 Tj
516 388.5 Td
-0.2108 Tc
(001X20FT) 39.6286 Tj
563 399 Td
-0.2921 Tc
0.1791 Tw
(BRISTOL-MYERS SQUIBB ) 106.12 Tj
563 388.5 Td
-0.2682 Tc
0.4052 Tw
(PAKISTAN \(PVT\) LTD) 84.893 Tj
675.5 399 Td
-0.2526 Tc
0.1396 Tw
(M/S. MALLINCKRODE ) 89.1368 Tj
675.5 388.5 Td
-0.2089 Tc
0.3459 Tw
(BAKER B.V.) 47.1741 Tj
95.5 343.5 Td
-0.226 Tc
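As a rough illustration of the grouping idea described above, here is a minimal Python sketch. It is not a tested converter for these files. It assumes the Td operands are absolute page positions (which is how they look in this dump), that the left edge of each column is known in advance (the COLUMN_EDGES values are simply read off the dump and are an assumption), and that one row's worth of commands has already been isolated. The function names parse_fragments and build_row are made up for this example.

```python
import bisect
import re
from collections import defaultdict

# Matches "x y Td" (move to a position) and "(text) width Tj" (show text).
TD = re.compile(r'^\s*(-?[\d.]+)\s+(-?[\d.]+)\s+Td\s*$')
TJ = re.compile(r'^\s*\((.*)\)\s+(-?[\d.]+)\s+Tj\s*$')

# A few commands copied from the dump above (one table row, three columns).
SAMPLE = r"""
128.5 409.5 Td
(1X20' LCL CONTAINER STC: 4 ) 122.841 Tj
128.5 399 Td
(PALLETS OF FERROUS ) 97.6329 Tj
128.5 388.5 Td
(FUMARATE USP) 66.4919 Tj
261.5 388.5 Td
(ROTTERDAM) 53.752 Tj
563 399 Td
(BRISTOL-MYERS SQUIBB ) 106.12 Tj
563 388.5 Td
(PAKISTAN \(PVT\) LTD) 84.893 Tj
"""

# ASSUMPTION: left edges of the columns present in SAMPLE, sorted ascending.
COLUMN_EDGES = [128.5, 261.5, 563.0]


def parse_fragments(ps_text):
    """Return (x, y, text) for every Tj that follows a Td; other lines are ignored."""
    fragments = []
    pos = None
    for line in ps_text.splitlines():
        m = TD.match(line)
        if m:
            pos = (float(m.group(1)), float(m.group(2)))
            continue
        m = TJ.match(line)
        if m and pos is not None:
            # Only simple backslash escapes such as \( and \) are handled here.
            text = re.sub(r'\\(.)', r'\1', m.group(1)).strip()
            fragments.append((pos[0], pos[1], text))
    return fragments


def build_row(fragments, column_edges):
    """Put each fragment in the column whose left edge is at or left of its x,
    then join the fragments of each column top-to-bottom, left-to-right."""
    cells = defaultdict(list)
    for x, y, text in fragments:
        col = bisect.bisect_right(column_edges, x) - 1
        if col < 0:
            continue  # left of the first known column: skip it
        # y grows upward on the page, so sorting on -y gives top-to-bottom.
        cells[col].append((-y, x, text))
    return [' '.join(t for _, _, t in sorted(cells[i]))
            for i in range(len(column_edges))]


if __name__ == "__main__":
    row = build_row(parse_fragments(SAMPLE), COLUMN_EDGES)
    print('|'.join(row))  # one delimiter character between fields
```

Joining the cells with a character such as "|" gives the delimited text asked for in the first post. A real run would still have to find the column edges (for example from the drawn outlines) and decide where one row ends and the next begins, which this sketch does not attempt.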


tx a ton autonomous_unit for some more insight.

i tried the software LEADTOOLS ePrint IV EVAL, it seems to convert the PDF to word 97 format in which the data is inside a table (not an actual table but the same as u described). the lines of the table are separate lines and each line of data within a cell is in a separate text box....

maybe something could be parsed out from word more easily....

there must be some tools/plugins for word which allows conversion from these drawn lines to actual tables....

gotta google a little harder i guess...


You can generate a searchable index of a bunch of PDF files using the 'catalog' function in Acrobat. It's just a keyword search, but it works quite well. Better than scraping through them by hand!

Even better....get hold of Copernic's Desktop Search Engine. It's FREE from www.copernic.com

This is a highly intelligent indexing tool that will index everything on your hard disk (including pdf and mp3 etc.). After indexing it attaches to your toolbar with a small window. Insert a keyword, press the green arrow, and documents are found at lightning speed.

I have over 300GB of files, documents and pdf files. Copernic's Desktop Search Engine has more than doubled the speed of my "search" and "find" work.


Maybe of interest to you:

http://www.thebeatlesforever.com/processtext/abcpdf.html

Says the program will convert to a multitude of different formats, including mdb (Access) and db (Paradox). There's a free 30-day trial version (it has some restrictions on the number of files, but will let you know if it's worth considering), and the full version retails at just $12.95 - $39.95 (depending on the number of licences).


crushdepth and thomas: i already use the search function in the pdf file.

it also has an option to search thru multiple PDFs at the same time and is very handy...

but the problem with that is the output is too confusing.

assuming you have seen the sample file in the first post...say for instance i want to see the activities of a particular product, i search for the keyword, then i have to go thru the search results and see them one by one...i cant get a list generated only for the rows where the particular keyword exists, nor can i automatically generate some statistical data out of it....

sniffdog: since the tables in the file are not actual tables but simply lines drawn, all the conversion tools ive tried dont seem to give me the desired output.. the data in the converted file is too jumbled up to use in any sort of database...if u read what autonomous_unit had written, that made the most sense...the data for each field has to be visually parsed from the PDF...but the question is HOW?

im downloadin adobe acrobat(not the reader) then ill see if something can be done from there....
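For what it's worth, the keyword listing and the statistics wished for above become simple queries once the rows are in a database. Below is a minimal sketch using Python's built-in sqlite3 module. The table name manifest, the column names, and the way the three sample rows are split into fields are all guesses made for illustration (in particular, "measure" is just a placeholder name for the numeric column, whose meaning the thread does not state).

```python
import sqlite3

# ASSUMPTION: three rows taken from the samples in this thread, split into
# fields by eye; the real files may have more columns.
ROWS = [
    ("1X20' LCL CONTAINER STC: 4 PALLETS OF FERROUS FUMARATE USP",
     "ROTTERDAM", 3.448, 4, "PALLETS"),
    ("1X40' LCL CONTAINER STC: 9 PACKAGES (9 CASE) OF CANON PRODUCTS",
     "SINGAPORE", 2.38, 9, "PACKAGES"),
    ("1X40' LCL CONTAINER 7 PACKAGES STC: 1 CARTON, 6 CASE OF CANON PRODUCTS",
     "SINGAPORE", 0.927, 7, "PACKAGES"),
]


def load(conn, rows):
    """Create the (hypothetical) manifest table and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS manifest ("
        "description TEXT, port TEXT, measure REAL, quantity INTEGER, unit TEXT)"
    )
    conn.executemany("INSERT INTO manifest VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def find(conn, keyword):
    """List only the rows whose description contains the keyword."""
    cur = conn.execute(
        "SELECT description, port, quantity, unit FROM manifest "
        "WHERE description LIKE ? ORDER BY rowid",
        (f"%{keyword}%",),
    )
    return cur.fetchall()


def totals_by_port(conn, keyword):
    """Per port: number of matching rows and their total quantity."""
    cur = conn.execute(
        "SELECT port, COUNT(*), SUM(quantity) FROM manifest "
        "WHERE description LIKE ? GROUP BY port ORDER BY port",
        (f"%{keyword}%",),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path to keep the data
    load(conn, ROWS)
    for row in find(conn, "CANON"):
        print(row)
    print(totals_by_port(conn, "CANON"))
```

The weekly batch of new files would then just be more calls to load, provided the fields can be pulled out of the PDFs reliably in the first place, which is the hard part.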

