kalle07 committed on
Commit
37f76b6
·
verified ·
1 Parent(s): 79b58f4

Update README.md

Files changed (1)
  1. README.md +11 -7
README.md CHANGED
@@ -37,15 +37,19 @@ Most LLM applications only convert your PDF simple to txt, nothing more, its lik
37
  Therefore it's better to convert it with the help of a <b>"Parser"</b>, so the embedder can find better context.<br>
38
  I work with "<b>pdfplumber/pdfminer</b>", no OCR, so it's very fast!<br>
39
  <ul style="line-height: 1.05;">
40
- <li>Works with single and multi pdf list, works with folder</li>
41
- <li>Intelligent multiprocessing</li>
42
  <li>Error tolerant: if a PDF cannot be converted, it is skipped, with no special handling needed</li>
43
  <li>Instant view of the result: click a PDF in the list at the top</li>
44
- <li>Converts some common tables as json-format inside the txt file, readable for embedder</li>
45
- <li>Adds the absolute PAGE number to each page</li>
46
- <li>Adds the label “chapter” for large font and/or “important” for bold font</li>
47
- <li>All txt files will be created in original folder of PDF</li>
48
- <li>All previous txt files are overwritten</li>
 
 
 
 
49
  <li>approx. 5 to 20 pages/sec, depending on complexity and system power</li>
50
  <li>Tested on 300 PDF files (~30,000 pages)</li>
51
  </ul>
 
37
  Therefore it's better to convert it with the help of a <b>"Parser"</b>, so the embedder can find better context.<br>
38
  I work with "<b>pdfplumber/pdfminer</b>", no OCR, so it's very fast!<br>
39
  <ul style="line-height: 1.05;">
40
+ <li>Works with a single PDF, a list of PDFs, or a folder</li>
41
+ <li>Intelligent multiprocessing, ~10-20 pages per second</li>
42
  <li>Error tolerant: if a PDF cannot be converted, it is skipped, with no special handling needed</li>
43
  <li>Instant view of the result: click a PDF in the list at the top</li>
44
+ <li>Removes about 5% of the margins around the page</li>
45
+ <li>Converts some common tables to JSON inside the txt file</li>
46
+ <li>Adds the absolute PAGE number to each page</li>
47
+ <li>Adds the tag “chapter” or “important” to text in a large and/or bold font</li>
48
+ <li>All txt files are created in the PDF's original folder, with the same name as *.txt</li>
49
+ <li>Existing txt files are overwritten if you convert the same PDF again</li>
50
+ <li>If a page contains many text blocks, blocks that you would read first may appear further down the page. (It is a compromise between many layout options)</li>
51
+ <li>Small blocks of text (such as units or individual numbers), usually near diagrams and sketches, appear at the end of each page</li>
52
+ <li>I advise against feeding a PDF file directly into RAG (embedding), as you never know how the extracted text will look, and incorrect input can lead to poor results</li>
53
  <li>approx. 5 to 20 pages/sec, depending on complexity and system power</li>
54
  <li>Tested on 300 PDF files (~30,000 pages)</li>
55
  </ul>
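
The list above mentions three per-page steps: trimming about 5% of the margins, writing tables as JSON, and adding the absolute page number. The following is only a minimal sketch of how such steps could look, not the tool's actual code. The function names and the `--- PAGE n ---` marker format are hypothetical, and the sketch assumes that 5% is trimmed from each side and that the first table row is the header. With pdfplumber, the inputs could plausibly come from `page.width`, `page.height`, `page.crop(...)`, `page.extract_text()` and `page.extract_tables()`.

```python
import json


def crop_box(width, height, margin=0.05):
    """Return (x0, top, x1, bottom) after trimming `margin` from each side.

    Assumption: the README's "about 5% of the margins" is read as 5% per side.
    """
    return (width * margin, height * margin,
            width * (1 - margin), height * (1 - margin))


def table_to_json(rows):
    """Serialize a table (list of rows, first row = header) as a JSON string."""
    header, *body = rows
    return json.dumps([dict(zip(header, row)) for row in body],
                      ensure_ascii=False)


def render_page(page_number, text, tables=()):
    """Build the txt output for one page: page marker, text, then tables.

    The marker format is hypothetical; the README only says that the
    absolute PAGE number is added to each page.
    """
    parts = [f"--- PAGE {page_number} ---", text.strip()]
    parts += [table_to_json(table) for table in tables]
    return "\n".join(parts)


if __name__ == "__main__":
    print(crop_box(600, 800))
    print(render_page(1, "Some page text", [[["unit", "value"], ["mm", "12"]]]))
```

Keeping tables as JSON objects keyed by column header means each cell stays attached to its column name, which is presumably why this form is easier for an embedder to use than a flattened table row.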