Approaches to reading doc/docx/pdf files in PHP

1. doc

Use the Linux program antiword or catdoc. Both are fast, but they extract text only (no images or formatting).

catdoc: run yum install catdoc and it is ready to use

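// escapeshellarg() keeps file names with spaces or quotes from breaking the command
$content = shell_exec('/usr/bin/catdoc -d utf-8 ' . escapeshellarg($file));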

antiword:

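  1. Download the source package, then run make && make install
  2. cp /root/bin/antiword /usr/local/bin/
    mkdir /usr/share/antiword
    cp -R /root/.antiword/* /usr/share/antiword/
    chmod 755 /usr/local/bin/*antiword
    chmod 755 /usr/share/antiword/*
  3. antiword -t filename.doc              plain text output (default)
    antiword -f filename.doc              formatted text output
    antiword -m UTF-8.txt filename.doc    use the UTF-8 character mapping file
    
    Remember to handle errors, such as the command failing or returning no output.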
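
The error handling mentioned above can be sketched as follows. This is a minimal sketch, not the only way to do it: runExtractor() is a hypothetical helper name, it assumes PHP is allowed to call exec() on a Linux host, and the antiword path in the usage comment is the install location from step 2.

```php
<?php
/**
 * Run an external command and return its stdout, throwing on failure.
 * runExtractor() is a hypothetical helper, not part of antiword or catdoc.
 *
 * @param string[] $argv program path followed by its arguments
 */
function runExtractor(array $argv): string
{
    // Escape every argument so file names with spaces or quotes are safe.
    $command = implode(' ', array_map('escapeshellarg', $argv));

    $lines = [];
    $exitCode = 0;
    // Discard stderr so only the extracted text is captured.
    exec($command . ' 2>/dev/null', $lines, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException("Command failed with exit code $exitCode: $command");
    }

    return implode("\n", $lines);
}

// Usage (assumes antiword was copied to /usr/local/bin as in step 2):
// $content = runExtractor(['/usr/local/bin/antiword', '-t', $file]);
```

Checking the exit code separates "the tool failed" from "the document is empty", which shell_exec() alone cannot do. Note that escapeshellarg() depends on the process locale, so for non-ASCII file names it may be necessary to set a UTF-8 locale with setlocale(LC_CTYPE, ...) first.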

 

2. docx

Install phpoffice/phpword with Composer and convert the document to HTML for reading. Text and images are fully preserved.

// Load the .docx file and create an HTML writer for it
$phpWord = \PhpOffice\PhpWord\IOFactory::load($file);
$htmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, "HTML");
$content = '';
// Render each section to an HTML fragment and concatenate the results
foreach ($phpWord->getSections() as $section) {
    $writer = new \PhpOffice\PhpWord\Writer\HTML\Element\Container($htmlWriter, $section);
    $content .= $writer->write();
}
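
If only the text is needed (for search indexing, say), the HTML in $content can be reduced to plain text. The following is a rough sketch using only built-in PHP functions; htmlToPlainText() is a hypothetical helper name and is not part of PHPWord, and the list of block-level tags it treats as line breaks is an assumption.

```php
<?php
// htmlToPlainText() is a hypothetical helper, not part of PHPWord.
function htmlToPlainText(string $html): string
{
    // Turn <br> and common block-level closing tags into newlines before stripping.
    $html = preg_replace('~<br\s*/?>|</(p|div|h[1-6]|li|tr)>~i', "\n", $html);

    // Remove the remaining tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of spaces and tabs, trim each line, and drop empty lines.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    $lines = array_map('trim', explode("\n", $text));
    $lines = array_filter($lines, function ($line) {
        return $line !== '';
    });

    return implode("\n", $lines);
}

// Usage: $plain = htmlToPlainText($content);
```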

 

3. pdf

Install smalot/pdfparser with Composer (namespace \Smalot\PdfParser)

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($file);
$content = $pdf->getText();
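
To tie the three approaches together, a caller can choose the reader from the file extension. This is a small sketch under assumptions: detectDocumentType() is a hypothetical helper name, and it trusts the extension rather than inspecting the file contents.

```php
<?php
// detectDocumentType() is a hypothetical helper; it only inspects the extension.
function detectDocumentType(string $file): string
{
    $ext = strtolower(pathinfo($file, PATHINFO_EXTENSION));

    // doc  -> antiword or catdoc
    // docx -> phpoffice/phpword
    // pdf  -> smalot/pdfparser
    if (!in_array($ext, ['doc', 'docx', 'pdf'], true)) {
        throw new InvalidArgumentException("Unsupported file type: '$ext'");
    }

    return $ext;
}

// Usage:
// switch (detectDocumentType($file)) {
//     case 'doc':  /* run catdoc or antiword */ break;
//     case 'docx': /* use PhpWord */            break;
//     case 'pdf':  /* use PdfParser */          break;
// }
```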