PHP Tutorial: A Class for Scraping Web Page Content

若相依 posted on 2015-2-4 00:09:20

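So there it is: you can write PHP programs yourself now. You are still a long way from the professionals, but a good start is half the battle. What should you do next? Pick up a book recommended by an experienced developer and read it from cover to cover, and by "read" I mean work through it yourself.

<?php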
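class Thief{
    // URL of the page to fetch
    public $URL;
    // Marker where the section to parse starts
    public $startFlag;
    // Marker where the section to parse ends
    public $endFlag;
    // Directory where downloaded images are saved
    public $saveImagePath;
    // Base URL under which the saved images are served
    public $imageURL;
    // List content
    public $ListContent;
    // Image paths that need to be fetched
    public $ImageList;
    // File names of the saved images
    public $FileName;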

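    /**
     * Fetch the page content.
     * @return string|false The page source, or false if the request fails
     */
    function getPageContent ()
    {
        // @ suppresses the warning on failure; file_get_contents() then returns false.
        $pageContent = @file_get_contents( $this->URL );

        return $pageContent;
    }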

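    /**
     * Extract the list section between the start and end markers.
     * @param string $content Page source
     * @return string The list section
     */
    function getContentPiece ( $content )
    {
        $piece = $this->getContent( $content, $this->startFlag, $this->endFlag );
        // Fall back to a plain string cut on the original source when the regex finds nothing.
        if( !$piece ) $piece = $this->cut( $content, $this->startFlag, $this->endFlag );
        return $piece;
    }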

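    /**
     * Extract part of a string.
     * @param string $sourceStr Source text
     * @param string $startStr Marker where the part starts
     * @param string $endStart Marker where the part ends
     * @return string Every match between the markers, concatenated
     */
    function getContent ( $sourceStr, $startStr, $endStart )
    {
        // Assumption: the markers arrive already decoded (the decode() helper this
        // listing called is not defined in the post). "@" is the regex delimiter,
        // so it is quoted as well.
        $s = preg_quote( $startStr, "@" );
        $e = preg_quote( $endStart, "@" );
        // A space in a marker matches any whitespace character.
        $s = str_replace( " ", "[[:space:]]", $s );
        $e = str_replace( " ", "[[:space:]]", $e );
        // A line break in a marker matches one or more control characters,
        // so both \r\n and \n in the page are accepted.
        $s = str_replace( "\r\n", "[[:cntrl:]]+", $s );
        $e = str_replace( "\r\n", "[[:cntrl:]]+", $e );
        preg_match_all( "@" . $s . "(.*?)" . $e . "@is", $sourceStr, $tpl );
        // $tpl[1] holds the text captured between the markers for every match.
        return empty( $tpl[1] ) ? "" : implode( "", $tpl[1] );
    }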

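    function cut ( $sourceStr, $startStr, $endStr )
    {
        // Assumption: the global cut() helper this method delegated to is not part of
        // the post; it is taken to return the text between the two markers.
        $start = strpos( $sourceStr, $startStr );
        if( false === $start ) return "";
        $start += strlen( $startStr );
        $end = strpos( $sourceStr, $endStr, $start );
        return false === $end ? "" : substr( $sourceStr, $start, $end - $start );
    }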

    /**
     * Collect the <a> elements (link plus text) from the list section.
     * @param string $sList List section source
     * @return array Title and link pairs
     */
    function getSourceList ( $sList )
    {
        preg_match_all( "/<a[[:space:]](.*?)<\/a>/is", $sList, $list );
        // Index 0 holds the full <a ...>...</a> matches.
        $list = $list[0];
        if( !$list || !is_array( $list ) ){
            return $this->getSourceListExtend( $sList );
        }else{
            return $this->getList( $list );
        }
    }

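    function getSourceListExtend($sList)
    {
        // Fallback: split on </a>, then keep what follows "<a" in each chunk.
        $list = array();
        $content = explode( "</a>", $sList );
        for( $i = 0; $i < count( $content ) - 1; $i++ )
        {
            $lists = explode( "<a", $content[$i] );
            if( isset( $lists[1] ) ) $list[] = $lists[1];
        }
        return $this->GetListExtend( $list );
    }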

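    /**
     * Build the list entries.
     * @param array $list <a> elements from the list section
     * @return array Entries of the form array( title, link )
     */
    function getList ( $list )
    {
        $listContent = array();
        for ( $i = 0; $i < count( $list ); $i++ )
        {
            // title
            preg_match( "/>(.*?)<\/a>/is", $list[$i], $templ );
            // link: the closing quote must match the opening one, so quoted values are not captured empty
            preg_match( '/href\s*=\s*(["\']?)([^"\'\s>]+)\1/i', $list[$i], $tempc );

            // Keep the entry only when both parts were found.
            if( !empty( $templ[1] ) && !empty( $tempc[2] ) )
            {
                $link = $tempc[2];
                // Root-relative link: prepend the scheme and host of the source URL.
                // strpos() returns false when "/" is absent, hence the strict comparison.
                if( 0 === strpos( $link, "/" ) && preg_match( "@^https?://[^/]+@i", $this->URL, $url ) )
                {
                    $link = $url[0] . $link;
                }

                $listContent[] = array( $templ[1], $link );
            }
        }
        if( !$listContent ){
            return $this->GetListExtend( $list );
        }else{
            return $listContent;
        }
    }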
    function GetListExtend ( $list )
    {
        // Fallback parser for markup that getList() cannot handle. Quotes are dropped
        // first; "=" is kept so that query strings in links survive.
        $list = str_replace( array( "\"", "'" ), "", $list );
        $listContent = array();
        for ( $i = 0; $i < count( $list ); $i++ )
        {
            // link: everything after href= up to the next whitespace or ">"
            $temp_link = preg_match( '/href\s*=\s*([^\s>]+)/i', $list[$i], $m ) ? $m[1] : "";
            // title: the text after the last ">", or the whole item when nothing follows it
            // (strpos() replaces eregi(), which was removed in PHP 7)
            $temp_title = false !== strpos( $list[$i], ">" ) ? substr( strrchr( $list[$i], ">" ), 1 ) : "";
            if( !$temp_title ) $temp_title = $list[$i];
            // Strip any remaining tags and stray angle brackets.
            $temp_title = preg_replace( "@<(.*?)>@is", "", $temp_title );
            $temp_title = trim( str_replace( array( "<", ">" ), "", $temp_title ) );

            // Keep the entry only when both parts were found.
            if( !empty( $temp_link ) && !empty( $temp_title ) )
            {
                // Root-relative link: prepend the scheme and host of the source URL.
                if( 0 === strpos( $temp_link, "/" ) && preg_match( "@^https?://[^/]+@i", $this->URL, $url ) )
                {
                    $temp_link = $url[0] . $temp_link;
                }

                $listContent[] = array( $temp_title, $temp_link );
            }
        }
        return $listContent;
    }

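    /**
     * Collect the image paths found in the article body.
     * @param string $content Article body
     * @return array Unique image paths
     */
    function getImageList ( $content )
    {
        preg_match_all( "/src=(\"|')(.*?)(\"|')/i", $content, $temp );

        // Index 2 holds the captured paths; reindex so callers can loop with count().
        return array_values( array_unique( $temp[2] ) );
    }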

    /**
     * When images are downloaded, replace the paths in the page with the new ones.
     * @param string $content Page content whose paths need replacing
     * @return string Page content after replacement
     */
    function replaceImageParh ( $content )
    {
        for ( $i = 0; $i < count( $this->ImageList ); $i++ )
        {
            if( !empty( $this->FileName[$i] ) ){
                $content = str_replace( $this->ImageList[$i], $this->imageURL . $this->FileName[$i], $content );
            }else{
                // Assumption: the placeholder image sits under imageURL. The listing read its
                // base path from a $GLOBALS entry whose key did not survive in the post.
                $content = str_replace( $this->ImageList[$i], $this->imageURL . "images/nopic.gif", $content );
            }
        }

        return $content;
    }

    /**
     * Download every image and store it under the save path.
     * @param array $imageURL Image paths to download
     * @return array Saved file names, keyed like $imageURL; failed downloads are left out
     */
    function saveImage ( $imageURL )
    {
        $filename = array();
        for ( $i = 0; $i < count( $imageURL ); $i++ )
        {
            $fName = $this->saveFile( $imageURL[$i] );
            if( !empty( $fName ) )
            {
                $filename[$i] = $fName;
            }
        }

        return $filename;
    }

    function saveFile( $fileName )
    {
        $s_filename = basename( $fileName );
        $ext_name = strtolower( strrchr( $s_filename, "." ) );

        // Only .jpg, .gif and .swf files are downloaded.
        if( !in_array( $ext_name, array( ".jpg", ".gif", ".swf" ) ) )
        {
            return "";
        }

        $url = "";
        if( 0 === strpos( $fileName, "/" ) )
        {
            // Root-relative path: prepend the scheme and host of the source URL.
            if( preg_match( "@^https?://[^/]+@i", $this->URL, $m ) ) $url = $m[0];
        }
        elseif( 0 === strpos( $fileName, "." ) )
        {
            // ./ or ../ path: prepend the directory of the source URL.
            $url = substr( $this->URL, 0, strrpos( $this->URL, "/" ) + 1 );
        }

        $contents = @file_get_contents( $url . $fileName );
        if( false === $contents ) return "";
        $s_filename = time() . rand( 1000, 9999 ) . $ext_name;

        $handle = @fopen( $this->saveImagePath . $s_filename, "w" );
        if( !$handle ) return "";
        fwrite( $handle, $contents );
        fclose( $handle );
        // Keep only files larger than 3 KB; anything smaller is discarded.
        if( filesize( $this->saveImagePath . $s_filename ) > 3072 ){
            return $s_filename;
        }else{
            @unlink( $this->saveImagePath . $s_filename );
            return "";
        }
    }

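    /**
     * When images are not downloaded, turn their paths into absolute URLs.
     * Irregular paths such as ./../ or /./../ are not normalized, but that does not affect the result.
     * @param array $imageURL Image paths to convert
     * @return array The converted image URLs
     */
    function ToPath($imageURL)
    {
        $PathArray = parse_url( $this->URL );
        $webpath = $PathArray["scheme"] . "://" . $PathArray["host"];
        $path = isset( $PathArray["path"] ) ? $PathArray["path"] : "/";
        // Directory of the source page without a trailing slash ("/news" for "/news/list.html").
        $filepath = substr( $path, 0, (int) strrpos( $path, "/" ) );
        $filename = array();
        for ( $i = 0; $i < count( $imageURL ); $i++ )
        {
            if( substr( $imageURL[$i], 0, 1 ) == '/' ){
                $filename[$i] = $webpath . $imageURL[$i];
            }elseif( substr( $imageURL[$i], 0, 2 ) == './' ){
                $filename[$i] = $webpath . $filepath . substr( $imageURL[$i], 1 );
            }elseif( substr( $imageURL[$i], 0, 3 ) == '../' ){
                // Step up one directory from the page's directory.
                $parent = substr( $filepath, 0, (int) strrpos( $filepath, "/" ) );
                $filename[$i] = $webpath . $parent . substr( $imageURL[$i], 2 );
            }elseif( substr( $imageURL[$i], 0, 4 ) == 'http' ){
                $filename[$i] = $imageURL[$i];
            }else{
                // Bare relative path: resolve it against the page's directory.
                $filename[$i] = $webpath . $filepath . "/" . $imageURL[$i];
            }
        }

        return $filename;
    }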
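    /**
     * When images are not downloaded, replace the paths in the page with the new ones.
     * @param string $content Page content whose paths need replacing
     * @return string Page content after replacement
     */
    function ImgPathWordStr( $content )
    {
        for ( $i = 0; $i < count( $this->ImageList ); $i++ )
        {
            // Skip images for which no converted path exists.
            if( isset( $this->FileName[$i] ) )
            {
                $content = str_replace( $this->ImageList[$i], $this->FileName[$i], $content );
            }
        }

        return $content;
    }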

    function setURL ( $u )
    {
        $this->URL = $u;
        return true;
    }

    function setStartFlag ( $s )
    {
        $this->startFlag = $s;
        return true;
    }

    function setEndFlag ( $e )
    {
        $this->endFlag = $e;
        return true;
    }

    function setSaveImagePath ( $p )
    {
        $this->saveImagePath = $p;
        return true;
    }

    function setImageURL ( $i )
    {
        $this->imageURL = $i;
        return true;
    }

}
?>
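
The two core steps of the class, cutting a section out of the page between two markers and turning its <a> elements into title and link pairs, can be tried in isolation. The following is only a minimal sketch under stated assumptions: extractBetween() and extractLinks() are hypothetical helper names that do not appear in the class, the sample markup and the http://example.com base are made up, and a hard-coded string stands in for the fetched page so the script runs without network access.

```php
<?php
// Hypothetical helpers for illustration only; they are not part of the Thief class.

// Return the text between the first $start marker and the next $end marker,
// or "" when either marker is missing.
function extractBetween(string $html, string $start, string $end): string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return "";
    }
    $from += strlen($start);
    $to = strpos($html, $end, $from);
    return $to === false ? "" : substr($html, $from, $to - $from);
}

// Collect [title, link] pairs from the <a> elements in $html.
// Root-relative links are prefixed with $base (scheme and host, no trailing slash).
function extractLinks(string $html, string $base): array
{
    $links = [];
    $pattern = '/<a\s[^>]*href\s*=\s*(["\']?)([^"\'\s>]+)\1[^>]*>(.*?)<\/a>/is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $href = $match[2];
        // strpos() returns false when "/" is absent, so compare strictly with 0.
        if (strpos($href, "/") === 0) {
            $href = $base . $href;
        }
        $links[] = [trim(strip_tags($match[3])), $href];
    }
    return $links;
}

// Made-up sample markup standing in for a fetched page.
$page = '<html><body><!--list--><ul>'
      . '<li><a href="/news/1.html">First <b>post</b></a></li>'
      . '<li><a href=\'http://example.com/2.html\'>Second post</a></li>'
      . '</ul><!--/list--></body></html>';

$section = extractBetween($page, "<!--list-->", "<!--/list-->");
$links = extractLinks($section, "http://example.com");
print_r($links);
```

Note the strict comparison with 0 in the root-relative check: strpos() returns false when the character is absent, and a loose comparison such as 0 == strpos(...) treats that case as a match.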

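Once you understand what a website is, it is easy to see that every site is made up of web pages. To build a site you first have to learn to build pages, so you need to master HTML as the foundation for building sites later.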

再见西城 posted on 2015-2-4 09:29:40

Once the guestbook is finished, a good next goal is to build a single-user blog program.

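兰色精灵 posted on 2015-2-9 21:20:56

If you have never used a framework, there is no need to be afraid. A framework is really just a set of naming conventions plus plugins, and once you have learned one, the others are easy to pick up.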

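小妖女 posted on 2015-2-10 01:12:13

People usually learn PHP in order to build dynamic websites, and job requirements for PHP developers cover a lot of ground. My rough summary: be proficient in PHP and MySQL.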

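分手快乐 posted on 2015-2-10 04:16:50

I still strongly recommend setting up the PHP environment yourself. You will run into problems along the way, and after solving them by searching or reading the PHP manual you will understand much better how the pieces work and what some of the options in the PHP configuration file do.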

admin posted on 2015-2-22 01:13:51

This is a bit disorganized because I am a beginner too; please point out anything I got wrong.

再现理想 posted on 2015-3-6 22:13:32

Learn PHP and MySQL first, then CSS (HTML itself is very simple). I find this works better than the approach I used before.

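冷月葬花魂 posted on 2015-3-9 01:22:50

That is the real direction to take. If you plan to join a development team, be sure to learn template engines such as Smarty and PHPLIB well.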

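蒙在股里 posted on 2015-3-16 19:19:05

Whenever I installed a PEAR package I kept getting messages about some missing file. It turned out that the extension lines are supposed to be in a certain order, and the version I had installed listed them in a different one. I had no choice but to put the semicolons back in front of those lines and leave enabled only the extensions I actually use.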

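乐观 posted on 2015-3-19 18:36:40

Be familiar with HTML, able to work with div+css, plus JavaScript; Linux experience is preferred. When I started out I wanted to learn all of this at once. I naively believed that studying everything together would let the topics reinforce one another, since knowledge is interconnected.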

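小女巫 posted on 2015-3-20 21:22:32

People usually learn PHP in order to build dynamic websites, and job requirements for PHP developers cover a lot of ground. My rough summary: be proficient in PHP and MySQL.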

灵魂腐蚀 posted on 2015-4-6 17:16:01

When you use a JS framework such as jQuery, keep an eye on browser updates, otherwise the framework can easily stop working.

精灵巫婆 posted on 2015-4-17 05:55:10

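活着的死人 posted on 2015-4-21 07:18:20

A senior developer once told me that a PHP developer needs to know at least 200 functions before coding starts to flow, and that the functions you cannot remember should at least be ones you can find in the manual right away. So I suggest beginners browse the PHP manual whenever they have time (the array and string functions, at least, should be memorized).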

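小魔女 posted on 2015-4-21 14:32:56

For the lazy among us, I recommend an all-in-one PHP environment such as XAMPP or WAMP. Both are easy to install and simple to use. Even so, I strongly recommend setting up the development environment by hand.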

若相依 posted on 2015-5-1 15:19:49

Once the guestbook is finished, a good next goal is to build a single-user blog program.

莫相离 posted on 2015-5-1 17:12:01

PS: The above is entirely original; any resemblance to other posts is pure coincidence.

金色的骷髅 posted on 2015-5-6 13:12:11

I once made a very basic mistake: I used a hyphen '-' in a file name and then spent hours hunting for the error. It turns out you cannot use a hyphen '-' in the name; use an underscore '_' instead.

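谁可相欹 posted on 2015-5-12 14:06:53

As for templates, the experts have been arguing about them forever. As a mere newbie I will stay out of the fight; we beginners are better off just learning more.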

爱飞 posted on 2015-6-16 20:11:14

Treat this post as my notes; I will add to it as I run into problems.
