How can I extract the post content from different blogs with a single function using BeautifulSoup when web scraping in Python?


HTML of one of the posts from the first blog:

<div class="entry-content">
		<p>We are under the same sky.</p>
<p>You and I.</p>
<p>I share the soul of earth with you,</p>
<p>to contribute a verse too.</p>
<p>I have words to give,</p>
<p>a smile to offer.</p>
<p>You are at your right place.</p>
<p>You live ,you stay ,you move ,you play.</p>
<p>May also have works to do and words to say.</p>
<p>We may cross each other or not.</p>
<p>But the thing is, we are here,</p>
<p>in this instant;So what, not so clear.</p>
<p>But the powerful play goes on,</p>
<p>for you may contribute a verse.</p>
		<div id="wordads-preview-parent" class="wpcnt">
			<div class="wpa">
				<span class="wpa-about">Advertisements</span>
				<div class="u">
					<div class="wpa-notice">
						<p>Occasionally, some of your visitors may see an advertisement here, <br />as well as a <a href="https://en.support.wordpress.com/cookie-widget/" target="_blank">Privacy & Cookies banner</a> at the bottom of the page.<br/>You can hide ads completely by upgrading to one of our paid plans.</p>
						<p class="wpa-buttons">
							<a class="wpa-button is-primary" id="wordads-preview-more" href="https://wordpress.com/plans/141006071/?feature=no-adverts&utm_campaign=removeadsnotive" rel="nofollow" target="_blank">Upgrade now</a>
							<a class="wpa-button" id="wordads-preview-dismiss" href="#">Dismiss message</a>
						</p>
					</div>
				</div>
			</div>
		</div>

HTML of one of the posts from the second blog:

<div class="entry-content">
			<h2><span style="color:#000000;">There are lessons which aren&#8217;t taught</span></h2>
<h2><span style="color:#000000;">Everything black isn&#8217;t always dark<img data-attachment-id="38" data-permalink="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/ea530f2a5c6b48821056deb178ed1747/" data-orig-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg" data-orig-size="500,379" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ea530f2a5c6b48821056deb178ed1747" data-image-description="" data-medium-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" data-large-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=490" class="alignright  wp-image-38" src="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" alt="ea530f2a5c6b48821056deb178ed1747" width="328" height="248" srcset="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&amp;h=248 328w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=150&amp;h=114 150w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=300&amp;h=227 300w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg 500w" sizes="(max-width: 328px) 100vw, 328px" /></span></h2>
<h2><span style="color:#000000;">Everything you love isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you need isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you look isn&#8217;t always watched</span></h2>
<h2><span style="color:#000000;">And everything you do isn&#8217;t always what u did.</span></h2>
<h2><span style="color:#ff0000;">REMEMBER!!!!!</span></h2>
<div id="jp-post-flair" class="sharedaddy sd-like-enabled sd-sharing-enabled"><div class="sharedaddy sd-sharing-enabled"><div class="robots-nocontent sd-block sd-social sd-social-icon-text sd-sharing"><h3 class="sd-title">Share this:</h3><div class="sd-content"><ul><li class="share-press-this"><a rel="nofollow" data-shared="" class="share-press-this sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=press-this" rel="noopener noreferrer" target="_blank" title="Click to Press This!"><span>Press This</span></a></li><li class="share-twitter"><a rel="nofollow" data-shared="sharing-twitter-27" class="share-twitter sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=twitter" rel="noopener noreferrer" target="_blank" title="Click to share on Twitter"><span>Twitter</span></a></li><li class="share-facebook"><a rel="nofollow" data-shared="sharing-facebook-27" class="share-facebook sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=facebook" rel="noopener noreferrer" target="_blank" title="Click to share on Facebook"><span>Facebook</span></a></li><li class="share-google-plus-1"><a rel="nofollow" data-shared="sharing-google-27" class="share-google-plus-1 sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=google-plus-1" rel="noopener noreferrer" target="_blank" title="Click to share on Google+"><span>Google</span></a></li><li class="share-end"></li></ul></div></div></div><div class='sharedaddy sd-block sd-like jetpack-likes-widget-wrapper jetpack-likes-widget-unloaded' id='like-post-wrapper-127135943-27-5b54d1ab0f8b1' data-src='//widgets.wp.com/likes/index.html?ver=20180319#blog_id=127135943&amp;post_id=27&amp;origin=awistfulwind.wordpress.com&amp;obj_id=127135943-27-5b54d1ab0f8b1' data-name='like-post-frame-127135943-27-5b54d1ab0f8b1'><h3 class='sd-title'>Like this:</h3><div 
class='likes-widget-placeholder post-likes-widget-placeholder' style='height: 55px;'><span class='button'><span>Like</span></span> <span class="loading">Loading...</span></div><span class='sd-text-color'></span><a class='sd-link-color'></a></div></div>		</div><!-- .entry-content -->
	</div><!-- .entry-body -->

Please help me scrape only the post content from this HTML, in a way that works for both posts and that I could also reuse for other blogs.

Question tags:
web-scraping
beautifulsoup

1 answer


The main problem is removing the advertisements and the unnecessary banners. I wrote a simple function, scrap_data(): you pass it the HTML string and it returns the scraped content:

data_1 = """
<div class="entry-content">
        <p>We are under the same sky.</p>
<p>You and I.</p>
<p>I share the soul of earth with you,</p>
<p>to contribute a verse too.</p>
<p>I have words to give,</p>
<p>a smile to offer.</p>
<p>You are at your right place.</p>
<p>You live ,you stay ,you move ,you play.</p>
<p>May also have works to do and words to say.</p>
<p>We may cross each other or not.</p>
<p>But the thing is, we are here,</p>
<p>in this instant;So what, not so clear.</p>
<p>But the powerful play goes on,</p>
<p>for you may contribute a verse.</p>
        <div id="wordads-preview-parent" class="wpcnt">
            <div class="wpa">
                <span class="wpa-about">Advertisements</span>
                <div class="u">
                    <div class="wpa-notice">
                        <p>Occasionally, some of your visitors may see an advertisement here, <br />as well as a <a href="https://en.support.wordpress.com/cookie-widget/" target="_blank">Privacy & Cookies banner</a> at the bottom of the page.<br/>You can hide ads completely by upgrading to one of our paid plans.</p>
                        <p class="wpa-buttons">
                            <a class="wpa-button is-primary" id="wordads-preview-more" href="https://wordpress.com/plans/141006071/?feature=no-adverts&utm_campaign=removeadsnotive" rel="nofollow" target="_blank">Upgrade now</a>
                            <a class="wpa-button" id="wordads-preview-dismiss" href="#">Dismiss message</a>
                        </p>
                    </div>
                </div>
            </div>
        </div>"""

data_2 = """
<div class="entry-content">
            <h2><span style="color:#000000;">There are lessons which aren&#8217;t taught</span></h2>
<h2><span style="color:#000000;">Everything black isn&#8217;t always dark<img data-attachment-id="38" data-permalink="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/ea530f2a5c6b48821056deb178ed1747/" data-orig-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg" data-orig-size="500,379" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ea530f2a5c6b48821056deb178ed1747" data-image-description="" data-medium-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" data-large-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=490" class="alignright  wp-image-38" src="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" alt="ea530f2a5c6b48821056deb178ed1747" width="328" height="248" srcset="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&amp;h=248 328w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=150&amp;h=114 150w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=300&amp;h=227 300w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg 500w" sizes="(max-width: 328px) 100vw, 328px" /></span></h2>
<h2><span style="color:#000000;">Everything you love isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you need isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you look isn&#8217;t always watched</span></h2>
<h2><span style="color:#000000;">And everything you do isn&#8217;t always what u did.</span></h2>
<h2><span style="color:#ff0000;">REMEMBER!!!!!</span></h2>
<div id="jp-post-flair" class="sharedaddy sd-like-enabled sd-sharing-enabled"><div class="sharedaddy sd-sharing-enabled"><div class="robots-nocontent sd-block sd-social sd-social-icon-text sd-sharing"><h3 class="sd-title">Share this:</h3><div class="sd-content"><ul><li class="share-press-this"><a rel="nofollow" data-shared="" class="share-press-this sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=press-this" rel="noopener noreferrer" target="_blank" title="Click to Press This!"><span>Press This</span></a></li><li class="share-twitter"><a rel="nofollow" data-shared="sharing-twitter-27" class="share-twitter sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=twitter" rel="noopener noreferrer" target="_blank" title="Click to share on Twitter"><span>Twitter</span></a></li><li class="share-facebook"><a rel="nofollow" data-shared="sharing-facebook-27" class="share-facebook sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=facebook" rel="noopener noreferrer" target="_blank" title="Click to share on Facebook"><span>Facebook</span></a></li><li class="share-google-plus-1"><a rel="nofollow" data-shared="sharing-google-27" class="share-google-plus-1 sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=google-plus-1" rel="noopener noreferrer" target="_blank" title="Click to share on Google+"><span>Google</span></a></li><li class="share-end"></li></ul></div></div></div><div class='sharedaddy sd-block sd-like jetpack-likes-widget-wrapper jetpack-likes-widget-unloaded' id='like-post-wrapper-127135943-27-5b54d1ab0f8b1' data-src='//widgets.wp.com/likes/index.html?ver=20180319#blog_id=127135943&amp;post_id=27&amp;origin=awistfulwind.wordpress.com&amp;obj_id=127135943-27-5b54d1ab0f8b1' data-name='like-post-frame-127135943-27-5b54d1ab0f8b1'><h3 class='sd-title'>Like this:</h3><div 
class='likes-widget-placeholder post-likes-widget-placeholder' style='height: 55px;'><span class='button'><span>Like</span></span> <span class="loading">Loading...</span></div><span class='sd-text-color'></span><a class='sd-link-color'></a></div></div>        </div><!-- .entry-content -->
    </div><!-- .entry-body -->"""

from bs4 import BeautifulSoup

def scrap_data(data):
    soup = BeautifulSoup(data, 'lxml')
    # Empty the advertisement block
    for div in soup.select('div#wordads-preview-parent'):
        div.clear()
    # Empty the sharing/like widget block
    for div in soup.select('div#jp-post-flair'):
        div.clear()
    # Return the text that is left inside the post container
    return soup.select_one('.entry-content').text.strip()

print(scrap_data(data_1))
print('-' * 80)
print(scrap_data(data_2))
print('-' * 80)

Prints:

We are under the same sky.
You and I.
I share the soul of earth with you,
to contribute a verse too.
I have words to give,
a smile to offer.
You are at your right place.
You live ,you stay ,you move ,you play.
May also have works to do and words to say.
We may cross each other or not.
But the thing is, we are here,
in this instant;So what, not so clear.
But the powerful play goes on,
for you may contribute a verse.
--------------------------------------------------------------------------------
There are lessons which aren’t taught
Everything black isn’t always dark
Everything you love isn’t always desired
Everything you need isn’t always desired
Everything you look isn’t always watched
And everything you do isn’t always what u did.
REMEMBER!!!!!
--------------------------------------------------------------------------------
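
Since the question also asks about reusing this for other blogs, here is a minimal sketch of one way the same idea could be generalized. The function name extract_post_text, its parameters and the DEFAULT_NOISE_SELECTORS tuple are hypothetical names introduced for illustration, not part of the answer above. It assumes the post body sits in a container matched by a CSS selector, and that the unwanted widgets can be listed as selectors per blog. It uses the built-in 'html.parser' instead of 'lxml' only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical defaults: the two widget containers seen in the samples above.
DEFAULT_NOISE_SELECTORS = ("div#wordads-preview-parent", "div#jp-post-flair")


def extract_post_text(html, content_selector=".entry-content",
                      noise_selectors=DEFAULT_NOISE_SELECTORS):
    """Return the text of the post body, or None if no container matches."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one(content_selector)
    if content is None:
        return None
    # decompose() removes the matched tag itself, not only its children.
    for selector in noise_selectors:
        for tag in content.select(selector):
            tag.decompose()
    # Join the stripped text fragments with newlines; empty ones are skipped.
    return content.get_text(separator="\n", strip=True)


sample = """
<div class="entry-content">
<p>You and I.</p>
<div id="wordads-preview-parent" class="wpcnt"><span>Advertisements</span></div>
</div>
"""
print(extract_post_text(sample))
```

For another blog, you would pass a different content_selector and noise_selectors after inspecting its HTML. One caveat: with separator="\n", text inside inline tags such as links ends up on its own line, so posts with inline markup may need a different separator.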