Chapter 8 Hacking Strings
8.1 Introduction
In this chapter, we will learn to work with string data in R using stringr. As we did in the other chapters, we will use a case study to explore the various features of the stringr package. We will use the following R packages:
library(stringr)
library(tibble)
library(magrittr)
library(purrr)
library(dplyr)
library(readr)
8.2 Case Study
- extract domain name from random email ids
- extract image type from url
- extract image dimension from url
- extract extension from domain name
- extract http protocol from url
- extract file type from url
8.2.1 Data
mockstring <- read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/mock_strings.csv')
mockstring
## # A tibble: 1,000 x 12
## id
## <dbl>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## image_url
## <chr>
## 1 https://robohash.org/providentassumendaexplicabo.jpg?size=50x50&set=set1
## 2 https://robohash.org/etillumvoluptate.jpg?size=50x50&set=set1
## 3 https://robohash.org/nonoptiovoluptatibus.jpg?size=50x50&set=set1
## 4 https://robohash.org/voluptatumauthic.jpg?size=50x50&set=set1
## 5 https://robohash.org/placeaterrorqui.jpg?size=50x50&set=set1
## 6 https://robohash.org/temporeutea.jpg?size=50x50&set=set1
## 7 https://robohash.org/maximesaepequi.bmp?size=50x50&set=set1
## 8 https://robohash.org/nemoautesse.png?size=50x50&set=set1
## 9 https://robohash.org/odiorerumaut.png?size=50x50&set=set1
## 10 https://robohash.org/omnismolestiaearchitecto.png?size=50x50&set=set1
## domain imageurl
## <chr> <chr>
## 1 addtoany.com http://dummyimage.com/130x183.jpg/dddddd/000000
## 2 gmpg.org http://dummyimage.com/106x217.bmp/dddddd/000000
## 3 samsung.com http://dummyimage.com/146x127.bmp/cc0000/ffffff
## 4 spotify.com http://dummyimage.com/181x194.png/5fa2dd/ffffff
## 5 wunderground.com http://dummyimage.com/220x123.jpg/ff4444/ffffff
## 6 alexa.com http://dummyimage.com/118x176.bmp/dddddd/000000
## 7 google.it http://dummyimage.com/185x202.jpg/ff4444/ffffff
## 8 ed.gov http://dummyimage.com/223x163.jpg/ff4444/ffffff
## 9 jigsy.com http://dummyimage.com/145x113.jpg/5fa2dd/ffffff
## 10 jugem.jp http://dummyimage.com/238x214.png/cc0000/ffffff
## email filename phone
## <chr> <chr> <chr>
## 1 mnewburn0@fastcompany.com PedeMalesuada.xls 66-(777)902-6181
## 2 mdankersley1@digg.com LobortisVel.mp3 351-(422)736-6807
## 3 hgirhard2@altervista.org CongueDiamId.pdf 33-(371)684-5114
## 4 pmcmenamy3@sciencedirect.com EleifendQuam.avi 86-(410)823-6712
## 5 drisbrough4@bandcamp.com PurusPhasellus.mp3 223-(518)814-6361
## 6 cphlippi5@surveymonkey.com ElementumInHac.avi 420-(760)354-8671
## 7 kdodswell6@un.org Mattis.doc 1-(712)615-2879
## 8 vhourihane7@ovh.net PurusEu.tiff 62-(437)705-1118
## 9 rdike8@timesonline.co.uk JustoEtiamPretium.xls 1-(683)965-1323
## 10 tdudbridge9@clickbank.net Ante.tiff 30-(553)559-7448
## address
## <chr>
## 1 8 Anhalt Crossing
## 2 697 East Avenue
## 3 89 Dottie Circle
## 4 98135 Blue Bill Park Drive
## 5 7814 Pennsylvania Street
## 6 4897 Little Fleur Drive
## 7 53541 Morrow Center
## 8 4819 Hermina Parkway
## 9 68096 Monument Park
## 10 9595 Spaight Avenue
## url
## <chr>
## 1 https://engadget.com/nascetur/ridiculus/mus/vivamus/vestibulum.jsp?eu=est&ti~
## 2 http://delicious.com/phasellus/in/felis/donec.json?interdum=risus&mauris=dap~
## 3 https://w3.org/sed/augue/aliquam/erat/volutpat.json?dictumst=mi&morbi=sit&ve~
## 4 http://indiatimes.com/pede/lobortis/ligula/sit/amet.jpg?quam=nullam&sollicit~
## 5 https://tumblr.com/id/mauris/vulputate/elementum.png?tincidunt=maecenas&eget~
## 6 https://unblog.fr/est/quam/pharetra.jpg?amet=phasellus&erat=sit&nulla=amet&t~
## 7 http://vinaora.com/posuere.jpg?convallis=in&nulla=faucibus&neque=orci&libero~
## 8 https://globo.com/accumsan.png?elementum=eu&pellentesque=mi&quisque=nulla&po~
## 9 https://xing.com/elementum/eu/interdum/eu/tincidunt.html?sit=proin&amet=eu&s~
## 10 https://bigcartel.com/tortor/quis/turpis/sed/ante/vivamus.html?in=lorem&elei~
## full_name currency passwords
## <chr> <chr> <chr>
## 1 Mufi Ruit ¥34.37 VybPYpEXUjJh6nQk
## 2 Leese Furmagier $67.37 mxET3n6dz42X8YUv
## 3 Blakelee Wilshire €33,85 Z9f4WeNVQ28FwKML
## 4 Terencio McIllrick €42,89 Ndbm8nwCps6jUze3
## 5 Debee McErlaine €13,19 U3Lj9xJw8NHzB5Sg
## 6 Fran Painten ¥87.35 KEhVAC3QNvjWDFJ7
## 7 Frasco Bowich $34.89 jydGPCW7fa2bZpU4
## 8 Car Ponten ¥41.66 pytVHesNZjAL8WKc
## 9 Tades Checcucci €70,80 Rsw4EQGk9tKTnzDp
## 10 Wilton Kemmey €62,76 KvrNGQ7yL3pfsaZA
## # ... with 990 more rows
8.2.2 Data Dictionary
- domain: dummy website domain
- imageurl: url of an image
- email: dummy email id
- filename: dummy file name with different extensions
- phone: dummy phone number
- address: dummy address with door and street names
- url: randomly generated urls
- full_name: dummy first and last names
- currency: different currencies
- passwords: dummy passwords
8.3 Overview
Before we start with the case study, let us take a quick tour of stringr and introduce
ourselves to some of the functions we will be using later in the case study. One of the
columns in the case study data is email. It contains random email ids. We want to ensure
that the email ids adhere to a particular format, i.e.
- they contain @
- they contain only one @
Let us first detect if the email ids contain @. Since the data set has 1000 rows, we will
use a smaller sample in the examples.
mockdata <- slice(mockstring, 1:10)
mockdata
## # A tibble: 10 x 12
## id
## <dbl>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## image_url
## <chr>
## 1 https://robohash.org/providentassumendaexplicabo.jpg?size=50x50&set=set1
## 2 https://robohash.org/etillumvoluptate.jpg?size=50x50&set=set1
## 3 https://robohash.org/nonoptiovoluptatibus.jpg?size=50x50&set=set1
## 4 https://robohash.org/voluptatumauthic.jpg?size=50x50&set=set1
## 5 https://robohash.org/placeaterrorqui.jpg?size=50x50&set=set1
## 6 https://robohash.org/temporeutea.jpg?size=50x50&set=set1
## 7 https://robohash.org/maximesaepequi.bmp?size=50x50&set=set1
## 8 https://robohash.org/nemoautesse.png?size=50x50&set=set1
## 9 https://robohash.org/odiorerumaut.png?size=50x50&set=set1
## 10 https://robohash.org/omnismolestiaearchitecto.png?size=50x50&set=set1
## domain imageurl
## <chr> <chr>
## 1 addtoany.com http://dummyimage.com/130x183.jpg/dddddd/000000
## 2 gmpg.org http://dummyimage.com/106x217.bmp/dddddd/000000
## 3 samsung.com http://dummyimage.com/146x127.bmp/cc0000/ffffff
## 4 spotify.com http://dummyimage.com/181x194.png/5fa2dd/ffffff
## 5 wunderground.com http://dummyimage.com/220x123.jpg/ff4444/ffffff
## 6 alexa.com http://dummyimage.com/118x176.bmp/dddddd/000000
## 7 google.it http://dummyimage.com/185x202.jpg/ff4444/ffffff
## 8 ed.gov http://dummyimage.com/223x163.jpg/ff4444/ffffff
## 9 jigsy.com http://dummyimage.com/145x113.jpg/5fa2dd/ffffff
## 10 jugem.jp http://dummyimage.com/238x214.png/cc0000/ffffff
## email filename phone
## <chr> <chr> <chr>
## 1 mnewburn0@fastcompany.com PedeMalesuada.xls 66-(777)902-6181
## 2 mdankersley1@digg.com LobortisVel.mp3 351-(422)736-6807
## 3 hgirhard2@altervista.org CongueDiamId.pdf 33-(371)684-5114
## 4 pmcmenamy3@sciencedirect.com EleifendQuam.avi 86-(410)823-6712
## 5 drisbrough4@bandcamp.com PurusPhasellus.mp3 223-(518)814-6361
## 6 cphlippi5@surveymonkey.com ElementumInHac.avi 420-(760)354-8671
## 7 kdodswell6@un.org Mattis.doc 1-(712)615-2879
## 8 vhourihane7@ovh.net PurusEu.tiff 62-(437)705-1118
## 9 rdike8@timesonline.co.uk JustoEtiamPretium.xls 1-(683)965-1323
## 10 tdudbridge9@clickbank.net Ante.tiff 30-(553)559-7448
## address
## <chr>
## 1 8 Anhalt Crossing
## 2 697 East Avenue
## 3 89 Dottie Circle
## 4 98135 Blue Bill Park Drive
## 5 7814 Pennsylvania Street
## 6 4897 Little Fleur Drive
## 7 53541 Morrow Center
## 8 4819 Hermina Parkway
## 9 68096 Monument Park
## 10 9595 Spaight Avenue
## url
## <chr>
## 1 https://engadget.com/nascetur/ridiculus/mus/vivamus/vestibulum.jsp?eu=est&ti~
## 2 http://delicious.com/phasellus/in/felis/donec.json?interdum=risus&mauris=dap~
## 3 https://w3.org/sed/augue/aliquam/erat/volutpat.json?dictumst=mi&morbi=sit&ve~
## 4 http://indiatimes.com/pede/lobortis/ligula/sit/amet.jpg?quam=nullam&sollicit~
## 5 https://tumblr.com/id/mauris/vulputate/elementum.png?tincidunt=maecenas&eget~
## 6 https://unblog.fr/est/quam/pharetra.jpg?amet=phasellus&erat=sit&nulla=amet&t~
## 7 http://vinaora.com/posuere.jpg?convallis=in&nulla=faucibus&neque=orci&libero~
## 8 https://globo.com/accumsan.png?elementum=eu&pellentesque=mi&quisque=nulla&po~
## 9 https://xing.com/elementum/eu/interdum/eu/tincidunt.html?sit=proin&amet=eu&s~
## 10 https://bigcartel.com/tortor/quis/turpis/sed/ante/vivamus.html?in=lorem&elei~
## full_name currency passwords
## <chr> <chr> <chr>
## 1 Mufi Ruit ¥34.37 VybPYpEXUjJh6nQk
## 2 Leese Furmagier $67.37 mxET3n6dz42X8YUv
## 3 Blakelee Wilshire €33,85 Z9f4WeNVQ28FwKML
## 4 Terencio McIllrick €42,89 Ndbm8nwCps6jUze3
## 5 Debee McErlaine €13,19 U3Lj9xJw8NHzB5Sg
## 6 Fran Painten ¥87.35 KEhVAC3QNvjWDFJ7
## 7 Frasco Bowich $34.89 jydGPCW7fa2bZpU4
## 8 Car Ponten ¥41.66 pytVHesNZjAL8WKc
## 9 Tades Checcucci €70,80 Rsw4EQGk9tKTnzDp
## 10 Wilton Kemmey €62,76 KvrNGQ7yL3pfsaZA
Use str_detect() to detect @ and str_count() to count the number of times
@ appears in the email ids.
# detect @
str_detect(mockdata$email, pattern = "@")
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# count @
str_count(mockdata$email, pattern = "@")
## [1] 1 1 1 1 1 1 1 1 1 1
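The two checks can be folded into a single test. The sketch below assumes that "valid" means nothing more than "contains @ exactly once"; the last two ids are hypothetical malformed values added for illustration and are not part of the data set.

```r
library(stringr)

# Two ids from the sample plus two hypothetical malformed ids
ids <- c("mnewburn0@fastcompany.com", "kdodswell6@un.org",
         "no-at-sign.example.com", "two@@example.com")

# fixed() treats "@" as a literal string rather than a regular expression
has_one_at <- str_count(ids, pattern = fixed("@")) == 1
has_one_at
```

An id with no @ or with more than one @ yields FALSE, so the resulting logical vector can be used to filter the data.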
We can use str_c() to concatenate strings. Let us add the string email id: before each
email id in the data set.
str_c("email id:", mockdata$email)
## [1] "email id:mnewburn0@fastcompany.com"
## [2] "email id:mdankersley1@digg.com"
## [3] "email id:hgirhard2@altervista.org"
## [4] "email id:pmcmenamy3@sciencedirect.com"
## [5] "email id:drisbrough4@bandcamp.com"
## [6] "email id:cphlippi5@surveymonkey.com"
## [7] "email id:kdodswell6@un.org"
## [8] "email id:vhourihane7@ovh.net"
## [9] "email id:rdike8@timesonline.co.uk"
## [10] "email id:tdudbridge9@clickbank.net"
If we want to split a string into parts using a particular pattern, we use str_split().
Let us split the domain name and extension from the domain column in the data. The domain name
and extension are separated by . and we will use it to split the domain column. Since . is
a special character in regular expressions, we will use two backslashes to escape it.
str_split(mockdata$domain, pattern = "\\.")
## [[1]]
## [1] "addtoany" "com"
##
## [[2]]
## [1] "gmpg" "org"
##
## [[3]]
## [1] "samsung" "com"
##
## [[4]]
## [1] "spotify" "com"
##
## [[5]]
## [1] "wunderground" "com"
##
## [[6]]
## [1] "alexa" "com"
##
## [[7]]
## [1] "google" "it"
##
## [[8]]
## [1] "ed" "gov"
##
## [[9]]
## [1] "jigsy" "com"
##
## [[10]]
## [1] "jugem" "jp"
We can truncate a string using str_trunc(). By default, the end (right side) of the string
is truncated, but we can truncate the beginning or the central part of the string as well.
str_trunc(mockdata$email, width = 10)
## [1] "mnewbur..." "mdanker..." "hgirhar..." "pmcmena..." "drisbro..."
## [6] "cphlipp..." "kdodswe..." "vhourih..." "rdike8@..." "tdudbri..."
str_trunc(mockdata$email, width = 10, side = "left")
## [1] "...any.com" "...igg.com" "...sta.org" "...ect.com" "...amp.com"
## [6] "...key.com" "...@un.org" "...ovh.net" "...e.co.uk" "...ank.net"
str_trunc(mockdata$email, width = 10, side = "center")
## [1] "mnew...com" "mdan...com" "hgir...org" "pmcm...com" "dris...com"
## [6] "cphl...com" "kdod...org" "vhou...net" "rdik....uk" "tdud...net"
Strings can be sorted using str_sort(). Let us quickly sort the emails in both
ascending and descending orders.
str_sort(mockdata$email)
## [1] "cphlippi5@surveymonkey.com" "drisbrough4@bandcamp.com"
## [3] "hgirhard2@altervista.org" "kdodswell6@un.org"
## [5] "mdankersley1@digg.com" "mnewburn0@fastcompany.com"
## [7] "pmcmenamy3@sciencedirect.com" "rdike8@timesonline.co.uk"
## [9] "tdudbridge9@clickbank.net" "vhourihane7@ovh.net"
str_sort(mockdata$email, decreasing = TRUE)
## [1] "vhourihane7@ovh.net" "tdudbridge9@clickbank.net"
## [3] "rdike8@timesonline.co.uk" "pmcmenamy3@sciencedirect.com"
## [5] "mnewburn0@fastcompany.com" "mdankersley1@digg.com"
## [7] "kdodswell6@un.org" "hgirhard2@altervista.org"
## [9] "drisbrough4@bandcamp.com" "cphlippi5@surveymonkey.com"
The case of a string can be changed to upper, lower or title case using str_to_upper(), str_to_lower() and str_to_title(). The first two are shown below.
str_to_upper(mockdata$full_name)
## [1] "MUFI RUIT" "LEESE FURMAGIER" "BLAKELEE WILSHIRE"
## [4] "TERENCIO MCILLRICK" "DEBEE MCERLAINE" "FRAN PAINTEN"
## [7] "FRASCO BOWICH" "CAR PONTEN" "TADES CHECCUCCI"
## [10] "WILTON KEMMEY"
str_to_lower(mockdata$full_name)
## [1] "mufi ruit" "leese furmagier" "blakelee wilshire"
## [4] "terencio mcillrick" "debee mcerlaine" "fran painten"
## [7] "frasco bowich" "car ponten" "tades checcucci"
## [10] "wilton kemmey"
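Title case works the same way through str_to_title(). The sketch below uses names from the sample, deliberately written in mixed case for illustration.

```r
library(stringr)

# Names from the sample, written in mixed case for illustration
full_name <- c("mufi ruit", "LEESE FURMAGIER", "Terencio McIllrick")

# Capitalise the first letter of each word and lower-case the rest
title_case <- str_to_title(full_name)
title_case
```

Note that title case lower-cases everything after the first letter of a word, so a name such as McIllrick loses its internal capital.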
Parts of a string can be replaced using str_replace(). In the address column of the data set,
let us replace:
- Street with ST
- Road with RD
None of the first ten addresses contains Road, so the second call returns them unchanged.
str_replace(mockdata$address, "Street", "ST")
## [1] "8 Anhalt Crossing" "697 East Avenue"
## [3] "89 Dottie Circle" "98135 Blue Bill Park Drive"
## [5] "7814 Pennsylvania ST" "4897 Little Fleur Drive"
## [7] "53541 Morrow Center" "4819 Hermina Parkway"
## [9] "68096 Monument Park" "9595 Spaight Avenue"
str_replace(mockdata$address, "Road", "RD")
## [1] "8 Anhalt Crossing" "697 East Avenue"
## [3] "89 Dottie Circle" "98135 Blue Bill Park Drive"
## [5] "7814 Pennsylvania Street" "4897 Little Fleur Drive"
## [7] "53541 Morrow Center" "4819 Hermina Parkway"
## [9] "68096 Monument Park" "9595 Spaight Avenue"
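Both replacements can also be applied in one call. The sketch below assumes str_replace_all() with a named vector of pattern = replacement pairs; the second address is hypothetical and is included only so that the Road rule has something to match.

```r
library(stringr)

# One address from the sample, one hypothetical address containing "Road",
# and one address that matches neither rule
address <- c("7814 Pennsylvania Street", "12 Example Road", "8 Anhalt Crossing")

# Named vector: names are the patterns, values are the replacements
abbreviations <- c("Street" = "ST", "Road" = "RD")

short_address <- str_replace_all(address, abbreviations)
short_address
```

Unlike str_replace(), which changes only the first match in each string, str_replace_all() changes every match.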
We can extract parts of the string that match a particular pattern using str_extract().
str_extract(mockdata$email, pattern = "org")
## [1] NA NA "org" NA NA NA "org" NA NA NA
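Before we extract, we need to know whether the string contains text that matches our pattern.
Use str_match() to see if the pattern is present in the string. It returns a character matrix
with the matched text in the first column, or NA where there is no match.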
str_match(mockdata$email, pattern = "org")
## [,1]
## [1,] NA
## [2,] NA
## [3,] "org"
## [4,] NA
## [5,] NA
## [6,] NA
## [7,] "org"
## [8,] NA
## [9,] NA
## [10,] NA
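str_match() becomes more useful when the pattern contains capture groups, because each group is returned in its own column. The sketch below is illustrative: the pattern and the column names match, user and host are assumptions chosen for this example.

```r
library(stringr)

email <- c("kdodswell6@un.org", "rdike8@timesonline.co.uk")

# Group 1: everything before "@"; group 2: everything after it
parts <- str_match(email, pattern = "^([^@]+)@(.+)$")

# Column 1 holds the full match, the remaining columns hold the groups
colnames(parts) <- c("match", "user", "host")
parts
```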
If we are dealing with a character vector and know that the pattern we are looking at
is present in the vector, we might want to know the index of the strings in which it is
present. Use str_which() to identify the index of the strings that match our pattern.
str_which(mockdata$email, pattern = "org")
## [1] 3 7
Another objective might be to locate the position of the pattern we are looking for in the
string. For example, if we want to know the position of @ in the email ids, we can use
str_locate().
str_locate(mockdata$email, pattern = "@")
## start end
## [1,] 10 10
## [2,] 13 13
## [3,] 10 10
## [4,] 11 11
## [5,] 12 12
## [6,] 10 10
## [7,] 11 11
## [8,] 12 12
## [9,] 7 7
## [10,] 12 12
The length of the string can be computed using str_length(). Let us ensure that the length
of the strings in the passwords column is 16.
str_length(mockdata$passwords)
## [1] 16 16 16 16 16 16 16 16 16 16
We can extract parts of a string by specifying the starting and ending position using
str_sub(). Let us extract the currency type from the currency column.
str_sub(mockdata$currency, start = 1, end = 1)
## [1] "¥" "$" "€" "€" "€" "¥" "$" "¥" "€" "€"
One final function that we will look at before the case study is word(). It extracts
word(s) from sentences. We do not have any sentences in the data set, but let us use it
to extract the first and last name from the full_name column.
word(mockdata$full_name, 1)
## [1] "Mufi" "Leese" "Blakelee" "Terencio" "Debee" "Fran"
## [7] "Frasco" "Car" "Tades" "Wilton"
word(mockdata$full_name, 2)
## [1] "Ruit" "Furmagier" "Wilshire" "McIllrick" "McErlaine" "Painten"
## [7] "Bowich" "Ponten" "Checcucci" "Kemmey"
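Position 2 is the last name only when a name has exactly two words. As a minimal sketch, a negative index counts words from the end instead; the third name below is hypothetical and is included to show the difference.

```r
library(stringr)

# Two names from the sample plus a hypothetical three-word name
full_name <- c("Mufi Ruit", "Leese Furmagier", "Ana Maria Example")

# -1 selects the last word regardless of how many words precede it
last_name <- word(full_name, -1)
last_name
```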
Alright, now let us apply what we have learned so far to our case study.
8.4 Extract domain name from email ids
8.4.1 Steps
- split email using pattern @
- extract the second element from the resulting list
- split the above using pattern \\.
- extract the first element from the resulting list
Let us take a look at the emails before we extract the domain names.
emails <-
  mockstring %>%
  pull(email) %>%
  head()
emails
## [1] "mnewburn0@fastcompany.com" "mdankersley1@digg.com"
## [3] "hgirhard2@altervista.org" "pmcmenamy3@sciencedirect.com"
## [5] "drisbrough4@bandcamp.com" "cphlippi5@surveymonkey.com"
8.4.1.1 Step 1: Split email using pattern @
We will split the email using str_split(). It will split a string
using the pattern supplied. In our case the pattern is @.
str_split(emails, pattern = '@')
## [[1]]
## [1] "mnewburn0" "fastcompany.com"
##
## [[2]]
## [1] "mdankersley1" "digg.com"
##
## [[3]]
## [1] "hgirhard2" "altervista.org"
##
## [[4]]
## [1] "pmcmenamy3" "sciencedirect.com"
##
## [[5]]
## [1] "drisbrough4" "bandcamp.com"
##
## [[6]]
## [1] "cphlippi5" "surveymonkey.com"
8.4.1.2 Step 2: Extract the second element from the resulting list.
Step 1 returned a list. Each element of the list has two values. The first one is the username and the second is the domain name. Since we are extracting the domain name, we want the second value from each element of the list.
We will use map_chr() from purrr to extract the domain names. It will
return the second value from each element in the list. Since the domain
name is a string, map_chr() will return a character vector.
emails %>%
  str_split(pattern = '@') %>%
  map_chr(2)
## [1] "fastcompany.com" "digg.com" "altervista.org"
## [4] "sciencedirect.com" "bandcamp.com" "surveymonkey.com"
8.4.1.3 Step 3: Split the above using pattern \\.
We want the domain name and not the extension. Step 2 returned a
character vector and we need to split the domain name and the domain
extension. They are separated by a dot. Since . is a special character,
we will use \\ before . to escape it. Let us split the domain
name and domain extension using str_split() and \\. as the pattern.
emails %>%
  str_split(pattern = '@') %>%
  map_chr(2) %>%
  str_split(pattern = '\\.')
## [[1]]
## [1] "fastcompany" "com"
##
## [[2]]
## [1] "digg" "com"
##
## [[3]]
## [1] "altervista" "org"
##
## [[4]]
## [1] "sciencedirect" "com"
##
## [[5]]
## [1] "bandcamp" "com"
##
## [[6]]
## [1] "surveymonkey" "com"
8.4.1.4 Step 4: Extract the first element from the resulting list.
Now that we have separated the domain name from its extension, let us extract
the first value from each element in the list returned in step 3. We will again
use map_chr() to achieve this.
emails %>%
  str_split(pattern = '@') %>%
  map_chr(2) %>%
  str_split(pattern = '\\.') %>%
  map_chr(extract(1))
## [1] "fastcompany" "digg" "altervista" "sciencedirect"
## [5] "bandcamp" "surveymonkey"
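The four steps can also be collapsed into one regular expression. This is an alternative sketch, not the method used above: it assumes a lookbehind, (?<=@), to anchor the match just after @ and then takes every character up to the first dot.

```r
library(stringr)

emails <- c("mnewburn0@fastcompany.com",
            "hgirhard2@altervista.org",
            "rdike8@timesonline.co.uk")

# (?<=@) requires "@" immediately before the match without including it;
# [^.]+ then matches everything up to the first dot
domain_name <- str_extract(emails, pattern = "(?<=@)[^.]+")
domain_name
```

The step-by-step version is easier to debug, while the single pattern is shorter once the structure of the ids is known.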
8.5 Extract Domain Extension
The below code extracts the domain extension instead of the domain name.
emails %>%
  str_split(pattern = '@') %>%
  map_chr(2) %>%
  str_split(pattern = '\\.', simplify = TRUE) %>%
  extract(, 2)
## [1] "com" "com" "org" "com" "com" "com"
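Taking the second column works for the six ids above because each host contains a single dot. For a host such as timesonline.co.uk, which appears later in the data, the second piece would be co. The sketch below is one possible alternative, assuming the extension is defined as everything after the first dot of the host; it uses str_remove() rather than splitting.

```r
library(stringr)

emails <- c("mnewburn0@fastcompany.com", "rdike8@timesonline.co.uk")

# Drop the user name and the "@"
host <- str_remove(emails, pattern = "^[^@]*@")

# Drop the first label and its dot, keeping whatever follows
extension <- str_remove(host, pattern = "^[^.]+\\.")
extension
```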
8.6 Extract image type from URL
8.6.1 Steps
- split imageurl using pattern \\.
- extract the third value from each element of the resulting list
- subset the string using the index position
Let us take a look at the URL of the image.
img <-
  mockstring %>%
  pull(imageurl) %>%
  head()
img
## [1] "http://dummyimage.com/130x183.jpg/dddddd/000000"
## [2] "http://dummyimage.com/106x217.bmp/dddddd/000000"
## [3] "http://dummyimage.com/146x127.bmp/cc0000/ffffff"
## [4] "http://dummyimage.com/181x194.png/5fa2dd/ffffff"
## [5] "http://dummyimage.com/220x123.jpg/ff4444/ffffff"
## [6] "http://dummyimage.com/118x176.bmp/dddddd/000000"
8.6.1.1 Step 1: Split imageurl using pattern \\.
Let us split imageurl using str_split() and the pattern \\.
str_split(img, pattern = '\\.')
## [[1]]
## [1] "http://dummyimage" "com/130x183" "jpg/dddddd/000000"
##
## [[2]]
## [1] "http://dummyimage" "com/106x217" "bmp/dddddd/000000"
##
## [[3]]
## [1] "http://dummyimage" "com/146x127" "bmp/cc0000/ffffff"
##
## [[4]]
## [1] "http://dummyimage" "com/181x194" "png/5fa2dd/ffffff"
##
## [[5]]
## [1] "http://dummyimage" "com/220x123" "jpg/ff4444/ffffff"
##
## [[6]]
## [1] "http://dummyimage" "com/118x176" "bmp/dddddd/000000"
8.6.1.2 Step 2: Extract the third value from each element of the resulting list
Step 1 returned a list whose elements have 3 values each. If you
observe the list, the image type is in the 3rd value. We will now
extract the third value from each element of the list using map_chr().
img %>%
  str_split(pattern = '\\.') %>%
  map_chr(extract(3))
## [1] "jpg/dddddd/000000" "bmp/dddddd/000000" "bmp/cc0000/ffffff"
## [4] "png/5fa2dd/ffffff" "jpg/ff4444/ffffff" "bmp/dddddd/000000"
8.6.1.3 Step 3: Subset the string using the index position
We can now extract the image type in two ways:
- subset the first 3 characters of the string
- split the string using pattern / and extract the first value from the elements of the resulting list
Below is the first method. We know that the image type is 3 characters long, so
we use str_sub() to subset the first 3 characters. The index positions
are specified using start and end.
img %>%
  str_split(pattern = '\\.') %>%
  map_chr(extract(3)) %>%
  str_sub(start = 1, end = 3)
## [1] "jpg" "bmp" "bmp" "png" "jpg" "bmp"
If you are not sure about the length of the image type, use the second method:
split the string using pattern / and then use map_chr() to
extract the first value of each element of the resulting list.
img %>%
  str_split(pattern = '\\.') %>%
  map_chr(extract(3)) %>%
  str_split(pattern = '/') %>%
  map_chr(extract(1))
## [1] "jpg" "bmp" "bmp" "png" "jpg" "bmp"
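A capture group offers a third route that does not depend on the number of dots in the url. The sketch below assumes the image type always follows a WIDTHxHEIGHT. prefix and ends at the next slash, which holds for the sample urls shown above but is not guaranteed for other data.

```r
library(stringr)

img <- c("http://dummyimage.com/130x183.jpg/dddddd/000000",
         "http://dummyimage.com/181x194.png/5fa2dd/ffffff")

# Column 1 of the result is the full match, column 2 is the captured type
image_type <- str_match(img, pattern = "\\d+x\\d+\\.([a-z]+)/")[, 2]
image_type
```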
8.7 Extract Image Dimension from URL
8.7.1 Steps
- locate the first digit (0 to 9) in the url
- extract part of url starting with image dimension
- split the string using the pattern \\.
- extract the first element
8.7.1.1 Step 1: Locate numbers between 0 and 9.
Let us inspect the image url. The dimension of the image appears
after the domain extension and there are no digits in the url
before it. We will locate the position or index of the first digit
in the url using str_locate() and the pattern [0-9], which
matches any single digit from 0 to 9.
str_locate(img, pattern = "[0-9]")
## start end
## [1,] 23 23
## [2,] 23 23
## [3,] 23 23
## [4,] 23 23
## [5,] 23 23
## [6,] 23 23
8.7.1.2 Step 2: Extract url
We know where the dimension is located in the url. Let us extract the
part of the url that contains the image dimension using str_sub().
str_sub(img, start = 23)
## [1] "130x183.jpg/dddddd/000000" "106x217.bmp/dddddd/000000"
## [3] "146x127.bmp/cc0000/ffffff" "181x194.png/5fa2dd/ffffff"
## [5] "220x123.jpg/ff4444/ffffff" "118x176.bmp/dddddd/000000"
8.7.1.3 Step 3: Split the string using the pattern \\.
From the previous step, we have the part of the url that
contains the image dimension. To extract the dimension, we
will split it from the rest of the url using str_split()
and the pattern \\. as it separates the dimension
and the image extension.
img %>%
  str_sub(start = 23) %>%
  str_split(pattern = '\\.')
## [[1]]
## [1] "130x183" "jpg/dddddd/000000"
##
## [[2]]
## [1] "106x217" "bmp/dddddd/000000"
##
## [[3]]
## [1] "146x127" "bmp/cc0000/ffffff"
##
## [[4]]
## [1] "181x194" "png/5fa2dd/ffffff"
##
## [[5]]
## [1] "220x123" "jpg/ff4444/ffffff"
##
## [[6]]
## [1] "118x176" "bmp/dddddd/000000"
8.7.1.4 Step 4: Extract the first element.
The above step resulted in a list which contains the
image dimension and the rest of the url. Each element
of the list is a character vector. We want to extract
the first value in the character vector. Let us use
map_chr() to extract the first value from each
element of the list.
img %>%
  str_sub(start = 23) %>%
  str_split(pattern = '\\.') %>%
  map_chr(extract(1))
## [1] "130x183" "106x217" "146x127" "181x194" "220x123" "118x176"
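The hard-coded start position of 23 works because every sample url shares the same prefix. As a sketch that avoids the fixed index, the dimension can be matched directly and then split into numeric width and height. The names dimension, size, width and height are illustrative, and treating the first number as the width is an assumption.

```r
library(stringr)

img <- c("http://dummyimage.com/130x183.jpg/dddddd/000000",
         "http://dummyimage.com/181x194.png/5fa2dd/ffffff")

# One or more digits, a literal "x", then one or more digits
dimension <- str_extract(img, pattern = "\\d+x\\d+")

# simplify = TRUE returns a character matrix with one column per piece
size <- str_split(dimension, pattern = "x", simplify = TRUE)
width <- as.integer(size[, 1])
height <- as.integer(size[, 2])
```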
8.8 Extract HTTP Protocol from URL
url1 <-
  mockstring %>%
  pull(url) %>%
  first()
url1
## [1] "https://engadget.com/nascetur/ridiculus/mus/vivamus/vestibulum.jsp?eu=est&tincidunt=risus&in=auctor&leo=sed&maecenas=tristique&pulvinar=in&lobortis=tempus&est=sit&phasellus=amet&sit=sem&amet=fusce&erat=consequat&nulla=nulla&tempus=nisl&vivamus=nunc&in=nisl&felis=duis&eu=bibendum&sapien=felis&cursus=sed&vestibulum=interdum&proin=venenatis&eu=turpis&mi=enim&nulla=blandit&ac=mi&enim=in&in=porttitor&tempor=pede&turpis=justo&nec=eu&euismod=massa&scelerisque=donec&quam=dapibus&turpis=duis&adipiscing=at&lorem=velit&vitae=eu&mattis=est&nibh=congue&ligula=elementum&nec=in&sem=hac&duis=habitasse&aliquam=platea&convallis=dictumst&nunc=morbi&proin=vestibulum&at=velit&turpis=id&a=pretium&pede=iaculis&posuere=diam&nonummy=erat&integer=fermentum&non=justo&velit=nec&donec=condimentum&diam=neque&neque=sapien&vestibulum=placerat&eget=ante&vulputate=nulla&ut=justo&ultrices=aliquam&vel=quis&augue=turpis&vestibulum=eget&ante=elit&ipsum=sodales&primis=scelerisque&in=mauris&faucibus=sit&orci=amet&luctus=eros&et=suspendisse&ultrices=accumsan&posuere=tortor&cubilia=quis&curae=turpis&donec=sed&pharetra=ante&magna=vivamus&vestibulum=tortor&aliquet=duis&ultrices=mattis&erat=egestas&tortor=metus&sollicitudin=aenean&mi=fermentum&sit=donec"
8.8.1 Steps
- split the url using the pattern ://
- extract the first element
8.8.1.1 Step 1: Split the url using the pattern ://
The HTTP protocol is the first part of the url and is
separated from the rest of the url by ://. Let us
split the url using str_split() and the
pattern ://. None of the characters in :// is special
in a regular expression, so the pattern needs no escaping.
str_split(url1, pattern = '://')
## [[1]]
## [1] "https"
## [2] "engadget.com/nascetur/ridiculus/mus/vivamus/vestibulum.jsp?eu=est&tincidunt=risus&in=auctor&leo=sed&maecenas=tristique&pulvinar=in&lobortis=tempus&est=sit&phasellus=amet&sit=sem&amet=fusce&erat=consequat&nulla=nulla&tempus=nisl&vivamus=nunc&in=nisl&felis=duis&eu=bibendum&sapien=felis&cursus=sed&vestibulum=interdum&proin=venenatis&eu=turpis&mi=enim&nulla=blandit&ac=mi&enim=in&in=porttitor&tempor=pede&turpis=justo&nec=eu&euismod=massa&scelerisque=donec&quam=dapibus&turpis=duis&adipiscing=at&lorem=velit&vitae=eu&mattis=est&nibh=congue&ligula=elementum&nec=in&sem=hac&duis=habitasse&aliquam=platea&convallis=dictumst&nunc=morbi&proin=vestibulum&at=velit&turpis=id&a=pretium&pede=iaculis&posuere=diam&nonummy=erat&integer=fermentum&non=justo&velit=nec&donec=condimentum&diam=neque&neque=sapien&vestibulum=placerat&eget=ante&vulputate=nulla&ut=justo&ultrices=aliquam&vel=quis&augue=turpis&vestibulum=eget&ante=elit&ipsum=sodales&primis=scelerisque&in=mauris&faucibus=sit&orci=amet&luctus=eros&et=suspendisse&ultrices=accumsan&posuere=tortor&cubilia=quis&curae=turpis&donec=sed&pharetra=ante&magna=vivamus&vestibulum=tortor&aliquet=duis&ultrices=mattis&erat=egestas&tortor=metus&sollicitudin=aenean&mi=fermentum&sit=donec"
8.8.1.2 Step 2: Extract the first element.
The HTTP protocol is the first value in each element
of the list. As we did in the previous example, we
will extract it using map_chr() and extract().
url1 %>%
  str_split(pattern = '://') %>%
  map_chr(extract(1))
## [1] "https"
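The same result can be obtained without splitting. The sketch below is an alternative that assumes a lookahead, (?=://), to match the letters at the start of the url only when they are followed by ://. The urls are the first two from the data with the path and query string cut short.

```r
library(stringr)

# First two urls from the data, shortened for readability
urls <- c("https://engadget.com/nascetur/ridiculus",
          "http://delicious.com/phasellus/in")

# ^ anchors at the start; (?=://) checks what follows without consuming it
protocol <- str_extract(urls, pattern = "^[a-z]+(?=://)")
protocol
```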
8.9 Extract file type
urls <-
  mockstring %>%
  use_series(url) %>%
  extract(1:3)
8.9.1 Steps
- check if there are only 2 dots in the URL
- check if there is only 1 question mark in the URL
- detect the starting position of file type
- detect the ending position of file type
- use the locations to specify the index position for extracting file type
8.9.1.1 Step 1: Check if there are only 2 dots in the URL
Let us locate all the dots in the urls using str_locate_all() and see
if any url contains more than 2 dots.
urls %>%
  str_locate_all(pattern = '\\.') %>%
  map_int(nrow) %>%
  is_greater_than(2) %>%
  sum()
## [1] 0
8.9.1.2 Step 2: Check if there is only 1 question mark in the URL
The next step is to check if there is only one ? (question mark)
in the url.
urls %>%
  str_locate_all(pattern = "[?]") %>%
  map_int(nrow) %>%
  is_greater_than(1) %>%
  sum()
## [1] 0
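Both checks count occurrences, so str_count() from the overview can do the same job with less code. The sketch below uses the first two urls from the data with the query string cut short; the names n_dots, n_qmarks and too_many are illustrative.

```r
library(stringr)

# First two urls from the data, query strings shortened for readability
urls <- c("https://engadget.com/nascetur/ridiculus/mus/vivamus/vestibulum.jsp?eu=est",
          "http://delicious.com/phasellus/in/felis/donec.json?interdum=risus")

# fixed() matches the characters literally, so no escaping is needed
n_dots <- str_count(urls, pattern = fixed("."))
n_qmarks <- str_count(urls, pattern = fixed("?"))

# Number of urls that break either assumption
too_many <- sum(n_dots > 2 | n_qmarks > 1)
too_many
```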
8.9.1.3 Step 3: Detect the starting position of file type
Since the file type is located between the second dot and the first question mark in the url, let us extract the location of the second dot and add 1 as the file type starts after the dot.
d <-
  urls %>%
  str_locate_all(pattern = '\\.') %>%
  map_int(extract(2)) %>%
  add(1)
d
## [1] 64 47 48
8.9.1.4 Step 4: Detect the ending position of file type
In step 2, we confirmed that the url has only one question mark. Let us locate the question mark in the url and subtract 1 (as the file type ends before the question mark) so that we get the ending position of the file type.
q <-
  urls %>%
  str_locate_all(pattern = "[?]") %>%
  map_int(extract(1)) %>%
  subtract(1)
q
## [1] 66 50 51
8.9.1.5 Step 5: Specify the index position for extracting file type
From steps 3 and 4, we have the location of the second dot and the
first question mark in the url. Let us use them with str_sub()
to extract the file type.
str_sub(urls, start = d, end = q)
## [1] "jsp" "json" "json"
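For comparison, the positions found in steps 3 and 4 can be replaced by a single capture group. This is a sketch of an alternative, assuming the file type is the run of letters or digits between a dot and the question mark; the urls are the first two from the data with the query string cut short.

```r
library(stringr)

# First two urls from the data, query strings shortened for readability
urls <- c("https://engadget.com/nascetur/ridiculus/mus/vivamus/vestibulum.jsp?eu=est",
          "http://delicious.com/phasellus/in/felis/donec.json?interdum=risus")

# A literal dot, the captured file type, then a literal question mark
file_type <- str_match(urls, pattern = "\\.([A-Za-z0-9]+)\\?")[, 2]
file_type
```

Because the pattern requires the question mark right after the file type, the dot in the domain name is skipped without counting dots first.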