XML / HTML tables / Web scraping (static HTML)
Data import is a surprisingly large part of any data project. Below is a compact but practical reference of common file types, recommended packages/functions, example code, and real-world tips (encoding, memory, speed, cloud, DBs, troubleshooting).
Quick workspace prep (repeat — but important)
# Check working dirgetwd()# Set working dir (if you must)setwd("/path/to/project")# Start with a clean environment (use with care)rm(list = ls())# Helpful: show fileslist.files()
Base R vs modern tidyverse / high-performance packages
- Base R (
read.table,read.csv,read.delim) works everywhere and is flexible. - readr (
read_csv,read_delim) — faster, consistent parsing, tidyverse-friendly. - data.table::fread — extremely fast for large CSV-like files.
- readxl — best for Excel (no external dependencies).
- haven — best for SAS/SPSS/Stata (via ReadStat).
- jsonlite — modern JSON parsing.
- DBI + specific DB backends — standard interface for relational DBs.
- arrow — for Feather/Parquet and interop with Python/Spark; excellent for large data and columnar formats.
CSV / delimited text
Recommended
- Small-medium files:
readr::read_csv() - Very large:
data.table::fread()orvroom::vroom()
Examples
# readr - tidy defaults, prints parsing problemsinstall.packages("readr")library(readr)df <- read_csv("data/sales.csv") # commadf2 <- read_delim("data/data.txt", "\t") # tab separated# data.table: blazing fast, automatic type detectioninstall.packages("data.table")library(data.table)dt <- fread("data/sales.csv")# vroom: loads lazily and is fast for many filesinstall.packages("vroom")library(vroom)df_v <- vroom("data/sales.csv")
Useful arguments
locale = locale(encoding = "UTF-8")— handle encodingscol_types = cols(...)— force column types (avoid mis-parsing)na = c("", "NA", "NULL")— specify missing tokensn_max— read only first N rows (quick inspect)
Excel (XLS / XLSX)
Recommended
readxl::read_excel()— no Java dependency, reliabletidyxl,openxlsx— if you need formulas or writing
Example
install.packages("readxl")library(readxl)# read first sheetdf <- read_excel("data/book.xlsx")# read a specific sheet by name or indexdf_sheet3 <- read_excel("data/book.xlsx", sheet = "Sales")df_sheet2 <- read_excel("data/book.xlsx", sheet = 2)# read a rangedf_range <- read_excel("data/book.xlsx", range = "A1:F100")
JSON
Recommended
jsonlite::fromJSON()— robust, converts arrays to data frames where appropriate
install.packages("jsonlite")library(jsonlite)# from local filej <- fromJSON("data/doc.json", flatten = TRUE)# from URLj2 <- fromJSON("https://api.example.com/data")
Tip: If JSON contains nested lists, use flatten=TRUE or manually tidyr::unnest() to tidy.
XML / HTML tables / Web scraping (static HTML)
Recommended
xml2+rvestfor HTML scraping / table extractionXMLfor more advanced XML parsing (butxml2recommended)
install.packages(c("xml2","rvest"))library(xml2); library(rvest)page <- read_html("https://example.com/page-with-tables")tables <- html_table(html_nodes(page, "table")) # list of tibbles
Excel-like & legacy formats (ODS, Google Sheets)
- Google Sheets:
googlesheets4(supports OAuth & public sheets) - ODS:
readODSpackage
# Google Sheets (best to follow auth flow)install.packages("googlesheets4")library(googlesheets4)sheet <- read_sheet("https://docs.google.com/spreadsheets/...")
SAS / SPSS / Stata
Recommended
haven(fast, preserves labelled variables)foreignis older;havenpreferred now
install.packages("haven")library(haven)df_sas <- read_sas("data/data.sas7bdat")df_spss <- read_sav("data/data.sav")df_stata <- read_dta("data/data.dta")
Tip: haven keeps value labels (labelled class). Use labelled::to_factor() to convert labeled columns to factors.
MATLAB / Octave
R.matlab::readMat()for.matfilesforeign::read.octave()for Octave text data
install.packages("R.matlab")library(R.matlab)m <- readMat("data/matrix.mat")
Parquet, Feather, Arrow — modern columnar formats
- Use
arrowpackage to read/write Parquet and Feather; very efficient for large datasets and cross-language sharing.
install.packages("arrow")library(arrow)# read parquettbl <- read_parquet("data/data.parquet")# write parquetwrite_parquet(df, "out/data.parquet")
Benefit: Arrow allows out-of-memory, zero-copy reads, and is ideal for large-scale pipelines.
ODBC / Relational DBs (SQL Server, Postgres, MySQL, etc.)
Recommended approach
DBI+ driver package (RPostgres,RMySQL,odbc) — standard interfacepoolfor connection pooling in apps
Example: Postgres via DBI
install.packages(c("DBI","RPostgres"))library(DBI)con <- dbConnect(RPostgres::Postgres(), dbname = "mydb", host = "db.example.com", port = 5432, user = "me", password = "pw")# run a querydf <- dbGetQuery(con, "SELECT * FROM schema.table LIMIT 1000")# fetch a big table in chunksres <- dbSendQuery(con, "SELECT * FROM big_table")chunk <- dbFetch(res, n = 10000)# iterate until exhausteddbClearResult(res)dbDisconnect(con)
ODBC (Windows / DSN)
install.packages("odbc")library(DBI)con <- dbConnect(odbc::odbc(), DSN = "my_dsn", UID = "user", PWD = "pw")dbListTables(con)dbDisconnect(con)
Tip: Avoid SELECT * on huge tables — use column selection and SQL-side filtering.
Cloud storage (S3, GCS, Azure)
- S3:
aws.s3,arrowcan read directly from S3 URIs - Google Cloud Storage:
gcsfs(less common in R), or usearrow/gcsintegration - Azure:
AzureStor
Example with aws.s3:
install.packages("aws.s3")library(aws.s3)# must configure AWS credentials (env vars or ~/.aws/credentials)s3read_using(FUN = read.csv, object = "s3://my-bucket/data.csv")
arrow::read_parquet("s3://bucket/path/file.parquet") also works when configured correctly.
Reading in chunks / streaming large files
If data is too big to fit in memory:
- Use database imports (load into DB and query subsets)
- Use
data.table::fread()withselect=to read only columns - Use
vroom+dplyrfor delayed reads - For complex chunking:
readr::read_lines_chunked()or process viaLaForiotoolspackages
Example chunked processing with LaF:
install.packages("LaF")library(LaF)laf <- laf_open_csv("big.csv", column_types = c("integer","double","string"), nrows = 1e8)block <- laf[1:100000, ] # read first 100k rows
Parsing dates, times, and types
Parsing types properly avoids nasty surprises later.
- Use
col_typesinreadrorfread'scolClasses - Use
lubridatefor robust date parsing
install.packages("lubridate")library(lubridate)# convert common formatsdf$date <- ymd(df$date_string) # "2023-07-01"df$datetime <- ymd_hms(df$ts) # "2023-07-01 08:45:00"
Encoding & locale issues
If you see weird characters:
# readrdf <- read_csv("file.csv", locale = locale(encoding = "latin1"))# base read.tabledf <- read.table("file.txt", fileEncoding = "UTF-8")
Check Encoding() and iconv() to convert.
Common import pitfalls and troubleshooting
- Headers misaligned / comments: use
skip = n,comment = "#", orskip_empty_rows = TRUE. - Inconsistent column types: pre-specify column types (using
col_typesorcolClasses) to avoid type coercion. - NA tokens: set
na = c("", "NA", "n/a"). - Thousands separators / decimal: specify
locale = locale(decimal_mark = ",", grouping_mark = "."). - Memory errors: filter on SQL side, load a subset, or use
arrow/Parquet to read columns only.
Quick hacks & productivity tips
- Clipboard: quick copy-paste into R (Windows/Mac):
df <- read.table("clipboard", header = TRUE, sep = "\t")
- Peek first rows before full import:
readr::read_lines("file.csv", n_max = 10)
- Inspect parsing problems with
readr::problems()afterread_csv(). - Use
col_types = cols()to force types and prevent guessing mistakes. - Use projects & relative paths (RStudio projects) to avoid brittle
setwd().
Recommended import pipeline (small project)
- Inspect the file (
read_lines,head,file.info()). - Try a small sample read (
n_max,skip). - Specify explicit
col_typesorcolClasses. - Load with a high-performance reader (
fread/vroom/arrow) for big data. - Convert dates and factor levels with
lubridateandforcats. - Persist as Parquet (
arrow) if reusing:arrow::write_parquet(df, "data.parquet").
Example: end-to-end — CSV → cleaned parquet
library(data.table)library(lubridate)library(arrow)# fast readdt <- fread("raw/sales.csv", na.strings = c("", "NA"))# clean typesdt[, order_date := ymd(order_date)]dt[, amount := as.numeric(amount)]# sample checkhead(dt)# persist for faster reads laterwrite_parquet(dt, "clean/sales.parquet")
Useful packages summary (short cheat-sheet)
TaskPackage / Function
Fast CSV read
data.table::fread()
Tidy read
readr::read_csv()
Excel
readxl::read_excel()
JSON
jsonlite::fromJSON()
SPSS/Stata/SAS
haven::read_sav(), read_dta(), read_sas()
Parquet / Arrow
arrow::read_parquet() / write_parquet()
DB connections
DBI + RPostgres / odbc
S3
aws.s3, arrow
Google Sheets
googlesheets4
HTML tables
rvest::html_table()
Chunked reading
LaF, readr::read_lines_chunked()
Labelled variables
labelled
Final notes & next steps
- Choose the right tool for your dataset size:
fread/vroom/arrowfor big files;readr/readxlfor comfort and parsing diagnostics. - Preserve raw files and create a reproducible import script (save
col_types, date parsing logic). - For production pipelines, prefer Parquet/Arrow or a DB as canonical storage (fast, typed, efficient).
- If you want, I can convert any of your examples above into a runnable RMarkdown notebook with sample files and tests.
At Perceptive Analytics, our mission is “to enable businesses to unlock value in data.” For over 20 years, we’ve partnered with more than 100 clients—from Fortune 500 companies to mid-sized firms—to solve complex data analytics challenges. Our services include tableau consultancy and advanced analytics consulting, turning data into strategic insight. We would love to talk to you. Do reach out to us.
Comments
Post a Comment