Let me use an example to illustrate this topic:
A Chinese character: 汉
Its Unicode value: U+6C49
Convert 6C49 to binary: 01101100 01001001
Nothing magical so far; it's very simple. Now, let's say we decide to store this character on our hard drive. To do that, we need to store the character in binary format. We can simply store it as is: '01101100 01001001'. Done!
But wait a minute, is '01101100 01001001' one character or two characters? You know this is one character because I told you, but when a computer reads it, it has no idea. So we need some sort of "encoding" to tell the computer to treat it as one.
This is where the rules of 'UTF-8' come in: http://www.fileformat.info/info/unicode/utf8.htm
Binary format of bytes in sequence
| 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | Number of Free Bits | Maximum Expressible Unicode Value |
| --- | --- | --- | --- | --- | --- |
| 0xxxxxxx | | | | 7 | 007F hex (127) |
| 110xxxxx | 10xxxxxx | | | (5+6)=11 | 07FF hex (2047) |
| 1110xxxx | 10xxxxxx | 10xxxxxx | | (4+6+6)=16 | FFFF hex (65535) |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | (3+6+6+6)=21 | 10FFFF hex (1,114,111) |
According to the table above, if we want to store this character using the 'UTF-8' format, we need to prefix our character with some 'headers'. Our Chinese character is 16 bits long (count the binary value yourself), so the 11 free bits of row 2 are not enough and we will use the format on row 3, which provides 16:
| Header | Placeholder | Fill in our binary | Result |
| --- | --- | --- | --- |
| 1110 | xxxx | 0110 | 11100110 |
| 10 | xxxxxx | 110001 | 10110001 |
| 10 | xxxxxx | 001001 | 10001001 |
Writing out the result in one line:
11100110 10110001 10001001
This is the UTF-8 (binary) value of the Chinese character! (Confirm it yourself.)
Summary
A Chinese character: 汉
Its Unicode value: U+6C49
convert 6C49 to binary: 01101100 01001001
UTF-8 binary: 11100110 10110001 10001001
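The manual steps above can be cross-checked with a few lines of Python. This is only a minimal sketch: the helper name utf8_encode_3byte is made up for illustration, and it handles only code points that fit the three-byte row of the table (U+0800 to U+FFFF). The result is compared against Python's built-in encoder.

```python
def utf8_encode_3byte(code_point):
    """Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the 3-byte range")
    byte1 = 0b11100000 | (code_point >> 12)           # header 1110 + top 4 bits
    byte2 = 0b10000000 | ((code_point >> 6) & 0x3F)   # header 10 + middle 6 bits
    byte3 = 0b10000000 | (code_point & 0x3F)          # header 10 + low 6 bits
    return bytearray([byte1, byte2, byte3])


encoded = utf8_encode_3byte(0x6C49)
bits = " ".join("{0:08b}".format(b) for b in encoded)
print(bits)

# Compare with the encoder that ships with Python.
print(bytes(encoded) == u"\u6c49".encode("utf-8"))
```

The shifts and masks correspond to the "fill in our binary" column: the 16 bits are split into groups of 4, 6 and 6, and each group is placed behind its header.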
A common error when using Python 2.x is:
UnicodeEncodeError: 'ascii' codec can't encode character ... : ordinal not in range(128)
Before getting to the bottom of it, three things need to be understood first:
- UTF-8 vs Unicode
- Encoding vs Decoding
- str and unicode in Python 2.7
1. UTF-8 vs Unicode
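This was already explained in an earlier post (http://cheng.logdown.com/posts/2015/01/14/utf-8-vs-unicode), so here is a summary:
- Unicode is just a mapping between characters and numbers. For example, the code for '汉' in Unicode is '6C49', which corresponds to the number 27721. Every character in every language on Earth has such a number. This table that maps characters to numbers is Unicode.
- When we store this number on a computer, it has to be converted to 0s and 1s. We could simply convert 27721 to the binary '01101100 01001001' and store that. But how would the computer know that these two bytes represent one character rather than two? This is where another layer of encoding is needed, to tell the computer to treat these bytes as a single character. That encoding is UTF-8 (UTF-8 is, of course, only one of many encodings).
It can be understood as this sequence:
Character seen on screen -> Unicode code -> What is stored in memory or on disk according to the UTF-8 rules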
汉 -> 6C49 -> 11100110 10110001 10001001
2. Encoding vs Decoding
In Python, converting a unicode object into 0s and 1s is called encoding. Converting 0s and 1s back into a unicode object is called decoding.
In Python 2.7, ASCII is the default codec for both encoding and decoding.
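A minimal round trip, as a sketch: encoding turns text into bytes, and decoding turns bytes back into text. The u'' and b'' prefixes are used so that the same lines should behave the same way on Python 2.7 and Python 3.

```python
text = u"\u6c49"                # the character 汉 as Unicode text
raw = text.encode("utf-8")      # encoding: text -> bytes
print(repr(raw))

back = raw.decode("utf-8")      # decoding: bytes -> text
print(back == text)

# ASCII cannot represent this character, so encoding with it fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc.reason)
```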
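3. str and unicode in Python 2.7
When you assign a character to a variable (whether it is an English letter or a character that ASCII cannot represent):
han = '汉'
the type of the variable is always str: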
In [113]:han = '汉'
type(han)
Out[113]:str
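But there is one very important point to understand here:
In [124]:han = '汉'
bin_han = '\xe6\xb1\x89'
han == bin_han # Although the interface shows the character '汉', it is really a sequence of bytes, not the character we see
Out[124]:True
In essence, a str is just raw byte data. It is not the '汉' that we see!
Is the same true of the unicode type?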
In [114]:uni_han= u'汉'
type(uni_han)
Out[114]:unicode
In [131]:uni_han= u'汉'
u_han = u'\u6c49'
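uni_han == u_han # What the unicode object stores is u'\u6c49', not the u'汉' you see
Out[131]:True
With these three points understood, the cause of the 'ascii' codec can't encode character error becomes easy to explain.
Explaining 'ascii' codec can't encode character with an example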
In [117]:han = '汉'
print type(han)
print len(han)
str(han)
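Out[117]:<type 'str'>
3 <- The length of '汉' is 3. It is clearly one character, so why is the length 3?
'\xe6\xb1\x89' <- The answer is here
When the character '汉' is stored in memory, it is stored as the three bytes '\xe6\xb1\x89'. So len() reports 3 rather than 1. To turn '汉' into a real character, we need to decode it. (See section 2: converting 0s and 1s into Unicode is called decoding.)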
In [125]:str.decode(han)
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-125-3ff96a3a19da> in <module>()
----> 1 str.decode(han)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
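Here Python raised an exception, because the default ASCII codec cannot decode this character: its byte values are outside the range 0 - 127. So we need to decode with UTF-8 instead:
In [127]:str.decode(han, 'utf8')
Out[127]:u'\u6c49'
Here the variable han is successfully decoded to u'\u6c49'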
In [142]:uni_han = u'\u6c49'
len(uni_han)
Out[142]:1 <- The length is now the correct value, 1
One more example to finish
Guess what this code prints:
uni_han = u'\u6c49'
print '\'{0}\'的长度是{1}'.format(uni_han, len(uni_han))
The result is:
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-143-9bec6fa25583> in <module>()
1 uni_han = u'\u6c49'
----> 2 print '\'{0}\'的长度是{1}'.format(uni_han, len(uni_han))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u6c49' in position 0: ordinal not in range(128)
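How sad; I thought I would never run into this problem again. So what went wrong? The part '\'{0}\'的长度是{1}' (a format string meaning: the length of '{0}' is {1}) is a str, that is, raw byte data. We are trying to mix a unicode object (uni_han) into it and print them together. At this point Python confidently uses the default ASCII codec to encode uni_han. The result is predictable: the value is once again outside the range 0 - 127 and cannot be encoded. So we need to tell Python to give up on ASCII and use UTF-8 instead: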
In [145]:uni_han = u'\u6c49'
print '\'{0}\'的长度是{1}'.format(unicode.encode(uni_han,'utf8'), len(uni_han))
'汉'的长度是1
Another option is to make the format string itself a unicode object:
In [150]:uni_han = u'\u6c49'
print u'\'{0}\'的长度是{1}'.format(uni_han, len(uni_han))
'汉'的长度是1
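Summary
In Python 2.x, str is raw bytes and unicode is Unicode text; the two cannot be mixed freely. When text is written to disk or printed to the screen, it must be encoded with the right encoding. Python 2 defaults to ASCII, but in practice UTF-8 is usually what you want. This problem will keep coming up. The key is to understand the definitions of ASCII, UTF-8, Unicode, encoding and decoding, and how they relate to each other.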
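One way to follow this advice is to name the encoding explicitly whenever text crosses a boundary such as a file. The sketch below uses io.open, which accepts an encoding argument on both Python 2.7 and Python 3. The temporary file is only there to keep the example self-contained.

```python
import io
import os
import tempfile

text = u"\u6c49"

# Create a scratch file so the example does not touch real data.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Write the text with an explicit encoding instead of the default.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # On disk the character occupies three bytes.
    with io.open(path, "rb") as f:
        raw = f.read()

    # Reading with the same encoding gives back the single character.
    with io.open(path, "r", encoding="utf-8") as f:
        restored = f.read()
finally:
    os.remove(path)

print(len(raw), len(restored))
```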
<img src='image.png'>
Markdown syntax won't work in this case; you have to use raw HTML like the above.
Reference:
http://stackoverflow.com/questions/10628262/inserting-image-into-ipython-notebook-markdown
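Checking each element of a tuple
Suppose there is a tuple that contains several elements:
p = (170, 0.1, 0.6)
if p[1] >= 0.5:
    print u'So deep'
if p[2] >= 0.5:
    print u'So bright'
There is nothing wrong with this code, but when writing it you have to remember what each element of the tuple stands for in order to print the right description. To make the code easier to read:
from collections import namedtuple
Color = namedtuple('Color', ['hue', 'saturation', 'luminosity'])
p = Color(170, 0.1, 0.6)
if p.saturation >= 0.5:  # namedtuple fields are read as attributes, not with p['saturation']
    print u'So deep'
if p.luminosity >= 0.5:
    print u'So bright'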
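A short sketch of why this is a low-risk change: a namedtuple is still a regular tuple, so positional indexing and unpacking keep working alongside the named fields. The Color type is the one defined above, and the values are just sample data.

```python
from collections import namedtuple

Color = namedtuple('Color', ['hue', 'saturation', 'luminosity'])
p = Color(170, 0.1, 0.6)

# Fields can be read by name or by position.
print(p.saturation == p[1])

# Unpacking works exactly as with a plain tuple.
hue, saturation, luminosity = p
print(hue, saturation, luminosity)

# The repr includes the field names, which helps when debugging.
print(repr(p))
```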
Counting repeated items in a list
Suppose there is a list called colors, and we need to count how many times each color name appears in it:
colors = ['red', 'green', 'red', 'blue', 'green', 'red']
d = {}
The usual way:
for color in colors:
    if color not in d:
        d[color] = 0
    d[color] += 1
A slightly better way:
for color in colors:
    d[color] = d.get(color, 0) + 1
The best way:
from collections import defaultdict
d = defaultdict(int)
for color in colors:
    d[color] += 1
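As a quick check, the sketch below runs the three versions side by side and confirms that they produce the same counts. It also compares them against collections.Counter from the standard library, which is a different tool from the ones shown above but covers the same task.

```python
from collections import Counter, defaultdict

colors = ['red', 'green', 'red', 'blue', 'green', 'red']

# Plain dict with an explicit membership test.
basic = {}
for color in colors:
    if color not in basic:
        basic[color] = 0
    basic[color] += 1

# dict.get supplies 0 for a missing key.
with_get = {}
for color in colors:
    with_get[color] = with_get.get(color, 0) + 1

# defaultdict(int) creates missing entries as 0 automatically.
with_default = defaultdict(int)
for color in colors:
    with_default[color] += 1

print(basic == with_get == dict(with_default) == dict(Counter(colors)))
```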
Grouping items with a dictionary
There is a list whose items need to be grouped by length:
names = ['raymond', 'rachel', 'matthew', 'roger', 'betty', 'melissa', 'judith', 'charlie']
The usual way:
d = {}
for name in names:
    key = len(name)
    if key not in d:
        d[key] = []
    d[key].append(name)
A slightly better way:
d = {}
for name in names:
    key = len(name)
    d.setdefault(key, []).append(name)
The best way:
from collections import defaultdict
d = defaultdict(list)
for name in names:
    key = len(name)
    d[key].append(name)
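Using keyword arguments
twitter_search('@obama', False, 20, True)
Without reading the body of the twitter_search function, it is impossible to tell what these arguments mean. What if it were rewritten like this:
twitter_search('@obama', retweets=False, numtweets=20, popular=True)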
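The definition of twitter_search is not shown in this post. The stub below is purely hypothetical: its parameter names are taken from the keyword call above, its default values are invented, and it does no real searching. It only illustrates that the positional and keyword calls bind the same values.

```python
def twitter_search(query, retweets=True, numtweets=10, popular=False):
    """Hypothetical stub: return the arguments it received as a dict."""
    return {
        'query': query,
        'retweets': retweets,
        'numtweets': numtweets,
        'popular': popular,
    }


positional = twitter_search('@obama', False, 20, True)
keyword = twitter_search('@obama', retweets=False, numtweets=20, popular=True)
print(positional == keyword)
```

Both calls do the same thing, but only the second one can be understood without opening the function definition.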
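Updating multiple variables at once
When programming, you often need a temporary variable to hold a value and then read it back a moment later:
t = y
y = x + y
x = t
The best way:
x, y = y, x + y
All values on the right-hand side of the equals sign are the old values. The advantage of this form is that you no longer have to worry about the order of the lines.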
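A small sketch of the same idiom in context, using a Fibonacci sequence as the example. The function name fibonacci is chosen here for illustration.

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers using tuple assignment."""
    result = []
    x, y = 0, 1
    for _ in range(n):
        result.append(x)
        # Both right-hand values are computed from the old x and y.
        x, y = y, x + y
    return result


print(fibonacci(10))
```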
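Modifying the start of a sequence
Changing the element at the front of a list often slows a program down, because every remaining element has to be shifted:
names = ['raymond', 'rachel', 'matthew', 'roger', 'betty', 'melissa', 'judith', 'charlie']
# Any of the following operations is slow
del names[0]
names.pop(0)
names.insert(0, 'mark')
The best way:
from collections import deque
# Turn names into a data type that supports adding and removing at both ends
names = deque(['raymond', 'rachel', 'matthew', 'roger', 'betty', 'melissa', 'judith', 'charlie'])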
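With a deque, the front of the sequence has its own methods. A minimal sketch, using a shortened list of names to keep the output small:

```python
from collections import deque

names = deque(['raymond', 'rachel', 'matthew'])

names.popleft()            # remove from the left end
names.appendleft('mark')   # add to the left end
names.append('judith')     # the right end works as with a list

print(list(names))
```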
Sources:
https://www.youtube.com/watch?v=OSGv2VnC0go
https://www.youtube.com/watch?v=wf-BqAjZb8M
I had to deploy a Django project on CentOS 6.5, and here are the steps to set up Python and then Django. The problem with CentOS 6.5 is that it comes with Python 2.6.6, and its package installer, yum, relies on it.
Although we can install Python 2.7.x and soft-link it as the system default, doing so will break yum and who knows what else that relies on 2.6.6. Thus, I am installing Python 2.7.x via pyenv, as suggested by this blog post.
This solution is cleaner as it does not tamper with the OS default Python 2.6.6.
- Update all of yum's packages:
yum -y update
- Install pyenv, which works like rbenv:
curl -L https://raw.githubusercontent.com/yyuu/pyenv-installer/master/bin/pyenv-installer | bash
Put the following lines into your .bashrc file:
export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
Then, reload bashrc:
. .bashrc
This will enable the pyenv command in bash.
- Install these packages before installing python:
yum install zlib-devel bzip2 bzip2-devel readline-devel sqlite sqlite-devel mysql-devel
- Use this command to list all of the 2.7.x versions of Python (grep for 3. to list Python 3 versions):
pyenv install -l | grep 2.7
- Install the python version you prefer:
pyenv install 2.7.10
- Use the newly installed version of python:
pyenv local 2.7.10
You can list all Python versions available to pyenv with:
pyenv versions
- Update pip to its latest version:
pip install -U pip
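To confirm that the shell is really picking up the pyenv-managed interpreter, one option is a short script like the sketch below, run with the plain python command. It only prints what the running interpreter reports about itself; the expected values (a path under ~/.pyenv and version 2.7.10) are what this walkthrough would suggest, not something the script enforces.

```python
import sys

# Path of the interpreter that is actually running this script.
print(sys.executable)

# Major, minor and micro version numbers joined with dots.
version = ".".join(str(part) for part in sys.version_info[:3])
print(version)
```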
Now we have everything set up and ready to go. The natural thing to do next is to install virtualenv and virtualenvwrapper:
Installing virtualenv and virtualenvwrapper
WARNING: DO NOT install virtualenv and virtualenvwrapper !!!
Let me tell you what happened after I installed both via pip. I put the following lines in .bashrc:
export WORKON_HOME=$HOME/.virtualenvs
source ~/.pyenv/shims/virtualenvwrapper.sh #virtualenvwrapper.sh got installed to this dir instead of /usr/local/bin/
Then, I was disconnected from the cloud server with the following warning:
pyenv: -bash: command not found
No matter what I tried, I could not connect to the cloud server anymore, because every time I logged in successfully, I got disconnected from the server with the same error above. I could not even log in using the VNC connection provided by the cloud service provider. The only option I had was to reinstall the image on the cloud server...
I could not find the cause of this issue on Google, but from the look of it, I messed up the path to bash, so every time I logged in, bash could not be found and I was disconnected.
Don't worry, you can still use virtualenv
If you look under the .pyenv/plugins directory, you can see a directory named pyenv-virtualenv. This is what we can use to create virtual environments.
To create a test environment:
pyenv virtualenv test # create a virtualenv named test based on the current local or system version of python
Since we have already set our local version of Python to 2.7.10, this test env inherits from that. But if you did not set the local version, the test env will inherit from the system version of Python, which may not be what you want. (Note: while running the command above, the 'virtualenv' package is installed to your local version of Python.)
You can also specify which version of Python to use (by putting a version number after 'pyenv virtualenv'):
pyenv virtualenv 2.6.6 test # use version 2.6.6 for this environment
Once the test environment has been set up, you can verify it with:
pyenv versions
You should see something similar to this:
system
* 2.7.10 (set by /.python-version)
test
You can switch to the test env by:
pyenv activate test
To deactivate:
pyenv deactivate
To list virtualenvs:
pyenv virtualenvs
To delete existing virtualenvs:
pyenv uninstall my-virtual-env
If you have seen pyenv-virtualenv's github page, you have probably noticed something named pyenv-virtualenvwrapper. Don't install it unless you know what it is. I don't have the time to figure out what it is ATM. But it is definitely NOT the virtualenv + virtualenvwrapper combination you are familiar with. So be cautious!
After updating pip using pip install -U pip, pip can no longer be found:
$ which pip
/usr/local/bin/pip
$ pip
-su: /usr/bin/pip: No such file or directory
$ type pip
pip is hashed (/usr/bin/pip)
So pip is definitely at /usr/local/bin/pip, but the shell has cached its location as /usr/bin/pip. Thanks to the Stack Overflow question, the solution is very simple:
$ hash -r
Once the cache is cleared, pip works again.
I have found a tool that allows you to write and test JavaScript quickly in IPython Notebook. The tool is called IJavascript. The project page has a very detailed installation guide for different operating systems.
To install it on Mac:
1) Install ijavascript itself via npm:
npm install -g ijavascript
2) Install the dependencies via pip:
sudo pip install --upgrade ipython jinja2 jsonschema pyzmq
3) Run it:
ipython ijs
Here is a screenshot of it running:
UPDATE: A video tutorial of this post has been created by webucator. They also offer a set of Python online training classes.
Thanks to this SO post, here are the ways to shallow and deep copy lists in Python:
Shallow Copy (from fastest to slowest)
new_list = old_list[:]
new_list = []
new_list.extend(old_list)
new_list = list(old_list)
import copy
new_list = copy.copy(old_list)
new_list = [i for i in old_list]
new_list = []
for i in old_list:
    new_list.append(i)
Deepcopy (the slowest)
import copy
new_list = copy.deepcopy(old_list)
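The difference between the two kinds of copy only shows up when the list contains mutable objects. A minimal sketch with a nested list (the sample values are arbitrary):

```python
import copy

old_list = [[1, 2], [3, 4]]

shallow = old_list[:]             # new outer list, same inner lists
deep = copy.deepcopy(old_list)    # inner lists are copied as well

old_list[0].append(99)            # mutate a nested list in place

print(shallow[0])
print(deep[0])
```

The shallow copy sees the appended value because it shares the inner lists with the original, while the deep copy does not.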
There are some pitfalls when you need to create and log in users manually in Django. Let's create a user first:
def view_handler(request):
    username = request.POST.get('username', None)
    password = request.POST.get('password', None)
Note that request.POST.get('username', None) should be used instead of request.POST['username']. If the latter is used and the key is missing from the request, you will get this error:
MultiValueDictKeyError
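MultiValueDictKeyError is a subclass of KeyError, so the difference can be reproduced without Django. In this sketch a plain dict stands in for request.POST, which is an assumption made only to keep the example runnable on its own.

```python
post_data = {'username': 'john'}   # stand-in for request.POST

# .get() returns the default when the key is missing.
password = post_data.get('password', None)
print(password)

# Indexing raises KeyError for a missing key.
try:
    post_data['password']
except KeyError as exc:
    missing_key = exc.args[0]
    print("missing key: {0}".format(missing_key))
```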
Once the username and password are extracted, let's create the user:
User.objects.create(username=username, password=password, email=email) # DON'T DO THIS
The above code is wrong, because when create is used instead of create_user, the user's password is not hashed. You will see the user's password stored in clear text in the database, which is not the right thing to do.
So you should use the following instead:
User.objects.create_user(username=username, password=password, email=None)
What if you want to check whether the user you are about to create already exists:
user, created = User.objects.get_or_create(username=username, email=None)
if created:
    user.set_password(password) # This line will hash the password
    user.save() # DO NOT FORGET THIS LINE
get_or_create will get the existing user or create a new one. Two values are returned: a user object and a boolean flag, created, indicating whether the user is a new one (created = True) or an existing one (created = False).
It is important not to forget user.save() at the end, because set_password does NOT save the password to the database.
Login
Now that a user has been created successfully, the next step is to log in.
user = authenticate(username=email, password=password)
login(request, user)
Besides checking the credentials, authenticate() sets user.backend to the authentication backend that accepted them. So when the credentials are valid, the code above is equivalent to:
user.backend = 'django.contrib.auth.backends.ModelBackend'
login(request, user)
Django's documentation recommends the first way of doing it. However, there is a use case for the second approach: when you want to log in a user without a password:
username = request.GET.get('username')
user = User.objects.get(username=username)
user.backend = 'django.contrib.auth.backends.ModelBackend'
login(request, user)
This is used when security isn't an issue but you still want to distinguish between who's who on your site.
So to sum up the code above, here is the view_handler that manually creates and logs in a user:
from django.contrib.auth import authenticate, login
from django.contrib.auth.models import User
from django.http import HttpResponseRedirect

def view_handler(request, *args, **kwargs):
    email = request.POST.get('email', None)
    password = request.POST.get('password', None)
    if email and password:
        # The email doubles as the username
        user, created = User.objects.get_or_create(username=email,
                                                   email=email)
        if created:
            user.set_password(password)  # hashes the password
            user.save()
        user = authenticate(username=email, password=password)
        login(request, user)
        return HttpResponseRedirect('where_ever_should_be_redirect_to')
    else:
        # return error or redirect to login page again
        pass
When writing a Django project, it often happens that multiple apps are included. Let me use an example:
Project
- Account
- Journal
In this example, I created a Django project that contains two apps. The Account app handles user registration and login. The Journal app allows users to write journals and save them to the database. Here is what the urls look like:
#ROOT_URLCONF
urlpatterns = [
url(r'^account/', include('Account.urls', namespace='account')),
url(r'^journal/', include('Journal.urls', namespace='journal')), #This namespace name is used later, so just remember we have given everything under journal/ a name
]
The file above is what the ROOT_URLCONF points to. Inside the Journal app, the urls look like this:
urlpatterns = [
url(r'^(?P<id>[0-9]{4})/$', FormView.as_view(), name = 'detail'),
]
So each journal has a 4-digit id. When a journal is accessed, its URL may look like this: www.mynote.com/journal/1231/
Let's say user John bookmarked a journal written by another person. He wants to comment on it. When John tries to access that journal at www.mynote.com/journal/1231/, he is redirected to the login page. In the login page's view handler, a redirect should be made to Journal ID 1231 once authentication is passed:
def view_handler(request):
    # authentication passed
    return redirect(reverse('detail', kwargs={'id': '1231'}))
The reverse(...) call is not going to work in this case. The view_handler belongs to the Account app, while the detail URL lives in the Journal app, whose URLs were included under the journal namespace, so the bare name 'detail' cannot be resolved. To be able to redirect to the detail page of the Journal app:
reverse('journal:detail', kwargs={'id': '1231'})
So the format for reversing URLs that belong to other apps is:
reverse('namespace:name', args=args, kwargs=kwargs)